Implementing Temporal Tables Where You Have Existing Data

October 24, 2017October 24, 2017 ~ Matthew McGiffen ~ 10 Comments

In my post about auditing tools in SQL Server I mentioned a few tools we have at our disposal for capturing the change history of data.

One was Change Data Capture which I want into a bit more detail about in Setting up Change Data Capture (CDC)

I also referred to Temporal Tables which are available to us in the brave new world of SQL 2016.

Temporal tables aren’t just about monitoring change, they also provide really nice methods for being able to query historical data – to see what the values were at a particular point in time e.g.

SELECT * FROM dbo.SomeData FOR SYSTEM_TIME AS OF '1900-01-01';

This makes them easy to work with, but also opens up other applications for them over and above auditing purposes – for instance being able to maintain configuration data that changes over time, but where we need to reference what the value was historically.

I wanted to look in more detail about the process for changing an existing table so that it becomes a temporal one.

Let’s start by quickly looking at how we would define a temporal table for a fresh deployment:

CREATE TABLE dbo.SomeData(
   Id INT IDENTITY(1,1) NOT NULL
      CONSTRAINT PK_SomeData PRIMARY KEY CLUSTERED,
   SomeText VARCHAR(1000) NULL,
   ModifiedBy VARCHAR(128) NULL,
   StartDate datetime2(7) GENERATED ALWAYS AS ROW START NOT NULL,
   EndDate datetime2(7) GENERATED ALWAYS AS ROW END NOT NULL,
   PERIOD FOR SYSTEM_TIME (StartDate, EndDate)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.SomeData_History));

There are a few things here additional to a normal table definition. First thing to note is we need two columns to support the Temporal versioning. I’ve called my columns StartDate and EndDate, these both have to be of DATETIME2 datatype to the highest precision (100 nanoseconds). These are specified as GENERATED ALWAYS AS ROW START\END so SQL will take care of assigning values when new rows are inserted or updated.

I’ve specified the PERIOD FOR SYSTEM TIME, so SQL knows that these are the fields to use for the temporal versioning.

Finally I’ve specified SYSTEM_VERSIONING for the table and what the history table should be called.

When I execute this you can find the table and corresponding history table displayed in SSMS as below:
Tempopral1
We have the table itself, and also the history table has been created for us. This is where the previous versions of rows of data will be stored.

So that’s pretty straightforward – but what if you want to change an existing table?

First of all let’s drop this one:

DROP TABLE dbo.SomeData;

Unfortunately that gives me an error message:
Msg 13552, Level 16, State 1, Line 90
Drop table operation failed on table ‘TestAudit.dbo.SomeData’ because it is not supported operation on system-versioned temporal tables.

As the message says, you can’t just drop Temporal tables. First you must turn system versioning off, then you can go ahead:

ALTER TABLE dbo.SomeData SET (SYSTEM_VERSIONING = OFF); 
DROP TABLE dbo.SomeData;

Dropping the table though leaves the History table behind – now seen as a normal table: Tempopral2

That’s handy as it means you don’t lose the history when the main table is deleted. For now though let’s just delete that as well:

DROP TABLE dbo.SomeData_History;

Now let’s create a new table that will be our “existing table” we want to change. For this example I’ll create a test table with a trigger that maintains a modified date, and throw in 1000 rows:

CREATE TABLE dbo.SomeData(
   Id INT IDENTITY(1,1) 
      CONSTRAINT PK_SomeData PRIMARY KEY CLUSTERED NOT NULL,
   SomeText VARCHAR(1000) NULL,
   ModifiedBy VARCHAR(128) NULL,
   ModifiedDate DATETIME NOT NULL 
      CONSTRAINT DF_SomeData_ModifiedDate DEFAULT (GETUTCDATE())
);
GO

CREATE TRIGGER dbo.TR_SomeData_ModifiedDate
   ON dbo.SomeData 
   AFTER UPDATE
AS
BEGIN
   UPDATE dbo.SomeData 
   SET ModifiedDate = GETUTCDATE()
   FROM dbo.SomeData 
   INNER JOIN Inserted
      ON dbo.SomeData.Id = Inserted.Id;
END;
GO

INSERT INTO dbo.SomeData(SomeText, ModifiedBy)
SELECT TOP 1000 'Blah', 'Me'
FROM sys.objects a CROSS JOIN sys.objects b;

I’m going to complicate things and say that when I turn this into a temporal table I want to keep the existing ModifiedDate column and use that as the start date for my system versioning.

The first thing you might try is to alter the column:

ALTER TABLE dbo.SomeData
ALTER COLUMN ModifiedDate DATETIME GENERATED ALWAYS AS ROW START NOT NULL;

Error:
Msg 13589, Level 16, State 1, Line 44
Column ‘ModifiedDate’ in table ‘TestAudit.dbo.SomeData’ cannot be specified as ‘GENERATED ALWAYS’ in ALTER COLUMN statement.

So I can’t do this in an alter statement. What I’ll actually have to do is create a new column specified to work with the system versioning, copy the data over, drop the old column and rename the new one.

First let’s create the additional columns and define the PERIOD FOR SYSTEM_TIME:

ALTER TABLE dbo.SomeData
ADD 
StartDate datetime2 GENERATED ALWAYS AS ROW start NOT NULL
   CONSTRAINT DF_SomeData_StartDate DEFAULT (GETUTCDATE()),
EndDate datetime2 GENERATED ALWAYS AS ROW END NOT NULL
    CONSTRAINT DF_SomeData_EndDate DEFAULT ('9999-12-31 23:59:59.9999999'),
PERIOD FOR SYSTEM_TIME (StartDate, EndDate);

Note that I’ve had to specify defaults for the new columns so that the existing data gets values populated (I didn’t have to do that when I simply defined the Temporal Table from scratch).

For the StartDate I’ve specified the current UTC date – this could be any date and time that’s not in the future, though note it should be a UTC one. If I’d tried to use GETDATE, as I’m currently on British Summer Time, I would get the following error:
Msg 13542, Level 16, State 0, Line 51
ADD PERIOD FOR SYSTEM_TIME on table ‘TestAudit.dbo.SomeData’ failed because there are open records with start of period set to a value in the future.

For the end date I have put the maximum value a DATETIME2 column can hold, this is required and is a way of saying that this record is (at the moment) valid indefinitely. If I try to put in an earlier date I receive this error:
Msg 13575, Level 16, State 0, Line 51
ADD PERIOD FOR SYSTEM_TIME failed because table ‘TestAudit.dbo.SomeData’ contains records where end of period is not equal to MAX datetime.

Now I’ve got the columns I could try copying my data over:

UPDATE dbo.SomeData SET StartDate = ModifiedDate;

I get another error:
Msg 13537, Level 16, State 1, Line 70
Cannot update GENERATED ALWAYS columns in table ‘TestAudit.dbo.SomeData’.

It turns out you can get around this by temporarily dropping the period for system time:

ALTER TABLE dbo.SomeData DROP PERIOD FOR SYSTEM_TIME;
UPDATE dbo.SomeData SET StartDate = ModifiedDate;
ALTER TABLE dbo.SomeData ADD PERIOD FOR SYSTEM_TIME (StartDate, EndDate);

Your logical thought might be here that I didn’t need to create the period for system time when I altered the table to add the new columns. I could have just added the columns, migrated the data, then only added the period afterwards. Trust me, I tried that too:
Msg 13509, Level 16, State 1, Line 51
Cannot create generated always column when SYSTEM_TIME period is not defined.

From here it turns out to be reasonable clear sailing.

Let’s run the final command to make the table Temporal:

ALTER TABLE dbo.SomeData 
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.SomeData_History));

Here’s what it looks like in SSMS:
Tempopral3
You can see this is the same as the original fresh creation of a temporal table. The history table has been created for us. Just with an additional column in both tables.

Let’s remove the extra column and do the final rename:

ALTER TABLE dbo.SomeData DROP CONSTRAINT DF_SomeData_ModifiedDate;
ALTER TABLE dbo.SomeData DROP COLUMN ModifiedDate;
EXEC sp_rename 'dbo.SomeData.StartDate','ModifiedDate','COLUMN';

I was amazed the sp_rename didn’t give me an error. In fact if we look at the table again:
Tempopral4
You can see the column has been renamed in both the main table and the history table. Pretty neat!

I have at this stage forgotten to do one thing. Let’s just try to update a record and see what happens:

UPDATE dbo.SomeData SET SomeText = 'Wibble' WHERE Id = 2;

Msg 13537, Level 16, State 1, Procedure TR_SomeData_ModifiedDate, Line 7 [Batch Start Line 185]
Cannot update GENERATED ALWAYS columns in table ‘TestAudit.dbo.SomeData’.
The statement has been terminated.

Yep, don’t forget to remove your trigger!

DROP TRIGGER dbo.TR_SomeData_ModifiedDate;

And now we’re all done.

What if I use SQL Server Data Tools (SSDT) to make my change?

If you’re working with Database Projects in Visual Studio then you will be used to publishing changes straight from there – or deploying them using a DACPAC file.

In that case SSDT handles all this for you – though it does use a slightly different method. Rather than creating a new column, copying that data over, dropping the old column and renaming the new one, it does it at the table level.
That’s the same way some other changes are made with SSDT (or if you do them through the GUI in SSMS). The steps are broadly the same, just with tables instead of columns:

1) Create new temporal table
2) Turn SYSTEM_VERSIONING off
3) Copy across the data from the old table to the new
4) Turn SYSTEM_VERSIONING back on
5) Drop the old table
6) Rename the new table to the original name

If you’re doing this by “hand”, you could do it this way if you prefer, but it is probably a heavier operation that will require more down time. For instance you’ll need the whole table to be locked while the operation progresses as you don’t want to risk data changes once you’ve started copying the data over. The column method will also lock the table, but potentially there is less to update\copy so it should hopefully be quicker.

As a final point I should note that you may run into problems dropping the trigger with SSDT, it can be a little quirky around doing that. For the moment I’ll leave off from going into that in more detail – but if you get that issue feel free to give me a yell!

Bug with STATISTICS TIME?

October 17, 2017October 17, 2017 ~ Matthew McGiffen ~ Leave a comment

I love the STATISTICS IO and STATISTICS TIME commands. They are such a powerful and easy way to be able to measure improvements when performance tuning (Measuring SQL Query Performance).

You have to be a little bit wary though, there are a few places where they don’t report figures correctly such as this bug (fixed from 2014):
https://connect.microsoft.com/SQLServer/feedback/details/767250/statistics-io-under-reports-logical-reads-for-parallel-plans

Also anything to do with a MSTVF (multi-statement table value function) is likely to be reported incorrectly (another reason not to use such functions!).

Today I was doing some testing to compare different methods of auditing data, and I came up against another (bug?) I hadn’t seen before. So I thought I’d share.

This is running on SQL 2016 – I haven’t checked it on earlier versions.

I’ll create two identical tables for comparison, identical except one has a trigger against it that inserts changes to an audit table.

CREATE DATABASE StatisticsBug;
USE StatisticsBug;

CREATE TABLE VanillaTable
(
Id INT IDENTITY(1,1) CONSTRAINT PK_VanillaTable PRIMARY KEY CLUSTERED,
SomeText VARCHAR(1000),
ModifiedBy VARCHAR(128),
ModifiedDate DATETIME
);

CREATE TABLE TableWithTrigger
(
Id INT IDENTITY(1,1) CONSTRAINT PK_TableWithTrigger PRIMARY KEY CLUSTERED,
SomeText VARCHAR(1000),
ModifiedBy VARCHAR(128),
ModifiedDate DATETIME
);

CREATE TABLE TableWithTrigger_Audit
(
Id INT,
SomeText VARCHAR(1000),
ModifiedBy VARCHAR(128),
ModifiedDate DATETIME,
AuditId INT IDENTITY(1,1) CONSTRAINT PK_TableWithTrigger_Audit PRIMARY KEY CLUSTERED,
AuditDate DATETIME
);

CREATE TRIGGER TableWithTrigger_trigger
ON dbo.TableWithTrigger
after INSERT, UPDATE, DELETE
AS
BEGIN
INSERT INTO dbo.TableWithTrigger_Audit(Id, SomeText, ModifiedBy, ModifiedDate, AuditDate)
SELECT Id, SomeText, ModifiedBy, ModifiedDate, GETDATE()
FROM deleted;

INSERT INTO dbo.TableWithTrigger_Audit(Id, SomeText, ModifiedBy, ModifiedDate, AuditDate)
SELECT Id, SomeText, ModifiedBy, ModifiedDate, GETDATE()
FROM inserted;
END;

Now I’m going to insert a bunch of rows into each table and capture the CPU output so we can compare. Obviously I expect the one with the trigger to have to do more work, I’m just demonstrating both to show that the first example works correctly, but the second one seems a bit more dubious.

First the table without the trigger:

--Insert a million rows into table without trigger, capture CPU
SET STATISTICS time ON;

INSERT INTO dbo.VanillaTable(SomeText, ModifiedBy, ModifiedDate)
SELECT TOP 1000000  'SomeTextSomeTextSomeTextSomeTextSomeTextSomeTextSomeTextSomeTextSomeTextSomeText', 'Matt', GETDATE()
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d;

Let’s look at the output:
SQL Server parse and compile time:
CPU time = 31 ms, elapsed time = 64 ms.
SQL Server Execution Times:
CPU time = 4766 ms, elapsed time = 5596 ms.

That all seems quite reasonable and what we’d expect to see.

Now let’s repeat the same test for our table with a trigger:

--Insert a million rows into table with trigger, capture CPU
SET STATISTICS time ON;

INSERT INTO dbo.TableWithTrigger(SomeText, ModifiedBy, ModifiedDate)
SELECT TOP 1000000  'SomeTextSomeTextSomeTextSomeTextSomeTextSomeTextSomeTextSomeTextSomeTextSomeText', 'Matt', GETDATE()
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d

And look again at the CPU:
SQL Server parse and compile time:
CPU time = 31 ms, elapsed time = 31 ms.
SQL Server parse and compile time:
CPU time = 6031 ms, elapsed time = 8782 ms.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server Execution Times:
CPU time = 7266 ms, elapsed time = 9205 ms.

SQL Server Execution Times:
CPU time = 13297 ms, elapsed time = 17992 ms.

Overall the execution takes over twice as long, which I expect as – due to my trigger – I’m now inserting to two tables. This is represented by the last entry which is the total execution time.

I want to draw your attention though to the second set of parse and compile time figures (highlighted in bold). These seem to be saying that to compile the query took over 6 seconds of CPU. I find this highly suspect.

This is also highlighted by the fact that the final execution time is actually the sum of 6031 (which is reporting as parse and compile) and 7266.
So my belief was that the output was incorrectly reporting execution time as compile time.

I captured the actual execution plan for the query to check the compile time. In the plan XML you can find the element CompileCPU (measured in ms).

As there are three statements covered by the plan (the initial insert plus two more in the trigger) I find three values:

Statement 1:

CompileCPU1

Statement 2:

CompileCPU2

Statement 3:

CompileCPU3

28ms, 1ms and 1ms for a total of 30ms. That’s actually pretty close to the initial figure reported in the statistics output, any difference is probably just rounding:
SQL Server parse and compile time:
CPU time = 31 ms, elapsed time = 31 ms.

That seems double confirmation to me that the output should be reporting the 6031ms and 8782ms not as CPU and Elapsed for Parse and Compile, but rather CPU and Elapsed for Execution.

I’d be interested if anyone has any other explanation of this, or to know if you can repeat it. With a bit more confirmation I’ll raise a Connect item.

Thanks for reading!

Avoid “Constants” in T-SQL

October 10, 2017October 10, 2017 ~ Matthew McGiffen ~ 13 Comments

Unlike some other languages, T-SQL doesn’t have the concept of a constant.

As good coders, we’ve all at some point tried to use a SQL variable to hold a constant value within a stored procedure, in order to make our code both more readable and maintainable.

I’ll give you an example. Here I’m querying a Task table, and I want all the rows where the TaskStatus is 0 (zero) which means the Task is Open:

SELECT *
FROM dbo.Task 
WHERE TaskStatus = 0;

If someone else comes along to look at this code they don’t know what the value of zero means. My code’s not clear. Worse I might refer to the same value multiple times in my procedure, so if I need to change it later I have to change it in multiple places.

Good practice from other coding languages would say that I replace it with a meaningfully named constant. As mentioned, in T-SQL we don’t have constants so I’ll compromise and use a variable:

DECLARE @OpenTask tinyint = 0;

SELECT *
FROM dbo.Task 
WHERE  TaskStatus = @OpenTask;

Now that’s much more readable – right?

Unfortunately it’s also a bad idea in SQL . Let’s see why.

I’ll create the aforementioned Task table, add an index on TaskStatus. Then I’ll add million rows with status 1 (which we’ll call closed) and 1 row with the value 0 (zero) which is open:

CREATE TABLE dbo.Task
(
    Id INT IDENTITY(1,1) CONSTRAINT PK_Task PRIMARY KEY CLUSTERED,
    UserId INT,
    TaskType INT,
    Payload VARCHAR(255) NOT NULL,
    TaskStatus tinyint NOT NULL
);
GO

CREATE INDEX IX_Task_TaskStatus ON dbo.Task(TaskStatus);

INSERT INTO dbo.Task (UserId,TaskType,Payload,TaskStatus)
SELECT TOP 1000000 1,1,'This Shizzle Is Done',1
FROM sys.objects a, sys.objects b, sys.objects c;

INSERT INTO dbo.Task (UserId,TaskType,Payload,TaskStatus)
SELECT 1,1,'Do This Shizzle',0;

Once that’s completed I’m going to update the statistics just so we know SQL has the most up to date information to produce an optimal execution plan for our queries:

UPDATE STATISTICS dbo.Task WITH fullscan;

Now let’s go back to our original queries. Before I run them let’s think what we want them to do. We have an index on TaskStatus and we only have one row we are looking for, so we’d hope the query will use the index and go straight to the record. The index doesn’t contain all the columns, but that’s okay. We’re only going to have to output one record so if it has to look up the extra columns up in the clustered index that’ll be pretty damn quick.

Let’s run the first query, we’ll capture the execution plan and the STATISTICS output:

SET STATISTICS io ON;
SET STATISTICS time ON;

SELECT *
FROM dbo.Task 
WHERE TaskStatus = 0;

Here’s the execution plan:
Constants1

That’s doing exactly what we hoped for, it’s looked up the record in our index using a seek. Then it’s grabbed the rest of the columns from the clustered index using a key lookup.

Here’s the statistics output:
Table ‘Task’. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.

So that’s a nice small number of reads and less than a millisecond of CPU.

Now let’s run the “improved” version:

SET STATISTICS io ON;
SET STATISTICS time ON;

DECLARE @OpenTask tinyint = 0;

SELECT *
FROM dbo.Task 
WHERE TaskStatus = @OpenTask;

Here’s the execution plan this time:

Constants2

That doesn’t look so good. Let’s check the statistics:
Table ‘Task’. Scan count 1, logical reads 5341, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

CPU time = 109 ms, elapsed time = 96 ms.

Those figures tell us the query has got between 100 and 1,000 times worse. So much for the improved version.

So why is this happening?

The answer is simply that the optimizer doesn’t/can’t look at the values inside variables when a piece of SQL is compiled. Therefore it can’t use the statistics against the indexes on the table to get an accurate idea of how many rows to expect back in the results.

We can see that if we compare the properties of the Index Seek Operator from the first query:

Constants3

Against the properties for the Index Scan Operator from the second query:

Constants4

In the first one we can see that the Actual Number of Rows (at the top) exactly matches the Estimated Number of rows (at the bottom). SQL has been able to use the statistics on the index to get an accurate estimate.

In the second this is not the case. We have 500,000 rows estimate, but only 1 actual. This has led SQL down the route of choosing a plan that would have been more effective for 500,000 rows – but is much less effective for 1. In this case it didn’t know what value to optimize for. Lacking that information it used the density value in the statistics and multiplied that by the total number of rows to get the estimate. Or in other words, the statistics tell it that there are two distinct values (0 and 1) in the table. Not knowing which one has been supplied the optimizer figures than on average half the rows will be returned.

So what should do you to make your code clearer?

The simple answer is to use comments, the following is totally clear to its meaning, and will perform optimally:

SELECT *
FROM dbo.Task 
WHERE TaskStatus = 0 -- Open Task;

But what about the maintainability issue, where you may have to refer to the same value multiple times in a given procedure?

Unfortunately you’re just going to have to put up with maintaining the value in multiple places, but in general within a well designed application these should be static values (and hopefully integers) so it shouldn’t be too big a deal.

Note this is not the same for parameters passed to a stored procedure. In that case the queries inside the stored procedure will compile using the values passed the first time the procedure was executed – that can be really useful, but it can also cause its own set of issues to be aware of! Just remember parameters and variables – similar but not the same!

Setting up Change Data Capture (CDC)

October 3, 2017September 13, 2021 ~ Matthew McGiffen ~ 4 Comments

As mentioned in my post Auditing Data Access in SQL Server for GDPR Compliance CDC can be a useful tool for capturing a change history for specific data – which is something we might be thinking about a bit more now the GDPR is heading our way.

It’s pretty straightforward to set up, and can be done easily against existing tables/data – which is handy.

As mentioned in the previous post though, you do need to be on Enterprise Edition or SQL 2016 SP1 or later to use it.

Let’s take a quick look at how you go about setting it up.

First let’s create a table that has some data we want to track:

IF OBJECT_ID('dbo.CDC') IS NOT NULL DROP TABLE dbo.CDC;
CREATE TABLE dbo.CDC 
(
    Id INT IDENTITY(1,1) CONSTRAINT PK_CDC PRIMARY KEY CLUSTERED, 
    SomeText VARCHAR(1000), 
    ModifiedBy VARCHAR(128), 
    ModifiedDate DATETIME
);

Next you have to enable CDC for the database:

EXEC sys.sp_cdc_enable_db;

Then you have to tell SQL to track changes for the specified table:

EXEC sys.sp_cdc_enable_table 
   @source_schema = N'dbo', 
   @source_name = N'CDC', 
   @role_name = NULL;

Note the final parameter @role_name in the above stored procedure. This allows you to specify a database role that will be used to limit access the change history. I’ve just set to this to Null which means that there is no additional role required. There are also other parameters you can specify which you will find defined here:
https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sys-sp-cdc-enable-table-transact-sql

Once you’ve followed those few steps CDC is up and running. The change table can be found under system tables in your database, in the CDC schema, and has the name of the original table with “_CT” appended:

CDC_ChangeTable

You can see some additional columns. Rather than capturing the time of the change we have the Start LSN which is the Log Sequence Number associated with the change. This isn’t always that useful to us, which is why is it important for you to capture the time yourself – i.e. the ModifiedDate column.

Also of interest to us is the _$operation column which tell us what operation was logged:
1 = delete
2 = insert
3 = update (old values)
4 = update (new values)

As you can infer from the above, when we do an update we will have a record for both the before and after value.

_$update_mask could be of use to us in some scenarios as it contains a bit mask based on the column ordinals of the table to show us which columns have changed.

Let’s quickly insert and update some data so we can see an example of what we get in the change table:

INSERT INTO dbo.CDC(SomeText, ModifiedBy, ModifiedDate)
SELECT 'SomeText', 'Matt', GETDATE();

UPDATE dbo.CDC 
SET 
SomeText = 'ANewValue', 
ModifiedBy = 'StillMatt', 
ModifiedDate = GETDATE();

SELECT * FROM cdc.dbo_CDC_CT;

Here’s the output from the select:

CDC_ChangeTable_data

You can see we have one row showing the initial insert (_$operation = 2), one row showing the values before the update (_$operation = 3) and one row showing the values after the update (_$operation = 4).

If you run this test yourself you might find it takes a few seconds for the changes to populate as there can be a lag.

So how does it work?

CDC works by parsing your transaction log. It looks for changes relating to the tables is it enabled for and then writes those changes off to the change tracking table. Doing things this way means that it doesn’t have to be a synchronous operation, i.e. it can process the transaction log well after your transaction has committed. So your DML (update/insert/delete) operation is not delayed.

The downside of that is that it does have to go through the whole log (though just the part since it last ran) and work out what has occurred, when we look at performance we’ll see that this seems to create additional work over other synchronous change capture operations. However that work happens in the background and can be limited by the settings you use to configure the CDC jobs, so it may be acceptable or even preferable to a direct increase in the transaction time for your data modifications. Like anything in life it’s a trade-off and you’ll need to decide what works best for you.

Other issues can be around log activity and size. Implementing CDC means that you will get increased read activity on your I/O subsystem where the log is stored for the given database(s). So if implementing CDC it will be good to monitor this to make sure your disks can cope. Also the logfiles themselves are likely to get bigger. This latter point is for two reasons, one is that the CDC capture is itself a logged process, the other is that operations that would normally truncate the log or allow space to be re-used cannot do so until CDC has processed that section of the log. So if latency builds up in the CDC capture operations log entries are likely to be retained for longer.

CDC is actually managed by a couple of SQL Agent jobs. Two for each database that has CDC enabled. It’s useful to understand these for a few reasons. First of all, obviously if SQL Agent is not enabled then CDC is not going to work. Also you don’t want someone wondering what these jobs are and disabling them. Finally you may find you want to perform additional configuration – either for performance or data retention.

The jobs are:

cdc.{DatabaseName}_capture

This does the actual work by calling a series of system stored procedures that do the heavy lifting. The job doesn’t have a schedule, rather it is set running when you enable CDC and then generally continues based on the configuration.

There are 4 parameters that can control the capture:

@continuous – This is a bit value. 0 specifies that the capture will run once and then the job will terminate – this is only really useful in testing scenarios. 1 means that the job will run continuously which is the normal operation.

@maxtrans – This determines how many transactions will be read from the log and written to the change tables. This process is referred to as a scan cycle. Note this is done in a single transaction.

@maxscans – This sets how many scan cycles will be carried out before the job pauses.

@pollinginterval – This is the amount of seconds the job will pause before running through the specified number of scan cycles once more.

When you enable CDC these parameters get set to defaults – @continous = 1, @maxtrans = 500, @maxscans = 10 and @pollinginterval = 5.

You can change any of these setting using the sys.sp_cdc_change_job stored procedure. Changes will be made for the job that relates to the current database context. You need to specify that you want to change the capture job, and new values for the parameters you want to change.

So, if I wanted to change @maxtrans to 1000 and increase the polling interval to 10 seconds I would run the following:

EXECUTE sys.sp_cdc_change_job   
    @job_type = N'capture',  
    @maxscans = 1000,  
    @pollinginterval = 10;

The job must be restarted before the new settings take effect. You can do that using two more system stored procedures:

EXEC sys.sp_cdc_stop_job @job_type = 'capture';
EXEC sys.sp_cdc_start_job @job_type = 'capture';

Here is a Microsoft Whitepaper that goes into detail on tuning CDC using these parameters:
https://technet.microsoft.com/en-us/library/dd266396(v=sql.100).aspx

cdc.{DatabaseName]_cleanup

This is the other job involved in CDC that manages retention of the change data for you. If you are monitoring data that changes a lot then the change table can get large quickly so you need to consider this. This job runs on a schedule – the default is to execute every morning at 2AM. You can change this in the SQL Agent job itself.

There are two additional parameters that control this job:

@retention – the number of minutes that change rows are retained. The default is 4320 which is only 72 hours so if you are using this for audit then you need to up this considerably. The maximum is 100 years.

@threshold – the maximum number of rows that will be deleted in a single statement on cleanup

You configure these using the same stored procedure as before, just specifying the cleanup job. So if I want to change the retention to 1 year I’d call:

EXECUTE sys.sp_cdc_change_job   
    @job_type = N'cleanup',  
    @retention = 525600;

Again you need to stop and start the job before changes take effect.

If you want to check the current parameters for either job then the easiest way is to query the system table msdb.dbo.cdc_jobs.

And that’s about it. There is of course deeper you can go to understand CDC, but this should give you a good grounding to get started in confidence.

Matthew McGiffen Data

Month: October 2017