Setting up Change Data Capture (CDC)

As mentioned in my post Auditing Data Access in SQL Server for GDPR Compliance CDC can be a useful tool for capturing a change history for specific data – which is something we might be thinking about a bit more now the GDPR is heading our way.

It’s pretty straightforward to set up, and can be done easily against existing tables/data – which is handy.

As mentioned in the previous post though, you do need to be on Enterprise Edition or SQL 2016 SP1 or later to use it.

Let’s take a quick look at how you go about setting it up.

First let’s create a table that has some data we want to track:

IF OBJECT_ID('dbo.CDC') IS NOT NULL DROP TABLE dbo.CDC;
CREATE TABLE dbo.CDC 
(
    Id INT IDENTITY(1,1) CONSTRAINT PK_CDC PRIMARY KEY CLUSTERED, 
    SomeText VARCHAR(1000), 
    ModifiedBy VARCHAR(128), 
    ModifiedDate DATETIME
);

Next you have to enable CDC for the database:

EXEC sys.sp_cdc_enable_db;

Then you have to tell SQL to track changes for the specified table:

EXEC sys.sp_cdc_enable_table 
   @source_schema = N'dbo', 
   @source_name = N'CDC', 
   @role_name = NULL;

Note the final parameter @role_name in the above stored procedure. This allows you to specify a database role that will be used to limit access the change history. I’ve just set to this to Null which means that there is no additional role required. There are also other parameters you can specify which you will find defined here:
https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sys-sp-cdc-enable-table-transact-sql

Once you’ve followed those few steps CDC is up and running. The change table can be found under system tables in your database, in the CDC schema, and has the name of the original table with “_CT” appended:

CDC_ChangeTable

You can see some additional columns. Rather than capturing the time of the change we have the Start LSN which is the Log Sequence Number associated with the change. This isn’t always that useful to us, which is why is it important for you to capture the time yourself – i.e. the ModifiedDate column.

Also of interest to us is the _$operation column which tell us what operation was logged:
1 = delete
2 = insert
3 = update (old values)
4 = update (new values)

As you can infer from the above, when we do an update we will have a record for both the before and after value.

_$update_mask could be of use to us in some scenarios as it contains a bit mask based on the column ordinals of the table to show us which columns have changed.

Let’s quickly insert and update some data so we can see an example of what we get in the change table:

INSERT INTO dbo.CDC(SomeText, ModifiedBy, ModifiedDate)
SELECT 'SomeText', 'Matt', GETDATE();

UPDATE dbo.CDC 
SET 
SomeText = 'ANewValue', 
ModifiedBy = 'StillMatt', 
ModifiedDate = GETDATE();

SELECT * FROM cdc.dbo_CDC_CT;

Here’s the output from the select:

CDC_ChangeTable_data

You can see we have one row showing the initial insert (_$operation = 2), one row showing the values before the update (_$operation = 3) and one row showing the values after the update (_$operation = 4).

If you run this test yourself you might find it takes a few seconds for the changes to populate as there can be a lag.

So how does it work?

CDC works by parsing your transaction log. It looks for changes relating to the tables is it enabled for and then writes those changes off to the change tracking table. Doing things this way means that it doesn’t have to be a synchronous operation, i.e. it can process the transaction log well after your transaction has committed. So your DML (update/insert/delete) operation is not delayed.

The downside of that is that it does have to go through the whole log (though just the part since it last ran) and work out what has occurred, when we look at performance we’ll see that this seems to create additional work over other synchronous change capture operations. However that work happens in the background and can be limited by the settings you use to configure the CDC jobs, so it may be acceptable or even preferable to a direct increase in the transaction time for your data modifications. Like anything in life it’s a trade-off and you’ll need to decide what works best for you.

Other issues can be around log activity and size. Implementing CDC means that you will get increased read activity on your I/O subsystem where the log is stored for the given database(s). So if implementing CDC it will be good to monitor this to make sure your disks can cope. Also the logfiles themselves are likely to get bigger. This latter point is for two reasons, one is that the CDC capture is itself a logged process, the other is that operations that would normally truncate the log or allow space to be re-used cannot do so until CDC has processed that section of the log. So if latency builds up in the CDC capture operations log entries are likely to be retained for longer.

CDC is actually managed by a couple of SQL Agent jobs. Two for each database that has CDC enabled. It’s useful to understand these for a few reasons. First of all, obviously if SQL Agent is not enabled then CDC is not going to work. Also you don’t want someone wondering what these jobs are and disabling them. Finally you may find you want to perform additional configuration – either for performance or data retention.

The jobs are:

cdc.{DatabaseName}_capture

This does the actual work by calling a series of system stored procedures that do the heavy lifting. The job doesn’t have a schedule, rather it is set running when you enable CDC and then generally continues based on the configuration.

There are 4 parameters that can control the capture:

@continuous – This is a bit value. 0 specifies that the capture will run once and then the job will terminate – this is only really useful in testing scenarios. 1 means that the job will run continuously which is the normal operation.

@maxtrans – This determines how many transactions will be read from the log and written to the change tables. This process is referred to as a scan cycle. Note this is done in a single transaction.

@maxscans – This sets how many scan cycles will be carried out before the job pauses.

@pollinginterval – This is the amount of seconds the job will pause before running through the specified number of scan cycles once more.

When you enable CDC these parameters get set to defaults – @continous = 1, @maxtrans = 500, @maxscans = 10 and @pollinginterval = 5.

You can change any of these setting using the sys.sp_cdc_change_job stored procedure. Changes will be made for the job that relates to the current database context. You need to specify that you want to change the capture job, and new values for the parameters you want to change.

So, if I wanted to change @maxtrans to 1000 and increase the polling interval to 10 seconds I would run the following:

EXECUTE sys.sp_cdc_change_job   
    @job_type = N'capture',  
    @maxscans = 1000,  
    @pollinginterval = 10;

The job must be restarted before the new settings take effect. You can do that using two more system stored procedures:

EXEC sys.sp_cdc_stop_job @job_type = 'capture';
EXEC sys.sp_cdc_start_job @job_type = 'capture';

Here is a Microsoft Whitepaper that goes into detail on tuning CDC using these parameters:
https://technet.microsoft.com/en-us/library/dd266396(v=sql.100).aspx

cdc.{DatabaseName]_cleanup

This is the other job involved in CDC that manages retention of the change data for you. If you are monitoring data that changes a lot then the change table can get large quickly so you need to consider this. This job runs on a schedule – the default is to execute every morning at 2AM. You can change this in the SQL Agent job itself.

There are two additional parameters that control this job:

@retention – the number of minutes that change rows are retained. The default is 4320 which is only 72 hours so if you are using this for audit then you need to up this considerably. The maximum is 100 years.

@threshold – the maximum number of rows that will be deleted in a single statement on cleanup

You configure these using the same stored procedure as before, just specifying the cleanup job. So if I want to change the retention to 1 year I’d call:

EXECUTE sys.sp_cdc_change_job   
    @job_type = N'cleanup',  
    @retention = 525600;

Again you need to stop and start the job before changes take effect.

If you want to check the current parameters for either job then the easiest way is to query the system table msdb.dbo.cdc_jobs:

EXECUTE sys.sp_cdc_change_job   
    @job_type = N'cleanup',  
    @retention = 525600;

 

And that’s about it. There is of course deeper you can go to understand CDC, but this should give you a good grounding to get started in confidence.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s