TDE and backup compression – still not working?

Until SQL Server 2016, if you used TDE (Transparent Data Encryption) you couldn't use backup compression.

In SQL Server 2016 Microsoft changed this, but it has been a rocky road. Backups appear to work fine, but in some circumstances people have found they are corrupt when they come to restore them.

We thought that was all fixed. As highlighted by Brent Ozar, Microsoft has been recommending that you are on at least SQL Server 2016 RTM CU6 or 2016 SP1 CU4, where these issues are resolved.

https://www.brentozar.com/archive/2017/09/breaking-news-using-tde-2016-backups-time-patch/

https://social.msdn.microsoft.com/Forums/sqlserver/en-US/dad3dfe3-c914-44f8-96b9-ab56cc825fe9/tde-corrupt-backups-when-using-backup-database-with-compression-maxtransfersize?forum=sqldatabaseengine

It seems like people are still having problems though. Ken Johnson is on 2016 SP1 CU6 and is still seeing the issue:

https://social.msdn.microsoft.com/Forums/sqlserver/en-US/dad3dfe3-c914-44f8-96b9-ab56cc825fe9/tde-corrupt-backups-when-using-backup-database-with-compression-maxtransfersize?forum=sqldatabaseengine

Our production DBA has just run a test backing up 20 databases from one server and restoring them to another, and about 10 of the restores failed – although in other environments this is working fine.

You can see the error if you try to verify the backup:

Date and time: 2018-02-16 09:52:10
Command: BACKUP DATABASE [XXXXX] TO DISK = 'XXXXX' WITH CHECKSUM, COMPRESSION, MAXTRANSFERSIZE = 131072
Processed 95488 pages for database ‘XXXXX’, file ‘XXXXX’ on file 1.
Processed 3 pages for database ‘XXXXX’, file ‘XXXXX’ on file 1.
BACKUP DATABASE successfully processed 95491 pages in 2.375 seconds (314.113 MB/sec).
Outcome: Succeeded
Duration: 00:00:02
Date and time: 2018-02-16 09:52:12

Date and time: 2018-02-16 09:52:12
Command: RESTORE VERIFYONLY FROM DISK = ‘XXXXX’ WITH CHECKSUM
Msg 3189, Level 16, State 1, Line 1
Damage to the backup set was detected.
Msg 3013, Level 16, State 1, Line 1
VERIFY DATABASE is terminating abnormally.
Outcome: Failed
Duration: 00:00:03
Date and time: 2018-02-16 09:52:15

And it certainly doesn’t then work if you try to restore it:

5 percent processed.
Msg 3183, Level 16, State 1, Line 2
RESTORE detected an error on page (1:7763) in database "XXXXXXX" as read from the backup set.
Msg 3013, Level 16, State 1, Line 2
RESTORE DATABASE is terminating abnormally.

We’re raising a call with Microsoft, will update when we know more.

In the meantime, can we trust backups of TDE databases with backup compression enabled? Given that we want to be absolutely sure our backups can save us in the event of a disaster, the answer may be no. At a minimum we need to test each time we update our applications or patch SQL Server.

The side lesson of course is that we should always be regularly checking that our backups are restorable.
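If you want to script that kind of check, here's a minimal sketch – the file paths, database name and logical file names are all illustrative, and for a TDE database remember the certificate must be available wherever you restore it:

-- Verify the backup file is readable and the page checksums are valid
RESTORE VERIFYONLY
FROM DISK = 'D:\Backups\MyDatabase.bak'
WITH CHECKSUM;

-- VERIFYONLY doesn't actually restore the data, so periodically restore to a scratch copy as the stronger test
RESTORE DATABASE MyDatabase_RestoreTest
FROM DISK = 'D:\Backups\MyDatabase.bak'
WITH MOVE 'MyDatabase' TO 'D:\RestoreTest\MyDatabase_RestoreTest.mdf',
     MOVE 'MyDatabase_log' TO 'D:\RestoreTest\MyDatabase_RestoreTest_log.ldf',
     CHECKSUM,
     STATS = 10;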

More articles on TDE

What is Transparent Data Encryption?

Setting up Transparent Data Encryption (TDE)

Encrypting an existing database with TDE

Understanding Keys and Certificates with Transparent Data Encryption (TDE)

How Secure is Transparent Data Encryption (TDE) – and How to Prevent Hacking

Migrating or Recovering a TDE protected Database

 

Avoiding confusion with SQLCMD variables

SQLCMD variables can be a useful way of having changeable parameters for your SQL scripts, allowing you to specify the value from a command line, but also to control things you couldn’t manage through a SQL variable.

They can also be a little confusing the first time you see them.

Here’s an example:

:SETVAR MySQLCMDVar "Wibble"
SELECT * FROM dbo.Test WHERE TextValue='$(MySQLCMDVar)';

If you just run this then you might get the error:
Msg 102, Level 15, State 1, Line 1
Incorrect syntax near ‘:’.

The important thing is to understand that when you see a colon at the beginning of a statement in your T-SQL script, then this is a SQLCMD statement, and to parse these in SSMS you need to have SQLCMD mode enabled. You can do this from the “Query” menu:
SQLCMD1
Now I execute my script again, and it runs fine – though it returns no results.
Let’s look at another quick example:

:SETVAR TableName "Test"
SELECT * FROM dbo.$(TableName) WHERE TextValue='Wibble';

Here I've used the SQLCMD variable to define the name of the table in my query. This is the power of SQLCMD variables: you couldn't do this with a normal SQL variable – the only way to achieve the same thing would be to use dynamic SQL instead.

Let’s understand a little deeper what they are and how they work.

First of all, they don't have a datatype – they are always text values. You can enclose the value in double quotes or not – I usually prefer to – although if it contains spaces or other special characters then the quotes are required.

You define them as follows:
:SETVAR SQLCMDVariableName "Whatever value you want"

And where you want to refer to them in your script you use a dollar sign and the variable name in brackets:
$(SQLCMDVariableName)

Rather than being a conventional form of variables, SQLCMD variables are actually tags for text replacement. It's handy to understand this as it leads to some strange behaviours. What happens when you run a query with SQLCMD mode enabled is that first of all the script is parsed and any SQLCMD statements are processed.

In the case of SQLCMD variables, first all the :SETVAR statements in the script are processed and each variable is assigned the correct value. Then all the references to each variable in the script are replaced with the literal value. It is then this modified version of your script (which you never get to see) that gets executed.

That’s why something like this doesn’t work:

:SETVAR TextVal "Hello There"
DECLARE @TextVal varchar(30);

SET @TextVal = $(TextVal);
SELECT @TextVal;

When I run this I get an error:
Msg 102, Level 15, State 1, Line 11
Incorrect syntax near ‘There’.

What's going on? Both my SQL and SQLCMD variables are text, aren't they? Why doesn't that work?

The answer lies in what I said before: the reference to a SQLCMD variable is just a tag to be replaced with the value defined elsewhere in the script. So in the above example what actually gets executed is:

DECLARE @TextVal varchar(30);

SET @TextVal = Hello There;
SELECT @TextVal;

Which isn’t valid SQL. What I should have done in my original SQL is to wrap the reference to the SQLCMD variable in single quotes:

:SETVAR TextVal "Hello There"
DECLARE @TextVal varchar(30);

SET @TextVal = '$(TextVal)';
SELECT @TextVal;

Now it works:

SQLCMD2

I mentioned you could pass SQLCMD variables from the command line – this can be handy if you're executing scripts and you want to (for instance) specify the database name from outside. Watch out though: if you also assign a value in your script then it is the last value assigned that gets used.
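As an illustration, here's a minimal sketch of doing that – the script file name, server name and variable name are all hypothetical:

-- Contents of DeployScript.sql (a hypothetical script file)
USE $(DatabaseName);
SELECT DB_NAME() AS CurrentDatabase;

-- Run from a command prompt, supplying the value with the -v switch:
-- sqlcmd -S MyServer -E -i DeployScript.sql -v DatabaseName="MyTestDatabase"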

I had a developer come to me complaining that SQL wasn't picking up the SQLCMD variable he was passing through the command line; the answer was that he had another value assigned in the script. He thought that was dumb, so I asked the question "What would you expect to happen if you were writing C# code and assigned a value to a variable, and then assigned a new one to it – which would you expect it to hold – the first or the second?"

That doesn’t mean however that assignment of values to SQLCMD variables doesn’t display some counterintuitive behaviour. Look at the following query:

:SETVAR TextVal "Hello There"
SELECT '$(TextVal)';

:SETVAR TextVal "GoodBye"
SELECT '$(TextVal)';

So I set a value in my SQLCMD variable, output it with a select statement, then I change the value and output it again. Let’s look at the output:
SQLCMD3
What the…?! I’ve encountered issues before where I’ve tried to change the value of a variable and – having done something wrong – the value hasn’t updated. But here it looks like the first query is looking into the future!

This goes back to what I said earlier: first the :SETVAR statements are processed and the variable evaluated; only then are the references replaced in the script. This means you can't have changing values for your SQLCMD variable throughout the execution of your script.

You can even see the same behaviour if you do this:

:SETVAR TextVal "Hello There"
SELECT '$(TextVal)';

:SETVAR TextVal "GoodBye"
SELECT '$(TextVal)';

:SETVAR TextVal "See you Later!"

Outputs:
SQLCMD35
I've said you can't change the value of your SQLCMD variable through your script; technically it's more accurate to say you can't have different values within the same batch. So if you separate your script into separate batches using the GO statement, then you get a different result:

:SETVAR TextVal "Hello There"
SELECT '$(TextVal)';
GO

:SETVAR TextVal "GoodBye"
SELECT '$(TextVal)';

Outputs:
SQLCMD4
You might therefore think that the SQLCMD variable is only valid in the context of the batch in which it is defined. So if I remove the :SETVAR in the second batch my script will fail:

:SETVAR TextVal "Hello There"
SELECT '$(TextVal)';
GO

SELECT '$(TextVal)';

Not so:
SQLCMD5
We see from this that a SQLCMD variable is not limited to the scope of a single batch – even though it gets re-evaluated on a batch by batch basis.

I’ll finish with something you might have attempted to do at some point. How about if I conditionally try to change a SQLCMD variable:

:SETVAR TextVal "Hello There"
IF 1=0
BEGIN
   PRINT 'Whoah!'
   :SETVAR TextVal "Maths is Broken"
END;

SELECT '$(TextVal)';

If I’ve not confused you too much with the above examples you can probably figure out what the output will be. That’s right:

SQLCMD6
This has just reminded me of a quote from "The Hitchhiker's Guide to the Galaxy" about the Babel fish, particularly the last line:
“Now it is such a bizarrely improbable coincidence that anything so mind-bogglingly useful could have evolved purely by chance that some thinkers have chosen to see it as the final and clinching proof of the non-existence of God.
The argument goes something like this: "I refuse to prove that I exist," says God, "for proof denies faith, and without faith I am nothing."

“But,” says Man, “The Babel fish is a dead giveaway, isn’t it? It could not have evolved by chance. It proves you exist, and so therefore, by your own arguments, you don’t. QED.”

“Oh dear,” says God, “I hadn’t thought of that,” and promptly vanishes in a puff of logic.

“Oh, that was easy,” says Man, and for an encore goes on to prove that black is white and gets himself killed on the next zebra crossing.”

 

The main take home from all this should be to avoid trying to use a SQLCMD variable like a normal one. Assign it once, at the top of your script or in a command line – then leave it alone!

Implementing Temporal Tables Where You Have Existing Data

In my post about auditing tools in SQL Server I mentioned a few tools we have at our disposal for capturing the change history of data.

One was Change Data Capture, which I went into in a bit more detail in Setting up Change Data Capture (CDC).

I also referred to Temporal Tables which are available to us in the brave new world of SQL 2016.

Temporal tables aren't just about monitoring change; they also provide really nice methods for querying historical data – to see what the values were at a particular point in time, e.g.

SELECT * FROM dbo.SomeData FOR SYSTEM_TIME AS OF '1900-01-01';

This makes them easy to work with, but also opens up other applications for them over and above auditing purposes – for instance being able to maintain configuration data that changes over time, but where we need to reference what the value was historically.

I wanted to look in more detail about the process for changing an existing table so that it becomes a temporal one.

Let’s start by quickly looking at how we would define a temporal table for a fresh deployment:

CREATE TABLE dbo.SomeData(
   Id INT IDENTITY(1,1) NOT NULL
      CONSTRAINT PK_SomeData PRIMARY KEY CLUSTERED,
   SomeText VARCHAR(1000) NULL,
   ModifiedBy VARCHAR(128) NULL,
   StartDate datetime2(7) GENERATED ALWAYS AS ROW START NOT NULL,
   EndDate datetime2(7) GENERATED ALWAYS AS ROW END NOT NULL,
   PERIOD FOR SYSTEM_TIME (StartDate, EndDate)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.SomeData_History));

There are a few things here additional to a normal table definition. First thing to note is we need two columns to support the Temporal versioning. I’ve called my columns StartDate and EndDate, these both have to be of DATETIME2 datatype to the highest precision (100 nanoseconds). These are specified as GENERATED ALWAYS AS ROW START/END so SQL will take care of assigning values when new rows are inserted or updated.

I've specified the PERIOD FOR SYSTEM_TIME, so SQL knows that these are the fields to use for the temporal versioning.

Finally I’ve specified SYSTEM_VERSIONING for the table and what the history table should be called.

When I execute this you can find the table and corresponding history table displayed in SSMS as below:
Tempopral1
We have the table itself, and also the history table has been created for us. This is where the previous versions of rows of data will be stored.

So that’s pretty straightforward – but what if you want to change an existing table?

First of all let’s drop this one:

DROP TABLE dbo.SomeData;

Unfortunately that gives me an error message:
Msg 13552, Level 16, State 1, Line 90
Drop table operation failed on table ‘TestAudit.dbo.SomeData’ because it is not supported operation on system-versioned temporal tables.

As the message says, you can’t just drop Temporal tables. First you must turn system versioning off, then you can go ahead:

ALTER TABLE dbo.SomeData SET (SYSTEM_VERSIONING = OFF); 
DROP TABLE dbo.SomeData;

Dropping the table though leaves the History table behind – now seen as a normal table: Tempopral2

That’s handy as it means you don’t lose the history when the main table is deleted. For now though let’s just delete that as well:

DROP TABLE dbo.SomeData_History;

Now let’s create a new table that will be our “existing table” we want to change. For this example I’ll create a test table with a trigger that maintains a modified date, and throw in 1000 rows:

CREATE TABLE dbo.SomeData(
   Id INT IDENTITY(1,1) 
      CONSTRAINT PK_SomeData PRIMARY KEY CLUSTERED NOT NULL,
   SomeText VARCHAR(1000) NULL,
   ModifiedBy VARCHAR(128) NULL,
   ModifiedDate DATETIME NOT NULL 
      CONSTRAINT DF_SomeData_ModifiedDate DEFAULT (GETUTCDATE())
);
GO

CREATE TRIGGER dbo.TR_SomeData_ModifiedDate
   ON dbo.SomeData 
   AFTER UPDATE
AS
BEGIN
   UPDATE dbo.SomeData 
   SET ModifiedDate = GETUTCDATE()
   FROM dbo.SomeData 
   INNER JOIN Inserted
      ON dbo.SomeData.Id = Inserted.Id;
END;
GO

INSERT INTO dbo.SomeData(SomeText, ModifiedBy)
SELECT TOP 1000 'Blah', 'Me'
FROM sys.objects a CROSS JOIN sys.objects b;

I’m going to complicate things and say that when I turn this into a temporal table I want to keep the existing ModifiedDate column and use that as the start date for my system versioning.

The first thing you might try is to alter the column:

ALTER TABLE dbo.SomeData
ALTER COLUMN ModifiedDate DATETIME GENERATED ALWAYS AS ROW START NOT NULL;

Error:
Msg 13589, Level 16, State 1, Line 44
Column ‘ModifiedDate’ in table ‘TestAudit.dbo.SomeData’ cannot be specified as ‘GENERATED ALWAYS’ in ALTER COLUMN statement.

So I can’t do this in an alter statement. What I’ll actually have to do is create a new column specified to work with the system versioning, copy the data over, drop the old column and rename the new one.

First let’s create the additional columns and define the PERIOD FOR SYSTEM_TIME:

ALTER TABLE dbo.SomeData
ADD 
StartDate datetime2 GENERATED ALWAYS AS ROW START NOT NULL
   CONSTRAINT DF_SomeData_StartDate DEFAULT (GETUTCDATE()),
EndDate datetime2 GENERATED ALWAYS AS ROW END NOT NULL
    CONSTRAINT DF_SomeData_EndDate DEFAULT ('9999-12-31 23:59:59.9999999'),
PERIOD FOR SYSTEM_TIME (StartDate, EndDate);

Note that I’ve had to specify defaults for the new columns so that the existing data gets values populated (I didn’t have to do that when I simply defined the Temporal Table from scratch).

For the StartDate I’ve specified the current UTC date – this could be any date and time that’s not in the future, though note it should be a UTC one. If I’d tried to use GETDATE, as I’m currently on British Summer Time, I would get the following error:
Msg 13542, Level 16, State 0, Line 51
ADD PERIOD FOR SYSTEM_TIME on table ‘TestAudit.dbo.SomeData’ failed because there are open records with start of period set to a value in the future.

For the end date I have put the maximum value a DATETIME2 column can hold; this is required and is a way of saying that this record is (at the moment) valid indefinitely. If I try to put in an earlier date I receive this error:
Msg 13575, Level 16, State 0, Line 51
ADD PERIOD FOR SYSTEM_TIME failed because table ‘TestAudit.dbo.SomeData’ contains records where end of period is not equal to MAX datetime.

Now I’ve got the columns I could try copying my data over:

UPDATE dbo.SomeData SET StartDate = ModifiedDate;

I get another error:
Msg 13537, Level 16, State 1, Line 70
Cannot update GENERATED ALWAYS columns in table ‘TestAudit.dbo.SomeData’.

It turns out you can get around this by temporarily dropping the period for system time:

ALTER TABLE dbo.SomeData DROP PERIOD FOR SYSTEM_TIME;
UPDATE dbo.SomeData SET StartDate = ModifiedDate;
ALTER TABLE dbo.SomeData ADD PERIOD FOR SYSTEM_TIME (StartDate, EndDate);

Your logical thought might be here that I didn’t need to create the period for system time when I altered the table to add the new columns. I could have just added the columns, migrated the data, then only added the period afterwards. Trust me, I tried that too:
Msg 13509, Level 16, State 1, Line 51
Cannot create generated always column when SYSTEM_TIME period is not defined.

From here it turns out to be reasonably plain sailing.

Let’s run the final command to make the table Temporal:

ALTER TABLE dbo.SomeData 
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.SomeData_History));

Here’s what it looks like in SSMS:
Tempopral3
You can see this is the same as the original fresh creation of a temporal table – the history table has been created for us, just with an additional column in both tables.

Let’s remove the extra column and do the final rename:

ALTER TABLE dbo.SomeData DROP CONSTRAINT DF_SomeData_ModifiedDate;
ALTER TABLE dbo.SomeData DROP COLUMN ModifiedDate;
EXEC sp_rename 'dbo.SomeData.StartDate','ModifiedDate','COLUMN';

I was amazed that sp_rename didn't give me an error. In fact, if we look at the table again:
Tempopral4
You can see the column has been renamed in both the main table and the history table. Pretty neat!

I have at this stage forgotten to do one thing. Let’s just try to update a record and see what happens:

UPDATE dbo.SomeData SET SomeText = 'Wibble' WHERE Id = 2;

Msg 13537, Level 16, State 1, Procedure TR_SomeData_ModifiedDate, Line 7 [Batch Start Line 185]
Cannot update GENERATED ALWAYS columns in table ‘TestAudit.dbo.SomeData’.
The statement has been terminated.

Yep, don’t forget to remove your trigger!

DROP TRIGGER dbo.TR_SomeData_ModifiedDate;

And now we’re all done.

 

What if I use SQL Server Data Tools (SSDT) to make my change?

If you’re working with Database Projects in Visual Studio then you will be used to publishing changes straight from there – or deploying them using a DACPAC file.

In that case SSDT handles all this for you – though it does use a slightly different method. Rather than creating a new column, copying that data over, dropping the old column and renaming the new one, it does it at the table level.
That's the same way some other changes are made with SSDT (or if you make them through the GUI in SSMS). The steps are broadly the same, just with tables instead of columns – a rough sketch of the approach follows the list below:

1) Create new temporal table
2) Turn SYSTEM_VERSIONING off
3) Copy across the data from the old table to the new
4) Turn SYSTEM_VERSIONING back on
5) Drop the old table
6) Rename the new table to the original name
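To be clear, I haven't reproduced the script SSDT generates here, but the following is a rough hand-written sketch of that table-level approach so you can see the shape of it (table names are illustrative, and it uses the same period drop/re-add trick we used above):

-- 1) Create the new temporal table alongside the old one
CREATE TABLE dbo.SomeData_New(
   Id INT IDENTITY(1,1) NOT NULL
      CONSTRAINT PK_SomeData_New PRIMARY KEY CLUSTERED,
   SomeText VARCHAR(1000) NULL,
   ModifiedBy VARCHAR(128) NULL,
   ModifiedDate datetime2(7) GENERATED ALWAYS AS ROW START NOT NULL,
   EndDate datetime2(7) GENERATED ALWAYS AS ROW END NOT NULL,
   PERIOD FOR SYSTEM_TIME (ModifiedDate, EndDate)
)
WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.SomeData_History));

-- 2) Turn versioning off (and drop the period) so the period columns can be written to
ALTER TABLE dbo.SomeData_New SET (SYSTEM_VERSIONING = OFF);
ALTER TABLE dbo.SomeData_New DROP PERIOD FOR SYSTEM_TIME;

-- 3) Copy the data across, preserving identities and the existing modified dates
SET IDENTITY_INSERT dbo.SomeData_New ON;
INSERT INTO dbo.SomeData_New(Id, SomeText, ModifiedBy, ModifiedDate, EndDate)
SELECT Id, SomeText, ModifiedBy, ModifiedDate, '9999-12-31 23:59:59.9999999'
FROM dbo.SomeData;
SET IDENTITY_INSERT dbo.SomeData_New OFF;

-- 4) Re-add the period and turn versioning back on
ALTER TABLE dbo.SomeData_New ADD PERIOD FOR SYSTEM_TIME (ModifiedDate, EndDate);
ALTER TABLE dbo.SomeData_New
SET (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.SomeData_History));

-- 5) and 6) Drop the old table and rename the new one into place
DROP TABLE dbo.SomeData;
EXEC sp_rename 'dbo.SomeData_New', 'SomeData';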

If you're doing this by "hand", you could do it this way if you prefer, but it is probably a heavier operation that will require more downtime. For instance you'll need the whole table to be locked while the operation progresses as you don't want to risk data changes once you've started copying the data over. The column method will also lock the table, but potentially there is less to update/copy so it should hopefully be quicker.

As a final point I should note that you may run into problems dropping the trigger with SSDT; it can be a little quirky around doing that. For the moment I'll leave off from going into that in more detail – but if you get that issue feel free to give me a yell!

Setting up Change Data Capture (CDC)

As mentioned in my post Auditing Data Access in SQL Server for GDPR Compliance CDC can be a useful tool for capturing a change history for specific data – which is something we might be thinking about a bit more now the GDPR is heading our way.

It’s pretty straightforward to set up, and can be done easily against existing tables/data – which is handy.

As mentioned in the previous post though, you do need to be on Enterprise Edition or SQL 2016 SP1 or later to use it.

Let’s take a quick look at how you go about setting it up.

First let’s create a table that has some data we want to track:

IF OBJECT_ID('dbo.CDC') IS NOT NULL DROP TABLE dbo.CDC;
CREATE TABLE dbo.CDC 
(
    Id INT IDENTITY(1,1) CONSTRAINT PK_CDC PRIMARY KEY CLUSTERED, 
    SomeText VARCHAR(1000), 
    ModifiedBy VARCHAR(128), 
    ModifiedDate DATETIME
);

Next you have to enable CDC for the database:

EXEC sys.sp_cdc_enable_db;

Then you have to tell SQL to track changes for the specified table:

EXEC sys.sp_cdc_enable_table 
   @source_schema = N'dbo', 
   @source_name = N'CDC', 
   @role_name = NULL;

Note the final parameter @role_name in the above stored procedure. This allows you to specify a database role that will be used to limit access to the change history. I've just set this to NULL, which means that no additional role is required. There are also other parameters you can specify, which you will find defined here:
https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sys-sp-cdc-enable-table-transact-sql

Once you’ve followed those few steps CDC is up and running. The change table can be found under system tables in your database, in the CDC schema, and has the name of the original table with “_CT” appended:

CDC_ChangeTable

You can see some additional columns. Rather than capturing the time of the change we have the start LSN, which is the Log Sequence Number associated with the change. This isn't always that useful to us, which is why it is important for you to capture the time yourself – i.e. the ModifiedDate column.
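That said, if you do ever want to translate an LSN into a time, there is a built-in mapping function you can use against the change table – for example:

-- Translate each change's LSN to the (approximate) commit time of the transaction
SELECT sys.fn_cdc_map_lsn_to_time(__$start_lsn) AS ChangeTime, *
FROM cdc.dbo_CDC_CT;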

Also of interest to us is the _$operation column which tells us what operation was logged:
1 = delete
2 = insert
3 = update (old values)
4 = update (new values)

As you can infer from the above, when we do an update we will have a record for both the before and after value.

_$update_mask could be of use to us in some scenarios as it contains a bit mask based on the column ordinals of the table to show us which columns have changed.

Let’s quickly insert and update some data so we can see an example of what we get in the change table:

INSERT INTO dbo.CDC(SomeText, ModifiedBy, ModifiedDate)
SELECT 'SomeText', 'Matt', GETDATE();

UPDATE dbo.CDC 
SET 
SomeText = 'ANewValue', 
ModifiedBy = 'StillMatt', 
ModifiedDate = GETDATE();

SELECT * FROM cdc.dbo_CDC_CT;

Here’s the output from the select:

CDC_ChangeTable_data

You can see we have one row showing the initial insert (_$operation = 2), one row showing the values before the update (_$operation = 3) and one row showing the values after the update (_$operation = 4).

If you run this test yourself you might find it takes a few seconds for the changes to populate as there can be a lag.
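It's also worth knowing that, rather than selecting from the change table directly, you can query changes through the table-valued functions that CDC creates for you. Here's a quick example using the capture instance from above (the name defaults to schema_table, i.e. dbo_CDC):

-- Get the LSN range covering all changes captured for this capture instance
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_get_min_lsn('dbo_CDC');
DECLARE @to_lsn BINARY(10) = sys.fn_cdc_get_max_lsn();

-- Return all the changes in that range
SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_CDC(@from_lsn, @to_lsn, N'all');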

So how does it work?

CDC works by parsing your transaction log. It looks for changes relating to the tables it is enabled for and then writes those changes off to the change tables. Doing things this way means that it doesn't have to be a synchronous operation, i.e. it can process the transaction log well after your transaction has committed. So your DML (update/insert/delete) operation is not delayed.

The downside of that is that it does have to go through the whole log (though just the part since it last ran) and work out what has occurred, when we look at performance we’ll see that this seems to create additional work over other synchronous change capture operations. However that work happens in the background and can be limited by the settings you use to configure the CDC jobs, so it may be acceptable or even preferable to a direct increase in the transaction time for your data modifications. Like anything in life it’s a trade-off and you’ll need to decide what works best for you.

Other issues can be around log activity and size. Implementing CDC means that you will get increased read activity on your I/O subsystem where the log is stored for the given database(s). So if implementing CDC it will be good to monitor this to make sure your disks can cope. Also the logfiles themselves are likely to get bigger. This latter point is for two reasons, one is that the CDC capture is itself a logged process, the other is that operations that would normally truncate the log or allow space to be re-used cannot do so until CDC has processed that section of the log. So if latency builds up in the CDC capture operations log entries are likely to be retained for longer.
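If you want to keep an eye on how far behind the capture process is running, there's a DMV that reports on recent log scan sessions – a quick sketch:

-- Latency (in seconds) and activity of recent CDC log scan sessions
SELECT session_id, start_time, end_time, latency, empty_scan_count
FROM sys.dm_cdc_log_scan_sessions
ORDER BY start_time DESC;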

CDC is actually managed by a couple of SQL Agent jobs. Two for each database that has CDC enabled. It’s useful to understand these for a few reasons. First of all, obviously if SQL Agent is not enabled then CDC is not going to work. Also you don’t want someone wondering what these jobs are and disabling them. Finally you may find you want to perform additional configuration – either for performance or data retention.

The jobs are:

cdc.{DatabaseName}_capture

This does the actual work by calling a series of system stored procedures that do the heavy lifting. The job doesn't have a schedule; rather, it is set running when you enable CDC and then generally continues running based on the configuration.

There are 4 parameters that can control the capture:

@continuous – This is a bit value. 0 specifies that the capture will run once and then the job will terminate – this is only really useful in testing scenarios. 1 means that the job will run continuously which is the normal operation.

@maxtrans – This determines how many transactions will be read from the log and written to the change tables. This process is referred to as a scan cycle. Note this is done in a single transaction.

@maxscans – This sets how many scan cycles will be carried out before the job pauses.

@pollinginterval – This is the number of seconds the job will pause before running through the specified number of scan cycles once more.

When you enable CDC these parameters get set to defaults – @continuous = 1, @maxtrans = 500, @maxscans = 10 and @pollinginterval = 5.

You can change any of these setting using the sys.sp_cdc_change_job stored procedure. Changes will be made for the job that relates to the current database context. You need to specify that you want to change the capture job, and new values for the parameters you want to change.

So, if I wanted to change @maxtrans to 1000 and increase the polling interval to 10 seconds I would run the following:

EXECUTE sys.sp_cdc_change_job
    @job_type = N'capture',
    @maxtrans = 1000,
    @pollinginterval = 10;

The job must be restarted before the new settings take effect. You can do that using two more system stored procedures:

EXEC sys.sp_cdc_stop_job @job_type = 'capture';
EXEC sys.sp_cdc_start_job @job_type = 'capture';

Here is a Microsoft Whitepaper that goes into detail on tuning CDC using these parameters:
https://technet.microsoft.com/en-us/library/dd266396(v=sql.100).aspx

cdc.{DatabaseName}_cleanup

This is the other job involved in CDC that manages retention of the change data for you. If you are monitoring data that changes a lot then the change table can get large quickly so you need to consider this. This job runs on a schedule – the default is to execute every morning at 2AM. You can change this in the SQL Agent job itself.

There are two additional parameters that control this job:

@retention – the number of minutes that change rows are retained. The default is 4320, which is only 72 hours, so if you are using this for audit then you need to up it considerably. The maximum is 100 years.

@threshold – the maximum number of rows that will be deleted in a single statement on cleanup.

You configure these using the same stored procedure as before, just specifying the cleanup job. So if I want to change the retention to 1 year I’d call:

EXECUTE sys.sp_cdc_change_job   
    @job_type = N'cleanup',  
    @retention = 525600;

Again you need to stop and start the job before changes take effect.

If you want to check the current parameters for either job then the easiest way is to query the system table msdb.dbo.cdc_jobs:

SELECT * FROM msdb.dbo.cdc_jobs;

 

And that's about it. There is of course more depth you can go into to understand CDC, but this should give you a good grounding to get started with confidence.

Auditing Data Access in SQL Server for GDPR Compliance

In relation to the GDPR, I've recently been looking at the tools available within SQL Server to support auditing our databases. Many of you out there will already have full-scale auditing solutions in place. For others though, the prompting of GDPR could be the first time you're really thinking about it.

This post is a brief overview and comparison of the SQL features that might help you out, with some suggestions of what you might want to think about in terms of auditing.

The GDPR has clear requirements that we must be able to provide detail of changes to personal data. That means being able to say who changed an item of data, when they changed it, and what was the before and after value.

The regulation also states that we must ensure that data is only accessed by appropriate parties, and only for the explicit purposes to which the person the data relates has consented. We may feel that this latter requirement is satisfied by our security processes, but in practice, if we wish to prove this then we may also want to audit all data access – including when data is simply read but not changed.

For auditing we need to consider all personal data held, whether that is data provided by an individual or data about an individual that supplements this. Auditing should therefore not be limited to the definition of “personal” within GDPR, namely Personally Identifiable Information (PII). We may also, for our own purposes outside of the GDPR requirement, wish to audit other information related to our business process – for instance where approvals are required.

In terms of individuals accessing or changing data there are several contexts of which we need to be aware:
  • Application users
  • Support access via specific tooling
  • Support access direct to the database
  • Illegitimate access via an application or tooling
  • Illegitimate access direct to the database

Summary of Tools available within SQL Server

The table below lists the technologies we will look at as well as in what versions/editions of SQL Server they are available. Some versions are omitted where the availability of features is the same as in the previous version:

Audit

Profiler Trace

SQL Server traces allow you to capture information about all access to the database/server. As such this technology is sometimes used as part of an auditing solution – in fact both the C2 Auditing and Common Criteria Compliance features utilise it in one way or another.

It is more useful, however, for capturing instance level security events (such as Logons and failed Logins) than for monitoring changes to individual data items.

There can also be quite an overhead to running traces, though this can be minimised if they are created as server-side traces.

Extended Events

Introduced in SQL 2008, Extended Events were intended as a lightweight replacement for Profiler Traces. Even so, they are more appropriate for capturing security events than for monitoring data access and change – although SQL Server Audit, which we'll look at later, is implemented via Extended Events.

C2 Audit

This feature is supported in all versions and editions of SQL Server up to 2017, though it has been marked as deprecated for some time.
C2 was a rating established by the US Department of Defense applicable to security levels of computer systems, mainly focusing on access to resources. The C2 Audit within SQL Server is implemented via server-side traces and captures events such as:

  • Server shutdowns and restarts
  • Successful and failed logins
  • Successful and failed use of permissions when accessing database objects
  • Successful and failed use of permissions when running any command against the database server

It can be seen from the events captured that while this sort of data capture may be useful as part of an overall audit strategy, it won’t cover the audit of individual data required by the GDPR.

If implementing C2 Auditing on an instance, it is important to be aware that it will have some performance impact on a server as each event must be written to the audit file before the operation can complete. In fact, if the audit file cannot be accessed (or disk space runs out) then SQL Server will stop completely until that is resolved. However, it is a limited set of items being audited so the performance impact in general should not be too considerable.

Common Criteria Compliance

Available from SQL 2005 SP1 Enterprise Edition, and in Standard Edition from SQL 2008, Common Criteria Compliance is described by Microsoft as the replacement for C2 Audit.

Enabling this option configures a couple of other behaviours within SQL server to comply with the “Common Criteria for Information Technology Evaluation” International Standard.

From an auditing point of view, it enables a level of login auditing which can be viewed by querying sys.dm_exec_sessions for currently connected users – a sample query follows the list below.

This information includes:

  • Last Successful login time
  • Last unsuccessful login time
  • Number of unsuccessful login attempts since last successful login
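For example, once the option is enabled you can see this for current connections with something like:

-- These columns are only populated when Common Criteria Compliance is enabled
SELECT session_id, login_name,
       last_successful_logon,
       last_unsuccessful_logon,
       unsuccessful_logons
FROM sys.dm_exec_sessions
WHERE is_user_process = 1;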

Triggers

Available in all editions/versions from SQL 2005. Triggers were also available before that, but additional events were added in 2005.

It is a common usage of Triggers to perform some level of database auditing, for instance to capture a modified datetime for a record, or even the identity of the logged-on user. They can also be used to capture more information about changes – for instance to maintain Audit tables as data is inserted, updated or deleted.

The main potential disadvantage of triggers for Auditing is performance. Technologies designed specifically for the purpose may be more lightweight.

Change Tracking (CT)

Available in SQL Server 2008 standard edition onward, CT is not designed to be used for Audit. Rather it is intended to support ETL processes by identifying changes to tables.

When enabled Change Tracking simply identifies that a given row has been inserted, updated or deleted. It doesn’t capture the content of the change, for instance that a given column was changed from value ‘A’ to value ‘B’. It doesn’t maintain a change history, so if I insert a record, update it five times, then delete it, when I check change tracking it will only tell me that it was deleted.

Change Data Capture (CDC)

Available in SQL Server 2008 Enterprise Edition onward, from 2016 SP1 CDC is also available in Standard Edition. Like CT, CDC is not intended as an audit technology – however due to the level of information captured it can be useful in that context.

As well as identifying that data has changed, CDC captures the actual changes and maintains the history in a separate table. Unlike a similar hand-cranked solution using triggers, it manages this asynchronously so has only light performance overhead.

One of the key things missing from CDC from an Audit perspective is that while it captures the data change, it doesn’t also record the Login identity making the change or the datetime for the change.

A Connect item was raised to change this; however, the Microsoft response was to close it as by design, since CDC "is not designed to be an auditing feature". Microsoft's recommendation is to use SQL Server Audit instead:

https://connect.microsoft.com/SQLServer/feedback/details/283707/cdc-options-to-capture-more-data-username-date-time-etc
Despite this, if the table itself has columns to capture the user identity and the modified datetime, and the application ensures these are set, then these columns will be included in the change data capture – in which case a meaningful audit of changes to data is provided.

CDC can be enabled on a per table basis, and you can configure which security role(s) have access to the change log. You can also define the retention period which avoids you having to implement a separate solution for cleaning up old audit data.

One thing to bear in mind is that due to the asynchronous process involved to log changes (which uses SQL Server Agent) there is no transactional guarantee that changes will reach the capture log, though the risk of data loss is likely to be minimal.

SQL Server Audit

Available in SQL Server 2008 Enterprise Edition, server-level auditing came to Standard Edition in 2012, and all features are available in all editions from 2016 SP1.

SQL Server Audit is the out of the box feature that is intended for the purposes of Audit, so logically we may think it is the best one for us to be using. Certainly, it is what Microsoft would recommend for auditing within the database.

Auditing is implemented using Extended Events and there are many Audit events you can enable it to capture. Events are either logged to a file, or to the Windows Security log where it is more difficult to tamper with them. You can then either view the events directly in that destination or through a GUI within SSMS.
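To give a flavour of what's involved, here's a minimal sketch of creating an audit and a database audit specification against a single table – the audit name, database, table and file path are illustrative only:

USE master;
GO
-- Server-level audit object writing to a file (the folder must already exist)
CREATE SERVER AUDIT DataAccessAudit
TO FILE (FILEPATH = 'C:\SQLAudit\');
ALTER SERVER AUDIT DataAccessAudit WITH (STATE = ON);
GO

USE TestAudit;
GO
-- Database-level specification capturing reads and writes against one table
CREATE DATABASE AUDIT SPECIFICATION PersonalDataAccess
FOR SERVER AUDIT DataAccessAudit
ADD (SELECT, INSERT, UPDATE, DELETE ON dbo.SomeData BY public)
WITH (STATE = ON);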

Depending on the scope of the events monitored, you can choose to Audit either at an object level (for instance table by table) or at schema or database level.
In the context of GDPR we may choose to Audit data access – insert, select, update and delete. What this sort of Audit will give us is not, as we might expect, the ability to view who accessed or changed specific data. Rather it captures the queries being executed against the objects monitored.

As such it’s probably not the best method for being able to track a change history for a piece of data, though that could be done in theory by analysing the individual queries. Equally, if we want to know who read a specific record, this would require analysis of the queries executed.

Because of that, while SQL Server Audit may form an important part of an overall Audit strategy, it may not, on its own, give us what we need.

It does however allow us to capture all queries that have, for instance, touched certain tables. As such it is the only technology available to use within SQL Server that may allow us to monitor for abuse of privileges. For instance, I may be the DBA and require admin rights over the SQL Server instance, with SQL Server Audit the queries I execute against tables containing personal information would be captured and any dubious querying patterns could be identified.

An important part of the use of SQL Server Audit is going to be proactively monitoring the logs to make sure data access is being carried out appropriately.

Temporal Tables

Temporal tables were introduced in SQL Server 2016 and are a way of easily tracking changes to data – as well as being able to easily query what the value was at a given point in time.

Like CDC they do not capture the user context making the data change, so this must be handled by additional application logic if we require it.

The logging of change data for temporal tables is a synchronous operation performed as part of the transaction that changes data in the main table. Microsoft documentation however states that this creates negligible overhead for inserts, and while there is greater overhead for updates and deletes this is still significantly less than a custom solution (for instance using triggers).

Temporal tables work by having two user tables for each table being tracked. One is the original table – known as the current table, and the other is the history table which includes start and end times for which a given version of the data was current. New commands for querying the state of data at a given point in time make this a seamless operation.

Application Audit

One issue with any auditing that we perform at the database level is that any well designed application, from a security context, is likely to be accessing the database via a service account, rather than application users being granted direct database access. As such it is not possible for pure database auditing to capture a meaningful user context where operations are conducted through the application.

For instance, it may be quite legitimate from the application for a user to update details relating to a customer. We need to maintain a record of those changes so we require the application to log the user context – usually into an “UpdatedBy” column.

There may also be the case where semi-structured, or unstructured data is stored in blob forms, such as XML, JSON or other document formats. In these cases, a database level audit could capture the before and after states of the data, but it would require additional analysis to identify what actual elements of the data within the document had changed. If we want to provide an easily digestible change history then we need to consider implementing this at an application or tooling level.

Finally, we may want to record user access and activities from the application. For instance, the date and time of connections as well as what was accessed.

What might a best practice SQL Data Access Audit solution look like?

Which (if any) of these technologies we use is going to depend very much on the versions and editions of SQL Server we have installed. In an ideal world, to make our lives easy we would all have Enterprise edition – or be on SQL 2016 SP1 or later – and have an extensive choice. In some cases, it may be that the access to additional security and audit components becomes the compelling reason for choosing Enterprise over Standard edition, or for upgrading to a later version of SQL Server.

C2 or Common Criteria Compliance

It makes sense to implement one of these, ideally Common Criteria Compliance where it is available, as this gives us increased security within SQL Server at little cost and minimal effort to implement. Common Criteria Compliance is preferable both because C2 is deprecated and because a poorly managed C2 audit (for instance losing access to the audit file) can create additional risks to application availability. Both offer additional logging of connections to our SQL Server boxes that may be useful in identifying those logging onto the server for nefarious purposes.

Change Capture

In particular for GDPR we will want to make sure we have some form of capture and logging of changes to personal data. We have a choice of technologies here.

Triggers

Triggers can provide a serviceable solution but the performance overhead is high so it may be better to use a built-in technology where one is available.

CDC

CDC provides a low overhead tool for managing change capture. However, it is only available in Enterprise Edition until SQL 2016 SP1.

Temporal Tables

These are the latest and greatest, so it may make sense to use them where we are on SQL Server 2016, although as the change capture is a synchronous operation it may have more of an immediate performance impact than CDC.

I hope to look at a performance comparison between these different methods of data capture in a later post.

Application Auditing

Whatever tool is used for change capture, application auditing is required to make sure the relevant user context is captured. It’s also worth considering logging user activity.

SQL Server Audit

For a best practice solution, where this is available to us it makes sense to use it. We can use it to capture data access, even though the format is not going to be the easiest to analyse. We should also use it to capture events that could relate to anyone making changes to our change capture. If anyone makes changes to the Audit itself then this gets captured by the Audit, so we can detect tampering. If we make sure the Audit is saved to the Security log then that is also tamper-proof. We can't guarantee that we will capture all events, but we can reasonably guarantee that we will detect it if someone prevents us from doing so (e.g. by turning off Auditing or Change Capture).

Conclusions

There are a number of components you may want to implement in order to deploy an effective data access auditing solution. In some cases your choices may be limited by the versions and editions of SQL your platforms currently sit on.
Where the version and edition of SQL Server we use does support these technologies though, there is not a great deal of effort required to implement them.

In subsequent posts I’ll hope to show some simple examples in practice, as well as look briefly at the impacts on server performance.

When Deadlocks Become Art

We’ve all had to solve deadlock issues. Identify the two conflicting pieces of code, work out an appropriate change to one of them to avoid the issue. Job done.

Occasionally you see a deadlock that has three nodes. Okay, a little bit more analysis and head scratching before the solution presents itself.

What if you have 10 nodes, though? Twenty? Thirty…

Sometimes the best thing to do might be just to sit back and appreciate the deadlock graph as a piece of art.

Here’s one we found related to an attempted DACPAC (SSDT) deployment in a continuous integration environment:

deadlock1

I found something quite satisfying in that picture. Was tempted to print it out and stick it up by my desk.

Later, another colleague was asking for advice on solving deadlocks. I shared the above with him as a fun example, but also to reassure him that "at least the one you're looking at isn't that bad".

About a month later he came back with “I take your deadlock graph and raise you…”:

deadlock2

Now that’s what I call a deadlock!

Can anyone do any better? Either in the realms of the more elegant or the more extreme.

Unfortunately my blogging platform doesn’t allow images in the comments so tweet me if you have one you want to share @MattMcGiffen #whendeadlocksbecomeart

 

How to be a bad interviewer

tsql2sday150x150

In the monthly SQL blogging party that is T-SQL Tuesday (brainchild of Adam Machanic), Kendra Little has invited us this month to talk about interview patterns/anti-patterns.
https://littlekendra.com/2017/08/01/tsql-tuesday-93-interviewing-patterns-anti-patterns/?utm_campaign=twitter&utm_medium=twitter&utm_source=twitter

Before my current role, which I've been in for just over 5 years, I spent most of my career as a contractor, performing shorter roles that lasted from 2 weeks to 2 years. As such I was frequently interviewing several times a year, and I've often sat on the other side of the table as well.

In my experience it's been far more common to find bad, arrogant or disrespectful interviewers than the other way round. My biggest advice to candidates is to know your worth, and to realise that the interview process is as much for you to find out whether this is going to be a decent place to work as it is for them to assess you.

You’re not obligated to accept a job once it’s offered, and if you’ve had to put up with a lot of rubbish during the application process then seriously consider if it’s just going to get worse once you’re in the role.

As an employer, you want the best people, particularly in IT where people are the company’s biggest asset. You need to make sure that your recruitment process is not just about assessing the candidates, but also showing them that your company is going to be a great place to work.

If you do want to put people off then here’s some of the interviewer “anti-patterns” I’ve experienced over my career that made me ask the question “Would I really want to work at this place?”

Maybe hitting one or two is okay, but beyond that you need to consider your self-respect.

Expecting a candidate to fill in a lengthy application form

What’s wrong with a CV? Why is your company so special that I need to sit there for hours with some awful form in Word? You know what? I don’t think I’m that bothered.

Not responding to my application

I sent you my CV and you didn't even acknowledge it. Or you did, and said I'd hear whether I'd been selected by next week – then a month and a half passes and suddenly you decide you want to interview me. You know what, I've probably already found something else, and even if I haven't, you're not coming across as massively competent.

Turning up late for the interview

I’ve taken time out of my busy schedule to come see you guys. I arrive at the interview on time – actually I make sure I’m early – but you keep me waiting half an hour. Obviously your time is more important than mine. Okay there may have been a crisis you had to deal with, or is it just chaos all the time? Not sure I’m getting inspired to work here. Oh, and how would you have reacted if I was that late?

Prove how much better than the candidate you are by asking overly specific questions that few people know the answer to

You get your technical boffin in to review my technical skills, but rather than trying to find out what I know they ask me the most specific esoteric technical questions possible – “What sampling algorithm does SQL use when updating statistics?” Clearly their main interest is in trying to prove they know more than me. Is the rest of the team like this? Not sure I want to work with them. You know what, we all know different things and no-one knows everything. How about discussing some technical scenarios and seeing what I come up with.

Asking standard/stupid questions

"What's your biggest weakness?" No-one is going to tell you they're a lazy alcoholic. Of course we're all perfectionist workaholics. All you learn from asking this sort of question is whether someone has learnt the standard answers. It's just wasting everyone's time.

Requiring visit after visit for subsequent interviews and tests

You got me in for technical test, then you asked me to come back for an interview. Then come back again for an interview with HR, then yet again to meet a member of the senior management team. Why couldn’t you at least do it all on one day? Apart from the fact that my time is precious too, by the time we get to the end of this long drawn out process I’ve already found a better job.

Expecting candidate to provide free work

Before the interview you'd like me to prove my technical skills by creating some functionality for you. I'll just need to give up my weekend to do this free work for you. I've not even got the job yet and you want me to do unpaid overtime. Goodbye!

Not giving a response or feedback when you said you would

We had the interview on Monday and you said I’d hear either way by the end of Wednesday. It’s Friday now and I’ve had no response. Once again you’re not inspiring me to believe this is going to be a great place to work when you don’t honour your basic promises.

This last one is a bit specific, but it did happen to me…

Give someone a C# test when they come for a SQL job, and then when challenged on it insist that they attempt it anyway as you think “it will still have some value.”

I’m ashamed to this day that I didn’t just walk out at that point!

 

Manually updating Statistics

Even though SQL Server automatically updates statistics in the background for you, you may find there are times when you want to manage updating them yourself.

  • You may have large tables and find that the interval between the automatic updates is too big and is resulting in sub-optimal query plans.
  • You might need timely updates to a specific object – maybe as part of an ETL process to make sure that statistics are up to date after a particular part of the process, perhaps after a daily load into a table.
  • You may find that the automatic updates look at too small a sample size and you need to scan more of the table to enable accurate estimates for your queries.

My previous post on the Ascending Key problem demonstrated a common issue where the first of these scenarios could be affecting you. We’ll look at examples of the other scenarios in subsequent posts.
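Before jumping in, it can also be useful to check when a given statistics object was last updated and how many modifications it has seen since. Here's a quick sketch using sys.dm_db_stats_properties (dbo.Test is just an example table name):

-- Last update time, rows sampled and modifications since the last update
-- for all statistics objects on a table
SELECT s.name, sp.last_updated, sp.rows, sp.rows_sampled, sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.Test');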

For now though, let’s just look at how you go about updating statistics.

UPDATE STATISTICS

At the heart of all the methods we’ll look at is the UPDATE STATISTICS command. There are a lot of options for using this command, but we’ll just focus on the ones you’re most likely to use. For full documentation here is the official reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/update-statistics-transact-sql

You can use this at the table level in which case all statistics objects belonging to that table will be updated:

UPDATE STATISTICS dbo.Test;

Or you can specify the specific statistics object you want to update:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2;

FULL SCAN or SAMPLE

The other thing you may be likely to want to specify is whether the statistics should be updated using a full scan of the table, or just be looking at a sample of the rows. In the above examples we didn’t specify this so SQL Server will decide for us. In general sampling (as opposed to full scans) kicks in when we have about 8MB of data in the table (or about 1000 pages).

If you want to specify a full scan the syntax is as follows:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH FULLSCAN;

 

If you want the statistics update to use sampling (more on how this works in subsequent posts) then you can choose to specify a percentage of the total table to be sampled:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH SAMPLE 10 PERCENT;

 

Or you can specify a fixed number of rows:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH SAMPLE 10000 ROWS;

 

You might want to use a sample as, once your tables get large, full scans can take a little time. Equally though, if you're updating statistics at a quieter time (e.g. overnight), you may feel you can afford the extra time for the full scans. Here are some comparison figures I produced on my local machine showing how long full scans take. Obviously this will vary depending on your hardware and other factors, including how much of the table is already in memory:

FullScanTimes

You can see however that the time taken pretty much scales linearly as the table size increases.

One thing to be aware of is parallelism. A full scan can run as a parallel operation on your server which can speed it up considerably. When you update statistics using a sample however this can only run single-threaded unless you’re on SQL Server 2016 (or higher). Sampling data to build statistics in parallel was one of the many excellent little improvements in 2016.

sp_UpdateStats

This is another method you might use for manually updating statistics (perhaps as part of a scheduled maintenance job). This system stored procedure can be used for updating all of the statistics objects in a database:

USE Test;
EXEC sp_UpdateStats;

This stored procedure iterates through your database using a WHILE loop and executes the UPDATE STATISTICS command as it goes. One nifty thing about using this procedure is that it only updates statistics objects where rows have changed, so you don’t have any overhead for refreshing statistics where the underlying data hasn’t been modified. You can see this from this extract of the output of the stored procedure:

Updating [dbo].[TestMemory]
[PK__TestMemo__3214EC070D799003], update is not necessary…
0 index(es)/statistic(s) have been updated, 1 did not require update.

Updating [dbo].[TestMemory2]
[PK__TestMemo__3214EC07D3DC52DE], update is not necessary…
0 index(es)/statistic(s) have been updated, 1 did not require update.

Updating [dbo].[Test]
[PK_Test] has been updated…
[IX_Test_TextValue] has been updated…
2 index(es)/statistic(s) have been updated, 0 did not require update.

 

Ola Hallengren

Of course, if you're looking to implement statistics updates as part of regular maintenance, then you should definitely be considering Ola Hallengren's maintenance solution. Ola maintains a great solution for managing database backups and integrity checks as well as index and statistics maintenance, and he shares it with the SQL community for free.

You can find full details of Ola’s solution and download it here:
https://ola.hallengren.com/
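Just as an illustration of the sort of call you might schedule (check Ola's documentation for the current parameter list rather than taking these from memory), a statistics-only run of his IndexOptimize procedure might look something like this:

--Update statistics in all user databases, only where data has changed,
--and skip index maintenance by setting the fragmentation actions to NULL
EXECUTE dbo.IndexOptimize
    @Databases = 'USER_DATABASES',
    @FragmentationLow = NULL,
    @FragmentationMedium = NULL,
    @FragmentationHigh = NULL,
    @UpdateStatistics = 'ALL',
    @OnlyModifiedStatistics = 'Y';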

SQL Server Agent and Daylight Saving time

There was a bit of chat today on the SQL Community slack about UTC and Daylight Saving Time. It reminded me I have a post in draft somewhere about the complications of dealing with timezones in SQL that I should get around to finishing.

It also reminded me of a fact I found in an old Paul Randal post from 2007 that not a lot of people seem to know, so I thought it was worth sharing again.

I discovered it when I was working on a kind of hand-cranked version of log shipping with a twist for SQL Server. Before you ask me why… don’t! I promise it was for a good reason. At least it seemed like a good idea at the time…

Anyway, I was aware that I needed to work out what I was going to do when the clocks go back in October of each year.

I decided to try and find out what out-of-the-box log shipping does to cater for this. I started with Google and came across this 2007 article by Paul Randal:

How does daylight savings time affect disaster recovery? – Paul S. Randal

Apparently SQL Server Agent pauses for an hour after the clocks go back. In the UK the clocks go back at 2 AM on the last Sunday in October, so all the times from 1 AM to 2 AM effectively happen twice. SQL Server Agent will run jobs from 1 AM to 2 AM as usual, then when the clocks go back from 2 AM to 1 AM, SQL Agent will pause for an hour (i.e. no new jobs will be started) and only resume the second time it gets to 2 AM.

Sounds a bit crazy in some ways, but I tend to believe most things Paul Randal says, and it does seem to have been borne out in practice since.

It means that you don't get a confusing sequence of times in your job history, or other events that don't really make sense, and it also means that jobs won't run twice (e.g. if you had an index rebuild scheduled in that window).

It could cause issues for DR though. Imagine that this period is actually an active one for your application and that you have an RPO (recovery point objective) of a maximum of (let's say) 5 minutes of data loss. During that hour, no log backups are being taken, so suddenly you're at risk of not being able to meet your DR service level agreement. Could be something to watch out for.

For me though (at the time I learnt this) it was great as I didn’t need to worry about coding a workaround in my log restore code!

 

Row-count Estimates when there are no Statistics

I was discussing Cardinality Estimation with a colleague recently and the question came up, what cardinality does SQL Server use if you’re selecting from a column where there are no statistics available? I’ve discovered there are a few algorithms in play depending on how you’re querying the table. In this post we’ll look at where we have a predicate looking for a fixed value.

(If you want the short answer, it’s the fourth root of n cubed before SQL 2014 and the square root of n after that.)

This scenario can occur if you have AUTO CREATE STATISTICS turned off for your database (which we don't recommend, but which you might choose to do anyway) and you query a table with a predicate against a column that has no index defined on it.
Let's look at an example. I'm using SQL Server 2012 for this investigation, and to start with am querying the AdventureWorks2012 database.
I've taken the following preparatory steps (the commands I used are sketched after the list):

  • Set AUTO CREATE STATISTICS OFF for the database
  • Removed the index on the LastName column of the Person.Person table
  • Removed any ad-hoc statistics that existed against the table
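For reference, those steps look roughly like the below. The index and statistics names are assumptions based on a standard copy of AdventureWorks2012, so check sys.indexes and sys.stats for the actual names on your copy:

--Stop SQL Server automatically creating column statistics in this database
ALTER DATABASE AdventureWorks2012 SET AUTO_CREATE_STATISTICS OFF;

--Remove the index that covers the LastName column
--(index name assumed; check sys.indexes for the name in your copy)
DROP INDEX IX_Person_LastName_FirstName_MiddleName ON Person.Person;

--Remove any auto-created statistics against the table, e.g.
--(statistics name is made up; look in sys.stats for names starting _WA_Sys_)
--DROP STATISTICS Person.Person._WA_Sys_00000007_693CA210;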

Then I run a simple query, with the Actual Execution Plan turned on:

SELECT *
FROM Person.Person p
WHERE p.LastName = 'Fox';

I only get one result out as there is only one Fox (Ms. Dorothy J.). Let’s look at the execution plan:

RowCountNoStatistics1

A clustered index scan, as we might expect since I've removed any useful indexes from the table. You'll notice there is a warning. If we view the tooltip you'll see SQL warns us about the lack of statistics:

RowCountNoStatistics2

If we look at the estimated and actual row-counts we’ll see how that has affected us:

RowCountNoStatistics3

In the absence of any useful information – it knows the number of rows in the table but that is about it – SQL has estimated that there will be 1680 Foxes in the table. A bit of playing shows that we get the same estimate whatever value we search for.

If I turn AUTO CREATE STATISTICS on and run the query again then SQL generates a statistics object against the LastName column and comes up with an estimate of 2.3 rows, which is a lot closer.

This matters a lot once you start running more complicated queries. The incorrect estimate is likely to affect the choice of plan that the optimizer makes, and may also affect the amount of memory it requests in order to run the query. Let’s look at a quick example of how the plan changes if we join the above query to another table.

First, without statistics (so I have to turn AUTO CREATE off again, and remove the statistics that got created):

SET STATISTICS IO ON;

SELECT e.EmailAddress
FROM Person.Person p
INNER JOIN Person.EmailAddress e
    ON p.BusinessEntityID = e.BusinessEntityID
WHERE LastName = 'Fox';

Here’s the execution plan:

RowCountNoStatistics4

You can see I’ve got a Merge Join as SQL thinks I’m expecting 1680 rows from the top table. A Nested Loops join would generally be better when I only expect one or two rows from that table.

I’ve also captured the IO so I can see how expensive the query was:

Table ‘EmailAddress’. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 3818, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Let’s look at the behaviour of the same query with statistics creation enabled:

RowCountNoStatistics5

You can see we now have the desired Nested Loops join and the Clustered Index Scan on the EmailAddress table has been changed to a Seek.

The IO output is below:

Table ‘EmailAddress’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 3818, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

There’s not much difference in the overall IO, but you can see the Reads for the EmailAddress table have dropped from 10 to 2 due to the change from the Scan to the Seek. If the table was a lot bigger then we could see a large difference here.

So where does that estimate come from?

I thought I'd have a play and see if I could work out how SQL decided on that estimate of 1680 rows. I did some googling and found a suggestion that it might be a straight 9% of the total number of rows in the table, but that doesn't quite add up. When I compared the same query pattern against a few tables, I found I got a different ratio depending on the number of rows in the table.

So I pumped rows incrementally into a fresh table and looked at the estimate and what the ratio was as the number of rows increased. Here’s my SQL for that:

--Create a Horrible Heap for my testing
CREATE TABLE dbo.TestStats(TestVal INT NOT NULL, TestText VARCHAR(255) NULL);

--Insert a bunch of rows using dodgy old-style cross joins
INSERT INTO dbo.TestStats
SELECT TOP 1 --Number of rows I'm going to insert (varied on each run)
    1, 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d, sys.objects e;

--Clear the plan cache so SQL generates a new estimate
DBCC FREEPROCCACHE;

--Query the table, then go and check the execution plan generated
--for the estimated number of rows
SELECT * FROM dbo.TestStats
WHERE TestVal = 1;

(One thing to note was that I got the same answers whether I was querying the text column or the integer column – SQL seems to use the same algorithm for both.)

I started to notice a pattern quite quickly, that the ratio halved when the number of rows went up by a factor of 16. I then restarted my test, targeting my row-counts to be where the estimated number of rows would be a nice round number. You can see that in the table below:

RowCountNoStatistics6

I then attempted to work out a formula for that. Rather than take you through the shoddy process of mathematics that led me to an answer, I’ll just tell you that the formula came out as:

e = (n³)^(1/4) = n^0.75

Where e is the estimated number of rows for a given predicate value, and n is the total number of rows in the table. I checked that against the full set of results I’d gathered and it held true across all values of n I’d tested.

To check it finally against my original query: the Person.Person table had 19,972 rows. Putting that through the formula gives 1680.027. If we look back at the original estimate you'll see that SQL stated 1680.03, so that all checks out.
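You can verify the arithmetic yourself in T-SQL:

--Legacy cardinality estimate for an equality predicate with no statistics:
--the fourth root of the row count cubed, i.e. the row count to the power 0.75
SELECT POWER(CAST(19972 AS FLOAT), 0.75) AS EstimatedRows;
--Returns approximately 1680.03, matching the estimate in the execution plan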

As I mentioned earlier I was using SQL Server 2012 for this test, and a new Cardinality Estimator came into effect in SQL 2014. So I thought I’d run the test again with SQL 2016 and see if the results changed:

RowCountNoStatistics7

We can see the estimated rows drop off a lot quicker here. Clearly Microsoft have decided to lower the estimate. Actually it is now just the square root of the total number of rows.

e = √n
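If the same Person.Person query were run under the new estimator, the formula would predict something like this (again just verifying the arithmetic, not actual plan output):

--New cardinality estimator (SQL 2014 onwards): square root of the row count
SELECT SQRT(CAST(19972 AS FLOAT)) AS EstimatedRows;
--Returns approximately 141.3, rather than the 1680 the old estimator gave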

Hopefully you're not in the scenario where you regularly have queries running without the appropriate statistics to support them. The above comparison shows us, though, that if you have such a query its behaviour could change dramatically if you upgrade your SQL Server version to 2014 or higher. It could become better or it could become a lot worse.

There are a lot of changes like this that came in with the new version of the Cardinality Estimator in 2014. Places where underlying assumptions have been adjusted to make better guesses about the number of rows that will be returned by an operator. But they are still guesses based on the same information – there is no new data being captured in the Statistics to better inform the process. Of course Microsoft has made these changes to try and better model data out in the wild – but they are still fixed assumptions, which means sometimes they will be better and sometimes they will be worse.

One thing I should re-iterate is that these formulae we've discovered above are for a fairly specific querying pattern. There's no guarantee that the calculation will be the same for a similar, but different, query. It might be interesting to explore that further in a later post.

Also there may be other information in your database – such as constraints – that SQL can use to educate its guesses.

The main takeaway from all of this though, should of course be:

MAKE SURE AUTO CREATE STATISTICS IS TURNED ON FOR YOUR DATABASES!
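If you want to check where you stand and fix any offenders, the following will do it (the database name is just an example):

--Find databases that have auto create statistics disabled
SELECT name, is_auto_create_stats_on
FROM sys.databases
WHERE is_auto_create_stats_on = 0;

--Turn it back on for a given database
ALTER DATABASE AdventureWorks2012 SET AUTO_CREATE_STATISTICS ON;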