Query Store Hints in SQL Server 2022

Another neat little feature in SQL Server 2022 is Query Store Hints. This is the ability to apply a query hint through Query Store rather than having to modify existing code or fiddle around with plan guides.


Query hints are a way to influence optimizer behaviour towards generating desired execution plans for a given query. The word “hint” is a bit of a misnomer as usually they mandate what you wish to happen.


I’m not a big fan of leaping straight to query hints, as they force the behaviour rather than letting the optimizer work out the best plan for your query. This means you may implement hints that seem desirable for the current issue you are trying to fix, but that create problems for different executions of the query, or if the data profile changes. In general it’s better to see if you can rewrite the SQL code to achieve what you want, but sometimes that’s not possible and a hint ends up being the best way to go – or at least a good short-term solution.


The great thing about applying hints through Query Store is that it can be done without altering application code. That means the DBA can apply a temporary hint while waiting for code to be fixed – or, in extremis, use them against third-party applications where there is no ability to get the code changed.


I first read about them in Bob Ward’s book SQL Server 2022 Revealed. You can also read about them in Grant Fritchey’s excellent (and long!) book SQL Server 2022 Query Performance Tuning.


Applying a hint is simple as long as the query has a plan stored in query store. First you need to obtain the query_id for the query. One method for doing that is to use the following query I got from this stackexchange post:


SELECT
qsq.query_id,
qsq.last_execution_time,
qsqt.query_sql_text
FROM sys.query_store_query qsq
INNER JOIN sys.query_store_query_text qsqt
ON qsq.query_text_id = qsqt.query_text_id
WHERE
qsqt.query_sql_text LIKE '%your query text%';

Armed with the query_id, you can then apply the hint using the sys.sp_query_store_set_hints stored procedure. For example, if my query_id was 100 and I want to apply the RECOMPILE hint then I could do that with this SQL:


EXEC sys.sp_query_store_set_hints 100, N'OPTION(RECOMPILE)';

You can also view which queries have hints applied using the sys.query_store_hints view.
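
For instance, this quick query (a minimal sketch) lists each hinted query alongside the hint text:

--View queries with hints applied through Query Store
SELECT query_id, query_hint_text, source_desc
FROM sys.query_store_hints;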


Finally if you want to clear a hint you have previously applied (such as the one above), you can do that as follows:


EXEC sp_query_store_clear_hints @query_id = 100;

Query Store does not support all query hints. You can see the full list of those supported (and not) at:

https://learn.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sys-sp-query-store-set-hints-transact-sql?view=sql-server-ver16

And that’s it – it’s that simple. Thanks for reading!

Encrypting Your SQL Data Should be Your Default Option

Cyber-attacks and data breaches seem to be perpetually on the rise. Every week we hear news of large companies that have had data stolen with both financial and reputational repercussions. It should be clear that access controls aren’t sufficient to fully protect us from attacks and we need a second line of defence – which is encryption.


Brent Ozar ran a Twitter poll a few months ago asking if people are encrypting their databases and backups. We see the results here:

You can see that a small majority of DBAs who follow Brent on Twitter and respond to polls use some form of at-rest encryption. I guess that means that in the larger population the majority aren’t encrypting their data at all.


There are a number of reasons why that might be the case. The only good one is that your database doesn’t hold any personally identifiable information (PII) or sensitive data – which will certainly be true in some cases, but not in most.


Other reasons are likely to be around effort and expense involved. With the advances Microsoft has made in the encryption feature set over the last several versions of SQL Server I’d argue that those reasons are not as valid as they might have once been.


There are four areas you should be thinking about in terms of properly protecting your data.


Passwords
Passwords are the most sensitive pieces of information we hold in our systems. If they are accessed then an attacker can access all of a user’s information in our system, and likely in other services they use if they re-use passwords. We shouldn’t store passwords at all, but should store salted hashes of them instead. Most people are probably pretty good about following good practices around passwords, but if you’re not, then stop reading now and go start a conversation with your team about how you can address that.
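
As an illustrative sketch only (a proper implementation would use a dedicated password-hashing algorithm such as bcrypt or PBKDF2 in the application layer), here’s the general idea of salting and hashing expressed in T-SQL:

--Illustrative only: generate a random salt and store a salted hash, never the password
DECLARE @Password NVARCHAR(128) = N'TheUsersPassword';
DECLARE @Salt VARBINARY(16) = CRYPT_GEN_RANDOM(16);

DECLARE @PasswordHash VARBINARY(64) =
   HASHBYTES('SHA2_512', @Salt + CAST(@Password AS VARBINARY(256)));

--Store @Salt and @PasswordHash against the user and compare hashes at login
SELECT @Salt AS Salt, @PasswordHash AS PasswordHash;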


At-rest Encryption
Protecting our data where it is stored on disk is an important part of an overall encryption strategy. Some people will use disk encryption for this, but it is better if you can use TDE (Transparent Data Encryption), which will protect you against those who gain read rights over the file systems on your servers. I get that TDE can be expensive as it used to require Enterprise Edition, but that changed with SQL 2019 when it was made available in Standard Edition. If you are on an older version then that alone can be a compelling reason to upgrade. Where that is not possible, if you are on SQL 2014 or higher then you can at least use the backup encryption feature to give protection to your backup files.
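
As a minimal sketch of what backup encryption looks like (the database and certificate names here are hypothetical, and you need a Database Master Key and certificate in master first, much like the TDE setup):

--Backup encryption sketch; assumes certificate MyBackupCert exists in master
BACKUP DATABASE MyDatabase
TO DISK = 'C:\Backups\MyDatabase.bak'
WITH COMPRESSION,
   ENCRYPTION (ALGORITHM = AES_256, SERVER CERTIFICATE = MyBackupCert);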


In-transit Encryption
In my experience most people do use TLS to protect their data as it is transmitted across the network, so that is good news. Again, if you’re not doing this then start making plans to change that. TLS is easy to implement and has little downside in return for the protection it offers.


Column Encryption
At-rest and in-transit encryption are great, but they don’t protect you if someone gains direct read access against your database. That’s where column encryption comes in. Column encryption used to require significant effort to implement, but with the introduction of Always Encrypted in SQL Server 2016 it became a lot easier. In many cases data can be encrypted with no code change required in your applications, while giving you the highest level of protection possible. With Always Encrypted, data is protected at rest, in memory and in transit. It only ever exists in unencrypted form when it reaches your application.
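
To give a flavour, here’s a hedged sketch of what an Always Encrypted column definition looks like, assuming a column encryption key called MyCEK has already been provisioned (typically via the SSMS wizard):

--Sketch: a column encrypted with Always Encrypted; MyCEK is assumed to exist
--Deterministic encryption requires a BIN2 collation on the column
CREATE TABLE dbo.Customers (
   Id INT IDENTITY(1,1) PRIMARY KEY,
   NationalIdNumber CHAR(9) COLLATE Latin1_General_BIN2
      ENCRYPTED WITH (
         COLUMN_ENCRYPTION_KEY = MyCEK,
         ENCRYPTION_TYPE = DETERMINISTIC,
         ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
      )
);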


We should use column encryption for our PII and sensitive data. I know Always Encrypted does impose limitations on how you work with your data, but in most cases you’ll have some data where those limitations are not an issue. I’d recommend you start encrypting those items now, and then over time work out how to encrypt the rest bit by bit.


I’m going to go into deep detail about all these forms of encryption on this blog over the coming months. If you want to know more about TDE and Always Encrypted I’ve already got a lot of posts – just use the search bar on the right.


Or, if you want the comprehensive guide to SQL Server Encryption in one go you can buy my book available through the link below:

http://www.dpbolvw.net/click-100730415-12898971?url=https%3A%2F%2Flink.springer.com%2Fbook%2F10.1007%2F978-1-4842-8664-7

Thanks for reading!

My SQL Server Encryption book has just been published!

It’s been a long journey, but I got the email from Apress yesterday to say that my book “Pro Encryption in SQL Server 2022” has now been published.

You can buy a copy at a massively reduced introductory price (nearly 80% off) through the following link:

http://www.dpbolvw.net/click-100730415-12898971?url=https%3A%2F%2Flink.springer.com%2Fbook%2F10.1007%2F978-1-4842-8664-7

If I’ve achieved my aims then this is the most comprehensive resource out there covering the tools available in SQL Server for encrypting your data. I hope to supplement it with a video course covering the same topics sometime in early 2023.

The book covers the following topics:

Part I: Understanding the Landscape
Discusses what we are hoping to achieve through encryption and why. We then look in brief at the tools available and how they should fit into your overall strategy.

Part II: At-Rest Encryption
Here we look at the tools you can use to encrypt your data where it is stored on the disk. We take a deep dive into Transparent Data Encryption (TDE) as well as covering the separate Backup Encryption feature.

Part III: Column Encryption Using Always Encrypted
This part is a comprehensive look at the Always Encrypted feature, introduced in SQL Server 2016, that allows you to encrypt your columns of data with the maximum level of security and the minimum amount of application rework.

Part IV: Column Encryption Using Always Encrypted with Enclaves
Having looked at the “basic” version of Always Encrypted, we now take an in-depth look at how the version with enclaves differs and how you work with it.

Part V: Completing the Picture
In addition to the big features like TDE and Always Encrypted, there are other items that should be included in a comprehensive encryption strategy such as TLS and EKM. We cover those in this part as well as looking at other methods that are available for encrypting data.

Appendixes
In the appendixes I look at how encryption differs (or not) if you are running on a cloud platform as well as exploring encryption algorithms in a little more depth.

Over the next few months I’m going to blog about most of the topics in the book. If you’re interested in encryption though, then now is the time to buy before the publishers put the price up!

Optimized Plan Forcing in SQL Server 2022

I’ve been reading Bob Ward’s book “SQL Server 2022 Revealed” and one of the new features in Intelligent Query Processing (IQP) jumped out at me. Not because it’s going to rock my world but rather it reminded me of an idea I had when I first started playing with Query Store when it came out.

I thought it would be cool if the execution plans cached in Query Store could be used to populate the plan cache after an instance restart, removing the need for all queries to be recompiled. I even raised a suggestion on Microsoft Connect for it to be considered as a new feature.

I won’t claim that anyone at Microsoft saw my post and thought “What a great idea!”, but by coincidence they have implemented in SQL 2022 something similar that works in specific scenarios – Optimized Plan Forcing.

The underlying idea behind it is to store – in Query Store – something called a compilation script that can be used to significantly shorten the compile time for a given query. Any query that runs while Query Store is enabled will get a compilation script stored if it has a significant compile time (the actual threshold is not disclosed). We can see which queries have compilation scripts by looking at the has_compile_replay_script column in the sys.query_store_plan view.
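
For instance, this query (a minimal sketch) shows which plans have a compilation script stored:

--Plans in Query Store that have a compilation replay script (SQL 2022)
SELECT plan_id, query_id, is_forced_plan, has_compile_replay_script
FROM sys.query_store_plan
WHERE has_compile_replay_script = 1;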

If the query plan is forced in Query Store, then the next time it needs to be compiled the compilation script will be used to build the plan instead of going through the usual optimization process, greatly reducing the compile time.

Of course this only works for forced plans, but that makes sense, as for other queries we often want a recompile to come up with a new and hopefully better plan. Still, it’s an interesting feature. Bob talks about how it was designed in response to a pattern of performance problems they were calling a “compile storm”. This would be where the plan cache was emptied and then a lot of queries had to be compiled concurrently, causing CPU spikes. By shortening the compile time for some queries – and of course we DBAs can use Query Store to force as many plans as we like (with caution) – it’s hoped to reduce the occurrence of such events.

I’m not going to go into trying to demonstrate this in action, but Bob has a full demo in his book which you can buy here:

The feature is turned on by default for databases on SQL Server 2022 as long as they have query store enabled – even if they’re not on the latest compatibility level. You can however turn it off for a given database using:

ALTER DATABASE SCOPED CONFIGURATION SET OPTIMIZED_PLAN_FORCING = OFF;

You can also turn it off at the level of an individual query: either using the query hint DISABLE_OPTIMIZED_PLAN_FORCING, or, if you force the plan using the sys.sp_query_store_force_plan system stored procedure, by using its @disable_optimized_plan_forcing parameter.
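
Both approaches look roughly like this, as I understand the syntax (the query, query_id and plan_id here are hypothetical):

--1. Disable it for an individual query via a hint
SELECT SomeColumn
FROM dbo.SomeTable
OPTION (USE HINT ('DISABLE_OPTIMIZED_PLAN_FORCING'));

--2. Disable it when forcing the plan through Query Store
EXEC sys.sp_query_store_force_plan
   @query_id = 100,
   @plan_id = 500,
   @disable_optimized_plan_forcing = 1;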

SQL Server 2022 is Coming

We still don’t have an exact release date for SQL Server 2022 but at this stage we can have strong confidence that it will be with us before the end of the year – rather than slipping and becoming SQL Server 2023.


My first intimation that this was so came earlier this year. I’ve been writing a book for Apress – “Pro Encryption in SQL Server 2022” – that’s due to be released in a couple of weeks. At some point I questioned my editor about whether we should hold off, as I didn’t want to be the author of a book referencing a version of a product that never existed. His answer was that Bob Ward says it’s happening, we’re releasing his book too, and it’s also got SQL Server 2022 in the title.


Given that Bob’s a Principal Architect for SQL Server with Microsoft, that gave me the confidence I needed. Bob’s book “SQL Server 2022 Revealed” was released a couple of weeks ago and goes over all the new features and enhancements in depth. I’ve started working my way through it and can highly recommend you get yourselves a copy. You can currently get it at nearly a 70% discount through the following link:



As well as being a great technologist, Bob is an excellent and engaging writer. In addition to talking about what the new features are, he gives insights into the design thinking that went into creating them, which makes for interesting reading.
Here I reproduce Bob’s diagram that he calls the “Wheel of Power” showing the major new capabilities of SQL Server 2022:

I plan to blog about some of the new features I find most interesting and exciting over the next couple of weeks. I can’t wait to dig into the enhancements to Built-in Query Intelligence and Query Store. There are also enhancements to Always Encrypted, which I’ve written about in my own book.


The cloud connected features are also very interesting and create new options for us in terms of how we manage High Availability and Disaster Recovery, as well as being able to offload parts of our workloads to the cloud.


Stay tuned, or if you can’t wait then grab a copy of Bob’s book and start working through the demos – there’s some great stuff. Like most releases of SQL Server, it’s not a quantum leap in computing, but there’s certainly enough for us DBAs to get our teeth stuck into and start thinking about what it offers us.

Setting up Transparent Data Encryption (TDE)

Transparent Data Encryption (TDE) is one of the easiest ways of encrypting your data at rest. In my previous post we looked at what TDE is and how it works: What is Transparent Data Encryption?

In this post we go through the steps of setting it up.

You can set up TDE when you first create a database, or you can apply it to an existing database. In the latter case, once TDE has been enabled it will set to work encrypting your existing data in the background.

In either case the steps are the same. We’ll run through those quickly before going into more detail.

First you must have a Database Master Key (DMK) in the Master database, and a certificate that will be used by TDE:

USE MASTER;

CREATE MASTER KEY
ENCRYPTION BY PASSWORD = 'UseAStrongPasswordHere!£$7';
CREATE CERTIFICATE MyTDECert 
WITH SUBJECT = 'Certificate used for TDE in the TestTDE database';

This certificate is critical to you being able to access data encrypted by TDE, so you should make sure you back it up:

BACKUP CERTIFICATE MyTDECert   
TO FILE = 'C:\MyTDECert'  
WITH PRIVATE KEY   
(  
    FILE = 'C:\MyTDECert_PrivateKeyFile',  
    ENCRYPTION BY PASSWORD = 'UseAStrongPasswordHereToo!£$7'  
);

Then, in the database being encrypted with TDE you must create a Database Encryption Key (DEK) and specify the certificate:

USE TestTDE;

CREATE DATABASE ENCRYPTION KEY WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE MyTDECert;

Finally, you turn encryption on:

ALTER DATABASE TestTDE SET ENCRYPTION ON;

And that’s all there is to it in practice.

A potential problem is that it is easy to set this up without really understanding it. Maybe that’s fine in many cases, but can you be sure that there’s nothing that can go wrong and have confidence that, whatever the scenario, you will be able to get your data back? And can you be sure that your data is properly protected?

To gain that level of surety and to have confidence, I think it’s best to understand this in a bit more detail. In particular, I think it’s good to understand why each step is required and how the objects created are used.

So let’s go through those steps again in more detail.

Creating the Database Master Key (DMK)

CREATE MASTER KEY
ENCRYPTION BY PASSWORD = 'UseAStrongPasswordHere!£$7';

A DMK is used to protect other keys that are created in the database. It does this by encrypting them and only the encrypted value is stored. You can only have one DMK in a database.

The DMK itself is also stored encrypted – you can see that when we created it we specified a password to encrypt it by. SQL Server also makes a separate copy of the key encrypted by the Service Master Key (SMK). The SMK is the root-level key in SQL Server. This additional copy of the DMK means that SQL can access the actual value of your DMK without you having to specify the password again.
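
If you’re curious, you can see both keys by querying sys.symmetric_keys in the master database (a quick check; the ##…## names are fixed system names):

--View the DMK and SMK in master
SELECT name, algorithm_desc, create_date
FROM master.sys.symmetric_keys
WHERE name IN ('##MS_DatabaseMasterKey##', '##MS_ServiceMasterKey##');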

Creating the Certificate

CREATE CERTIFICATE MyTDECert 
WITH SUBJECT = 'Certificate used for TDE in the TestTDE database';

The certificate is going to be used in the next step down – to protect the Database Encryption Key (DEK) in your TDE enabled database. When you create a certificate, it contains an asymmetric key that can be used for encryption. An asymmetric key includes a public key that can be used to encrypt data and a private key that must be used if you want to decrypt data – that private key gets automatically protected (encrypted) by the DMK.

A logical question to ask is why we need the certificate? Why couldn’t we just protect the DEK in our TDE enabled database with the DMK from the master database directly?

Imagine the scenario where you need to migrate your database to another SQL Server instance. We can do this, but we will also need to migrate the object that was used to protect/encrypt the DEK – which is itself stored in the database.

If TDE used the DMK to protect the DEK then we would need to migrate the DMK to the new instance. But what if the new instance already had a DMK in the master database, and objects that it was used to protect – such as other databases using TDE? At this point we would be stuck: we can’t migrate our DMK without overwriting the one that’s there, so we would have a choice – enable encryption for the migrated database but break it for the existing ones, or vice versa.

Neither is a good option, but with a certificate we can migrate it happily, as we can have as many certificates as we want on an instance.

This also gives us the option that where we have multiple databases encrypted by TDE we could use a separate certificate for each. That means if one certificate is breached the others could remain protected.

This does raise a good point though: you may one day want to migrate your certificate, so call it something more meaningful and unique than “MyTDECert”.

Backing up the certificate

BACKUP CERTIFICATE MyTDECert   
TO FILE = 'C:\MyTDECert'  
WITH PRIVATE KEY   
(  
    FILE = 'C:\MyTDECert_PrivateKeyFile',  
    ENCRYPTION BY PASSWORD = 'UseAStrongPasswordHereToo!£$7'  
);

We backup the certificate in case we ever need to move or restore our database to a different server or SQL instance. It’s critical if we want to be able to access our encrypted data.

When you back it up, you specify a password to encrypt the private key. What happens is SQL grabs the stored version of the private key (which is encrypted by the DMK), decrypts it, then re-encrypts it with the password. This means that you would be able to restore it to a different SQL instance where the DMK doesn’t exist.

This covers us against the scenarios explained above regarding why we use a certificate rather than just relying on the DMK. It should also make it clear that if we need to migrate or recover the database all we need is:

  • The database backup
  • The certificate and key backups and the password used when creating them
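
For completeness, here’s a minimal sketch of restoring the certificate on another instance, assuming the backup files have been copied across and a DMK already exists in master there:

--Restore the certificate and its private key from the backup files
USE master;

CREATE CERTIFICATE MyTDECert
FROM FILE = 'C:\MyTDECert'
WITH PRIVATE KEY (
   FILE = 'C:\MyTDECert_PrivateKeyFile',
   DECRYPTION BY PASSWORD = 'UseAStrongPasswordHereToo!£$7'
);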

Creating the Database Encryption Key (DEK)

CREATE DATABASE ENCRYPTION KEY WITH ALGORITHM = AES_256
ENCRYPTION BY SERVER CERTIFICATE MyTDECert;

In this, almost the final step, we create the actual key that will be used to encrypt the data. It’s stored in the database, but encrypted by the key associated with the certificate created in the previous step. The DEK is a symmetric key, i.e. the same value is used to encrypt or decrypt data.

It’s logical to ask why we don’t just use the private key from the certificate to encrypt the data directly. This is a bit more difficult to justify than the previous scenario with the DMK versus use of a certificate. It is certainly feasible that TDE could have been implemented this way.

One consideration is that the certificate is created with an asymmetric key – these are easier to work with in some ways as we only need the public key to encrypt data, so we can keep the private key concealed most of the time. Asymmetric encryption, however, is slower than symmetric encryption, so to reduce the performance impact of TDE we want to use a symmetric key.

The concept of a DEK was new in SQL 2008 and created specifically for TDE. It makes sense that if we are to have a separate DEK then it should be stored in the database itself. That way migration/recovery is eased as it minimises the number of objects required.

It’s worth noting that you can only have one DEK in each database.

Enabling Encryption

ALTER DATABASE TestTDE SET ENCRYPTION ON;

In sys.databases you can see which databases have TDE turned on by looking at the is_encrypted column:

SELECT name
FROM sys.databases
WHERE is_encrypted = 1;

TDE sys_databases

We can find more details in the sys.dm_database_encryption_keys dynamic management view. Let’s query it, looking at some particular columns of interest:

SELECT
   d.name,
   k.encryption_state,
   k.encryptor_type,
   k.key_algorithm,
   k.key_length,
   k.percent_complete
FROM sys.dm_database_encryption_keys k
INNER JOIN sys.databases d
   ON k.database_id = d.database_id;

Here’s what I see after I created my DEK but before I enable encryption:

TDE_State1

We can see information about the DEK. We also see the encryption state, which describes the current state of the database. The main values you’ll see are:
1 = Unencrypted
2 = Encryption in Progress
3 = Encrypted

If I now enable encryption on this database and run the query again:

TDE_State2

We see that both my database and the TempDB database are now encrypted.

We also see the percent_complete column, which confusingly says zero. This column only has meaning when a state change is occurring. So, if the encryption state was 2 – then we would see a value here whilst the database was in the process of being encrypted. Here my database only had one row, so it was fairly instantaneous to flip encryption on.

This column becomes relevant when we are encrypting an existing database that has a reasonable amount of data. We’ll look at that next:

Encrypting an existing database with TDE

More articles about TDE

What is Transparent Data Encryption?

Understanding Keys and Certificates with Transparent Data Encryption (TDE)

How Secure is Transparent Data Encryption (TDE) – and How to Prevent Hacking

What is Transparent Data Encryption?

Transparent Data Encryption (TDE) was introduced in SQL 2008 as a way of protecting “at rest” data. It continues to be available in all versions of SQL Server right up to the present; until recently it was only available in the Enterprise edition, but from SQL 2019 it was made available in Standard Edition too.

When we talk about “at rest” data we are referring to data that has been written to disk. In terms of our SQL databases that includes:

  • Any data files for our database
  • Any log files for our database
  • All backup files for the database, be they Full, Log or Differential backups
  • Database snapshot files
  • Any data written to disk in the TempDB database

The last item in that list, TempDB, needs to be included for completeness. Imagine that you query your database and as part of the query execution TempDB is used. If that data was written to disk then that creates a hole in our protection, someone could potentially read or copy the TempDB files and might see some of the data we are trying to protect. As a result when you enable TDE against any database on your SQL Server instance, the TempDB database is automatically encrypted as well to prevent this from happening.

Data “at rest” of course doesn’t include the following things:

  • Data loaded/stored in memory (buffer pool)
  • Data returned from a query and being passed across the network
  • Data received by a client as a result of a query

If you want to cover those scenarios as well then you need to look at other forms of encryption e.g. TLS and Always Encrypted.

There are also some less obvious exceptions which occur where SQL doesn’t use the buffer pool – and therefore there isn’t an in-memory version of the data:

  • Filestream data
  • Data persisted to disk using Buffer Pool Extensions

And there are a couple of other exceptions that can occur in certain circumstances:

  • Where the buffer pool gets paged to disk due to memory pressure
  • SQL dump files when there is a crash

That’s summarised in the below diagram:

TDE Encrypted vs Unencrypted

TDE mainly uses standard encryption protocols based on AES (Advanced Encryption Standard). When you set up TDE you can specify which AES algorithm you wish to use: AES_128, AES_192 or AES_256. In each case the number specifies the length, in bits, of the key to be used for encryption.

Obviously the longer your key, the harder the encryption should be to crack; however, even for AES_128, estimates of how long it would take to break the key by brute force vary from a thousand years to numbers many times greater than the age of the universe – trillions of years. The difference is based on how we anticipate processing power growing in the future. Even with the lowest estimates, AES_128 should be sufficient in most scenarios, but most people seem to go for AES_256, which should take the square of that time to beat.

Up to 2016, SQL Server also supported the TRIPLE_DES_3KEY encryption protocol. This is now generally not considered to be as secure as AES, and from SQL 2016 its use is deprecated. So it is best to stick to AES, even if you are on a SQL version where Triple DES is an option.

Let’s have a look at contents of some SQL data files so you can see the difference with and without TDE. I’ve created a database with a single table and inserted a row of data:

CREATE DATABASE TestTDE;
USE TestTDE;
CREATE TABLE dbo.SomeData(Id INT IDENTITY(1,1), SomeText VARCHAR(255));
INSERT INTO dbo.SomeData (SomeText) VALUES('This is my data');

I’ll close my connection to the database, and detach it so I can open the files in a Hex Editor.
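
If you want to do the same, this is roughly how I detached it (a minimal sketch; SINGLE_USER with ROLLBACK IMMEDIATE kicks out any remaining connections first):

--Detach the database so the files can be opened directly
USE master;
ALTER DATABASE TestTDE SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
EXEC sp_detach_db @dbname = 'TestTDE';

Then I search for my text in the data file: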

TDE_MDF_File_Unencrypted

As you can see the data is stored clear as day in the data file.

Now let’s look at the same data file once TDE has been enabled. This time if I search for my string it’s not found and my data looks like this:

TDE_MDF_File_Encrypted

Even where the previous file was all zeros where there was free space at the end, the encrypted version also has those encrypted:

TDE_MDF_File_Encrypted_End

TDE works by using an encryption key that is stored in the database being encrypted – but that key is itself stored encrypted by an object outside of the database. We’ll see all the various objects involved when we look at setting up TDE next:

Setting up Transparent Data Encryption (TDE)

More articles on TDE:

Encrypting an existing database with TDE
Understanding Keys and Certificates with Transparent Data Encryption (TDE)
How Secure is Transparent Data Encryption (TDE) – and How to Prevent Hacking

Setting the Sample Rate for Automatic Statistics updates

Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deeper understanding of how statistics work really helps when you are performance tuning.

I talked a few posts ago about Automatic Sample Sizes for Statistics Updates. From SQL 2016 CU4 we’ve been able to override that. You can manually update a statistics object specifying a given sample rate, and specify that the same sample rate will be used for any subsequent automatic updates on the same object.

You can’t change the default sample rate globally, but this functionality is actually more useful. It means that if you find that you have a particular statistics object where the automatic sampling rate is too low, and that has a negative effect on query performance, then you can choose a sampling rate specific for that object.

Let’s take a quick look at doing this in practice.

Here’s some reused SQL from the previous post to create a table and throw in a million rows:

--Create table for Testing
CREATE TABLE dbo.Test(
   Id INT IDENTITY(1,1) CONSTRAINT PK_Test PRIMARY KEY CLUSTERED, 
   TextValue VARCHAR(20) NULL
);

--Create index on TextValue
CREATE INDEX IX_Test_TextValue ON dbo.Test(TextValue);

--Insert a bunch of rows
INSERT INTO dbo.Test(TextValue) 
SELECT TOP 1000000 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d; 

I haven’t updated the stats yet. As we can see if I view them:

--View the statistics
DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER;

SetStatsSampleRate1

The stats don’t have any information yet, but there is still one thing to notice. See that final column, Persisted Sample Percent? That was added to support this functionality.

The stats will only get updated by the auto stats mechanism when I run a query against the table and the optimiser is interested in them but sees they’re stale. From zero to a million rows should be stale enough. Let’s run a quick query:

--Random Query
SELECT * 
FROM dbo.Test
WHERE TextValue = 'not blah';

And view the stats again:

SetStatsSampleRate2

We can see the stats got updated and the table was big enough that SQL decided to sample rather than scan the whole index. About 42% of the rows got sampled. Persisted Sample Percent is set to zero, i.e. it is not set.

Let’s say that I want to make sure that this statistics object always updates with a full scan.

What I can now do is manually update the statistics and specify that the sample percentage should be persisted:

--Update the stats and persist the sample rate
UPDATE STATISTICS dbo.Test IX_Test_TextValue 
WITH FULLSCAN, PERSIST_SAMPLE_PERCENT = ON;

Let’s check that did what we wanted by viewing the stats:

SetStatsSampleRate3

Great, we can see all the rows were sampled and the Persisted Sample Percent is now set to 100.
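
As an aside, if you later decide you don’t want the sample rate persisted any more, my understanding is you can switch it back off in the same way:

--Turn off the persisted sample rate, reverting to default sampling behaviour
UPDATE STATISTICS dbo.Test IX_Test_TextValue 
WITH FULLSCAN, PERSIST_SAMPLE_PERCENT = OFF;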

Now let’s see what happens when another auto update kicks in. I insert another million rows with the same SQL from above. Then I run the select query again. The optimiser notices the statistics are now stale and so it updates them and recompiles the plan for the query. We can see this when we check the stats yet again:

SetStatsSampleRate4

SQL has done exactly what we wanted: it performed the auto stats update in the background and used the 100% persisted sample rate.

Thanks for reading.

Viewing the Statistics Objects Used to Create an Execution Plan

Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deep understanding of how the SQL Server Optimizer interacts with statistics really helps when you are performance tuning.

One thing that can be useful when looking at an execution plan is to understand what statistics objects the optimizer used to come up with the plan. In this post we look at how that can be achieved using the undocumented trace flag 8666, which can be used to save internal debugging information into the plan XML – including details of the statistics objects used.

There are actually a few other trace flags that do similar things but this seems the simplest and the one that works across the most versions of SQL Server. It looks like it was introduced in SQL 2008 and works on all versions up to and including SQL 2016 – but it doesn’t work on SQL 2005.

Note that this technique only works for plans generated once the trace flag is on, so you can’t view the additional information for existing plans in your cache.

Here’s a quick example using it against the AdventureWorks2012 database. You need to make sure you’ve selected the option to show the actual plan, then run the query as below:

--Turn the trace flag on
DBCC traceon (8666);

--Run the query
SELECT *
FROM Person.Person
WHERE LastName LIKE 'Smith%';

--Turn the trace flag off again
DBCC traceoff (8666);

This query produces the following plan:

optimizer_stats_plan

If you right-click over the SELECT operator and select properties you see the following:

optimizer_stats_plan_props

You’ll notice an extra field “Internal Debugging Information” which contains a bunch of unformatted XML. Rather than grabbing that string and formatting it to make it readable, you can right-click back on the plan itself and select “Show Execution Plan XML…” to view the same information in a more friendly format.

If you scroll down through the XML generated you will get to the following section:

<ModTrackingInfo>
   <Field FieldName="wszStatName" FieldValue="_WA_Sys_00000007_693CA210" />
   <Field FieldName="wszColName" FieldValue="LastName" />
   <Field FieldName="m_cCols" FieldValue="1" />
   <Field FieldName="m_idIS" FieldValue="2" />
   <Field FieldName="m_ullSnapShotModCtr" FieldValue="19972" />
   <Field FieldName="m_ullRowCount" FieldValue="19972" />
   <Field FieldName="ullThreshold" FieldValue="4494" />
   <Field FieldName="wszReason" FieldValue="heuristic" />
</ModTrackingInfo>

<ModTrackingInfo>
   <Field FieldName="wszStatName" FieldValue="IX_Person_LastName_FirstName_MiddleName" />
   <Field FieldName="wszColName" FieldValue="LastName" />
   <Field FieldName="m_cCols" FieldValue="1" />
   <Field FieldName="m_idIS" FieldValue="7" />
   <Field FieldName="m_ullSnapShotModCtr" FieldValue="19972" />
   <Field FieldName="m_ullRowCount" FieldValue="19972" />
   <Field FieldName="ullThreshold" FieldValue="4494" />
   <Field FieldName="wszReason" FieldValue="heuristic" />
</ModTrackingInfo>

Each ModTrackingInfo node displays the information about one statistics object that has been referenced. The wszStatName shows us the name of the Statistics object. We have two in this case, _WA_Sys_00000007_693CA210 and IX_Person_LastName_FirstName_MiddleName. The first is an auto-generated one on the LastName column (I happen to know this was generated before I created the index referenced). Interestingly SQL has looked at both – even though they contain pretty much the same info and one isn’t really required.

Why is this useful?

Well, apart from simply the interesting aspect of seeing some of what is going on in the background when SQL decides what plan to use for your query, it is often the case with poor performing queries that there is bad cardinality estimation going on. Sometimes this might be because the statistics aren’t accurate and would benefit from being updated. You can use this technique to see what statistics objects are being used for the estimation (where it is not just obvious) and then you can look at the objects themselves and see whether they might want refreshing, or whether there may be other issues with your data – such as a skewed distribution – that make estimates difficult.

I see growing opinion in the SQL world that the refreshing of statistics is often more important than rebuilding indexes to reduce fragmentation – the latter operation will also refresh the statistics but is a lot more resource intensive. You also will commonly have statistics on columns that are not indexed, and the threshold for them to be refreshed can be pretty large when you have a lot of data. You can see more information about manually refreshing statistics in my previous posts on the subject:

When do Statistics Get Updated?

Manually updating Statistics

Automatic Sample Sizes for Statistics Updates

Automatic Sample Sizes for Statistics Updates

Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deep understanding of how statistics work really helps when you are performance tuning.

I mentioned in my previous post about manually updating statistics that you can specify whether they’re updated using a full scan, or you can specify an amount of data to sample, either a percentage of the table size, or a fixed number of rows. You can also choose not to specify this, and SQL Server will decide for you whether to do a full scan, or to sample a certain amount of data.
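
For reference, the three explicit options look something like this (a quick sketch against the test table and index created below):

--Update statistics with a full scan
UPDATE STATISTICS dbo.Test IX_Test_TextValue WITH FULLSCAN;

--Update statistics sampling 10 percent of the rows
UPDATE STATISTICS dbo.Test IX_Test_TextValue WITH SAMPLE 10 PERCENT;

--Update statistics sampling a fixed 10,000 rows
UPDATE STATISTICS dbo.Test IX_Test_TextValue WITH SAMPLE 10000 ROWS;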

I thought it would be interesting to look at what the sample sizes are that SQL will choose to use, depending on the amount of data in your table. Note that this occurs if you update statistics without specifying how they should be sampled as below:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2;

This is also the behaviour you will get when SQL updates statistics through the auto-stats mechanism. The fact that auto-stats may sample at a lower rate than is optimal for a given table and the queries against it is another reason you may choose to perform manual statistics updates.

To test this, I created a table and progressively pumped data in. Then after I inserted each batch of rows:

  • I ran a stats update, capturing the CPU time taken
  • Checked the statistics to see what sample size was used in the update
  • Checked the size of the index

Here’s some of the code I used for the test:

--Drop table if exists
IF (OBJECT_ID('dbo.Test')) IS NOT NULL DROP TABLE dbo.Test;

--Create table for Testing
CREATE TABLE dbo.Test(
   Id INT IDENTITY(1,1) CONSTRAINT PK_Test PRIMARY KEY CLUSTERED, 
   TextValue VARCHAR(20) NULL
);

--Create index on TextValue
CREATE INDEX IX_Test_TextValue ON dbo.Test(TextValue);

--Insert a bunch of rows
INSERT INTO dbo.Test(TextValue) 
SELECT TOP 100000 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d;  

--Update statistics without specifying how many rows to sample
SET STATISTICS TIME ON;
UPDATE STATISTICS dbo.Test IX_Test_TextValue;
SET STATISTICS TIME OFF;

--View the statistics
DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER;

--Check the size of the index
SELECT
    i.name                  AS IndexName,
    SUM(s.used_page_count)   AS Pages
FROM sys.dm_db_partition_stats  AS s 
JOIN sys.indexes                AS i
ON s.[object_id] = i.[object_id] AND s.index_id = i.index_id
WHERE i.name = 'IX_Test_TextValue'
GROUP BY i.name

The results of my testing are shown in the below table:

StatisticsSamplingRates

You can see that we have a full sample being taken for the statistics updates up to 400,000 records (896 pages), but that once the table size hits 500,000, sampling is happening. If you look at the number of pages you will see we now have over 1,000 pages – 1,000 pages being about 8MB of data, which is the threshold at which sampling kicks in.

I wasn’t able to find a nice neat formula to determine the sampling rate based on table size, but if we look at the above figures there are still some insights to be gained. The main one is that you’ll notice that even as we double the table size, the number of rows sampled doesn’t go up by much. For instance from 500,000 to a million rows, only 10,000 more rows are sampled. This also means that even for pretty large tables, the update isn’t taking long – another reason why it’s worth leaving auto stats updates enabled and running synchronously with queries – they’re generally not going to take that long.

Another insight is that the percentage of rows sampled drops off very quickly. As the sample size doesn’t really increase that much even when the table size doubles – the percentage sampled has almost halved each time.