When do Statistics get updated?

June 28, 2023June 28, 2023 ~ Matthew McGiffen ~ 5 Comments

Statistics objects are important to us for allowing SQL to make good estimates of the row-counts involved in different parts of a given query and to allow the SQL Optimiser to form efficient execution plans to delivery those query results.

Statistics get updated automatically when you rebuild (or re-organise) an index they are based on – but we only tend to rebuild indexes that are fragmented, and we don’t need fragmentation for statistics to be stale. We also may have many auto-created statistics objects that are not related to an index at all.

It’s generally recommended to have the database level setting AUTO_UPDATE_STATISTICS turned on, so that SQL can manage the process of keeping statistics up to date for us. The only excuse to turn it off is that you are managing the updates to stats yourself in a different manner. And you can always turn the auto update off at an individual table or statistics level if you need to, rather than for the whole database.

SQL Server has had the ability to automatically update statistics since version 7.0. Nonetheless for a long part of my career working with SQL Server, whenever a performance issue raised its head everyone’s knee-jerk response would be “Update Statistics!” In most cases though the people shouting that didn’t really understand what the “Statistics” were, or what mechanisms might already be in place for keeping them up to date.

Of course SQL Server isn’t perfect and sometimes it is helpful for human intelligence to intervene. But to provide intelligent intervention one has to understand how things work.

So how does the automatic updating of statistics work?

In the background SQL maintains a count of changes to tables that might affect statistics. This can be updates, inserts or deletes. So if I inserted 100 records, updated 100 records and then deleted 100 records, I would have made 300 changes.

When SQL forms an execution plan for a query it references various distribution statistics objects to estimate row-counts and to use that to try find the best plan. The statistics objects it looks at are referred to as being “interesting” in the context of the query.

Before using values from the statistics, the Optimizer will check to see if the statistics are “stale”, i.e. the modification counter exceeds a given threshold. If it does, SQL will trigger a resampling of the statistics before going on to form an execution plan. This means that the plan will be formed against up to date statistics for the table.

For subsequent executions of the query, the existing plan will be loaded from the plan cache. Within the plan, the Optimiser can see a list of the statistics objects that were deemed “interesting” in the first place. Once again it will check each of them to see if they are “stale”. If they are, an auto-update of the statistics object(s) will be triggered and once that is complete the plan will be recompiled, in case the updated statistics might suggest a better way of executing the query. Equally, if any of the statistics objects have been updated since the last execution then the plan will also be recompiled.

One important caveat to this is the database level setting AUTO_UPDATE_STATS_ASYNC (Asynchronously). Generally it is best to have this turned off, in which case the above behaviour is observed. If you turn it on however, in the case of stale stats the query execution will not wait for the stats to be updated, but will start them updating in the background while the query executes. The plan will only recompile to be based on the new stats at the next execution.

From SQL Server2008 R2 SP2 and SQL Server 2012 SP1 we had a new DMF (Dynamic Management Function) sys.dm_db_stats_properties that allows us to see how many row modifications have been captured against a given statistics object as well as when it was last refreshed, how many rows were sampled etc. Modifications are captured on a per column basis (though when statistics were originally introduced in SQL Server it was per table) so the counter will only be affected if the leading column for the statistics object has been affected by a given operation.

SELECT s.name AS StatsName, sp.* FROM sys.stats s CROSS apply sys.dm_db_stats_properties(s.OBJECT_ID, s.stats_id) sp WHERE s.name = 'IX_Test_TextValue'

Results:

So what are the thresholds?

For a long time the thresholds were as follows. Statistics were considered stale if one of the following was true:

The table size has gone from 0 rows to more than 0 rows
The table had 500 rows or less when the statistics were last sampled and has since had more than 500 modifications
The table had more than 500 rows when the statistics were last sampled and the number of modifications is more than 500 + 20% of the row-count when the statistics were last sampled (when talking about tables with larger row-counts a lot of the documentation just describes this as 20% as the additional 500 becomes less and less relevant the larger the number you are dealing with).

Those thresholds did mean that when a table had a large number of rows, Statistics might not get updated that often. A table with a million rows would only have stats updated if about 200,000 rows changed. Depending on the distribution of the data and how it is being queried this could be a problem.

So, in SQL 2008 R2 SP2 Microsoft introduced Traceflag 2371 which when set would reduce the stale statistics threshold for larger tables. From SQL 2016 this is the default functionality.

That adds the following test for statistics being stale:

If the number of rows (R) when the statistics were last sampled is 25,000 or more and the number of modifications is more than the square root of R x 1000:

Now, I’m just going to correct myself here, the documentation I’ve found SAYS the threshold is 25,000 but when I started to have a play that didn’t seem to be the case at all.

What actually seems to happen is that whichever of the two estimates is smaller gets used i.e

Either:

Or:

Whichever is smaller.

I don’t know if this means that both get evaluated and the smaller is used, or if the threshold between the two rules is simply defined at the point where the second formula gives the smaller result – which is after 19,682 rows. I discovered that threshold by solving where the two equations above would give the same result – then by experimenting to prove it in practice.

I think this incorrect stating of 25,000 as the threshold probably comes from confusion, taking an approximation (20%) as the actual figure. Remember I mentioned that people often generalise to say that statistics are stale after 20% of the rows change, and forget about the extra 500 rows. If that was true and it was exactly 20%, then the threshold would be 25,000 as that would be the point that both equations are equal.

Anyway it’s not even vaguely important to know that. I just found it interesting! Note that the tests above were carried out on SQL Server 2012 SP3 so could well be different on later versions.

To more visually understand the above rules, here’s a table showing the thresholds for some example table sizes under both the Old algorithm (without the traceflag) and the New algorithm (with the traceflag or on SQL 2016 or later).

R is the number of rows when the statistics were last sampled and T is the number of modifications for statistics to be considered stale:

You can see for the larger table sizes there is a massive difference. If you’ve got large tables you’re querying against and are having to update the statistics manually to keep them fresh then you may find implementing the traceflag is a help.

For large tables statistics are sampled when being updated rather than the whole table being necessarily being read. I have details on that in this post:

Automatic Sample Sizes for Statistics Updates

Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.

What are Statistics in SQL Server?

June 21, 2023June 21, 2023 ~ Matthew McGiffen ~ 1 Comment

Statistics are vitally important in allowing SQL Server to find the most efficient way to execute your queries. In this post we learn more about them, what they are and how they are used.

Cardinality Estimation

Cardinality is a term originally from Mathematics, generally defined as “The number of objects in a given set or grouping”. In SQL we’re continually dealing with sets so this becomes a very relevant topic, which in our context is just the “number of rows”.

When you have a query across multiple tables there any many ways in which SQL Server could decide to physically go about getting you the results you want. It could query and join the tables in any order and it could use different methods for matching records from the various tables you have joined together. It also needs to know how much memory to allocate for the operation – and to do that it needs to have an idea of the amount of data generated at each stage of processing.

A lot of this requires cardinality estimation, and SQL Server uses something called Statistics objects to perform that calculation.

Let’s look at a simple example:SELECT * FROM Person.Person p INNER JOIN Person.[Address] a ON p.AddressId = a.AddressId WHERE p.LastName = 'Smith' AND a.City = 'Bristol'

When it comes to gathering the results for this query there are a number of ways the database engine could go about it. For instance:

a) It could find all the records in the Person table with a LastName of Smith, look each of their addresses up and return only the ones who live in Bristol.
b) It could find all the Addresses in Bristol, look up the people associated with each address and return only the ones called Smith.
c) It could grab the set of people called Smith from the People table, grab all the addresses in Bristol, and only finally match up the records between those two sets.

Which of those operations is going to be most efficient depends very much on the number of records returned from each table. Let’s say that we have a million people called Smith, but there’s only one address in our whole database that is in Bristol (and let’s say that address does actually belong to someone called Smith).

In the first method above I would grab the million Smiths and then look their address up one by one in the address table until I found the one that lived in Bristol.

If I used method b) though, I would find the one matching address first, and then I would simply look up the owner of that address. Clearly in this rather contrived example, that’s going to be a lot quicker. So if SQL knows ahead of time roughly how many records to expect from each part of the query, hopefully it can make a good decision about how to get the data.

But how can it work out how many rows will be returned without actually running the query?

Statistics

That’s where statistics objects come in. SQL Server maintains in the background data that equates to a histogram showing the distribution of the data in certain columns within a table. It does this any time you create an index – statistics will be generated on the columns the index is defined against, but it also does it any time it determines that it would be useful. So if SQL encounters a Where clause on Person.LastName – and that column isn’t involved in a useful index, SQL is likely to generate a statistics object to tell it about the distribution of data in that column.

I say “likely to” because it actually depends on the settings of your SQL instance. Server configuration is beyond the scope of this post but suffice to say you can let SQL automatically create Statistics objects – or not. You can let it automatically update them when the data has changed by more than a given threshold – or not. And you can specify whether updates to statistics should happen asynchronously or synchronously – i.e. in the latter case if your query determines that statistics needs updating then it will kick that off and wait until the update is complete before processing the query.

It’s generally recommended that auto creation and updating is on, and async updating is off.

Viewing Statistics Objects
Let’s have a look at some actual statistics and see what they hold. There are a couple of ways of doing this, the first is through SSMS. If you look under a table in the object browser you will see a Statistics folder which holds any statistics objects relating to that table:

In the above example you can see some that have friendly names, these are Statistics that are related to an actual index that has been defined on the table and they have the same name as the Index – e.g. IX_Address_StateProvinceId.

You’ll also see we have some prefixed _WA_Sys and then with some random numbers following. These are statistics objects that SQL has created automatically on columns that aren’t indexed – or at least they weren’t indexed at the time the Statistics objects were created.

You can open these up with a double-click and see what’s inside:

This is the General tab you open up to. You’ll see it tells you what table the Statistics are for and what column(s). There are options for adding columns and changing the order – but you never really need to do this – as well as information to tell you when the statistics were last updated, and a check box if you want to update them now.

In the details tab there’s a lot more info:

I don’t find this the easiest display format to work with though, so rather than delving into what everything means here let’s look at the other way you can view statistics, which is by running the following command:

DBCC SHOW_STATISTICS('Person.Address', '_WA_Sys_00000004_164452B1')

The format is straightforward, you just specify the table you are interested in the Statistics for, and the actual name of the Statistics object you want. You can see the same information as if you double-clicked on it, but the results are output in the results pane like any other query and are (I think) a lot easier to read. Allegedly there will soon be a third way in SQL Server to view Statistics as DBCC commands are considered a bit “clunky” – but we don’t know what that will look like yet.

The command outputs three resultsets:

This post is just an introduction to statistics – and generally you don’t need to know that much, it’s just handy to understand the basics. So let’s just run over the key bits of information you can see above:

First of all in the first recordset – otherwise know as the…

Stats Header

Rows – is the number of rows in your table

Rows Sampled – this is how many rows were sampled to generate the statistics. SQL can generate or update statsitics using sampling rather than reading all the rows. In this case you’ll see it did actually read the whole table.

Steps – If you imagine the statistics as a bar chart – this is the number of bars on the chart. Statistics objects have a maximum of 200 steps so if you have more distinct values in your column than that they will be grouped into steps.

Density – This is supposed to be the probability of a row having a particular value (calculated as 1 / Number of Distinct values in column). According to books online “This Density value is not used by the query optimizer and is displayed for backward compatibility with versions before SQL Server 2008.” I am using SQL 2012, and this number is just plain incorrect so don’t use it…

Recordset Number 2: The Density Vector

All Density – this is the accurate version of the Density statistic described above. So your probability of a given row having a specific value is about 0.0017. That’s a little less than one in 500. I happen to know there are 575 different Cities in the table so that makes sense. Sometimes SQL will use this value to form a plan – if it knows you’re going to search this table for a specific City and it doesn’t know that City when it makes the plan, then it could guess that about 1/500th of the rows will match your criteria.

Average Length – Is what it says on the can. The average length of data in this column.

Columns – The names of any column measured in this statistics objects. You can have statistics across multiple columns but I’m not going to cover that in this post. In this case it tells us these statistics are based on the “City” column.

Recordset Number 3: The Histogram

This last recordset shows the distribution of the data, and is what you could effectively use to to draw a graph of the relative frequencies of different groups of values. Each row represents a step – or bar on the bar chart – and as mentioned above there can be a maximum of 200 steps. So you can see, a statistics object is quite lightweight, even for a massive table.

RANGE_HI_KEY – This upper limit of each step, so each step contains all the values bigger than the RANGE_HI_KEY of the last step, right up to and including this value.

RANGE_ROWS – This is how many rows in the table fall in this range – not including the number that match the RANGE_HI_KEY itself.

EQ_ROWS – The number of rows equal to the HI_KEY

DISTINCT_RANGE_ROWS – The number of different values in the range that there is data for (excluding the HI_KEY).

AVERAGE_RANGE_ROWS – The average number of rows for a given value within the range.

That’s a whistle-stop tour of the Statistics SQL Server holds on your data.

The algorithms that SQL then uses to calculate the number of rows for a given part of your query are pretty transparent when there’s just one column involved. If we look at the above example and let’s say you wanted to look up the rows where the City is “Abingdon” – the statistics tell us there is 1 matching row and that’s the figure SQL will use for cardinality estimation. Where a value is within a range then it will use a calculation based on the AVERAGE_RANGE_ROWS.

When there’s multiple columns involved it’s more complicated there are various algorithms and assumptions that come into play. If you’re interested in digging deeper, one very good resource is the Whitepaper on the 2014 Cardinality Estimator written by Joe Sack: https://www.sqlskills.com/blogs/joe/optimizing-your-query-plans-with-the-sql-server-2014-cardinality-estimator/

Conclusions

The main takeaway from this should just be the understand the extent – and limitations – of the information about the distribution of your data that SQL holds in the background.

If, when you’re tuning queries, you notice that the estimated row counts don’t match the actual, then this could be encouraging SQL to form a bad plan for the query. In these cases you might want to investigate what’s going on with the statistics.

Maybe your query is written in a way that it can’t use statistics effectively, one example of this can be where you store constant values in variables, then query using that variable in a WHERE clause. SQL will then optimise based on the average, rather than on your actual value.

Maybe the plan is based on one data value that has a very different cardinality to the one currently being queried. For instance when you first run a stored procedure, the plan is formed based on the parameters passed. Those parameters could have a cardinality that is quite different to those used in later executions.

Maybe the statistics are out of date and need refreshing. Stats get updated when approximately 20% of the data in the table changes, for a large table this can be a big threshold, so current statistics may not always hold good information about the data you are querying for.

Or maybe SQL is doing as good a job as it can with the information it has at its disposal. You might need to work out how you can give it a little extra help.

Here are some other recent posts about Statistics you may find useful:

When do Statistics get updated?

Rowcount estimates when there are no Statistics

Indexes and Statistics with Always Encrypted

June 20, 2023June 20, 2023 ~ Matthew McGiffen ~ 1 Comment

In a previous post we looked at executing queries against columns encrypted using Always Encrypted. In this short post we look at the ability to be able to have (or not) indexes on those columns. This information is relevant to the case where you are using Always Encrypted without enclaves, we’ll look at working with enclaves later on.

Where we want to query with a predicate, it is often useful to have an index on the column being searched. You can only have indexes on columns using deterministic encryption – not randomized. This makes sense as a column with randomized encryption is going to have different values stored even when the underlying plaintext value is the same, so an index would be meaningless. You also can’t perform comparisons on such columns so an index wouldn’t be useful. An index on a column deterministically encrypted with Always Encrypted will simply be an index of the encrypted values and as such is only useful for equality comparisons, not for sorting or range queries – though neither of these functions are possible against our deterministically encrypted column anyway.

To create an index on a column using deterministic encryption is the same as if the column is not encrypted. Here is an example using the database and table created in the previous posts Setting Up Always Encrypted and Executing Queries Using Always Encrypted. In our table dbo.EncryptedTable we have the columns LastName (which is encrypted using deterministic encryption), and FirstName (which is encrypted using randomized encryption):

CREATE NONCLUSTERED INDEX IX_LastName ON dbo.EncryptedTable(LastName);

While we can’t index on our columns using randomized encryption, we can include them in the leaf pages of an index so that their values do not have to be retrieved using a Key Lookup operation. So the following SQL would also work:

CREATE NONCLUSTERED INDEX IX_LastName_Include_FirstName ON dbo.EncryptedTable(LastName) INCLUDE(FirstName);

In common with any other index, a statistics object is created that maintains information about the distribution of data in the index. We can view the data held in the statistics object with the following command:

DBCC SHOW_STATISTICS('dbo.EncryptedTable','IX_LastName');

Here we see the output:

We only have a couple of rows in our table, but if you’re familiar with statistics objects, you’ll see this is pretty much the same as we would see where there is no encryption. What is interesting is to look at the last result set. Normally this would show actual data values in the RANGE_HI_KEY column; here we can see that we have an encrypted value. From this, we can understand that the statistics are based on the encrypted values. This underlines the point already made, that our index is an index of the encrypted values of our data. That makes sense as SQL Server is never aware of the unencrypted values, so it wouldn’t be possible for it to create or maintain indexes (or statistics) of the plaintext values.

This is different when using always encrypted with enclaves where indexes on columns with randomized encryption are allowed. In that case, those indexes, and associated statistics objects, are actually based on the plaintext values. As mentioned, we’ll talk about how that works in a later post.

This post is part of a comprehensive series on SQL Server Encryption. Subscribe to my blog for updates as new posts are published or you can buy my book through the link in the sidebar to get it all in one go.

And if you’re embarking on an encryption project (or anything else to do with SQL Server) and want some help, I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.

Under Promise and Overdeliver

June 19, 2023June 19, 2023 ~ Matthew McGiffen ~ Leave a comment

There’s a monthly “blog party” called T-SQL Tuesday that invites blog posts on a given topic related to SQL Server and data. This month Gethyn Ellis has asked people to share the best piece of advice they’ve received in their careers.

For me that’s an easy answer. “Under promise and overdeliver” is a practice that has held me in great stead all through my IT career. I first came across the phrase in my very first development job, working at BT back in the 90s. Our team had a manager, Mike Taylor, who would say this to us on a regular basis.

If you under promise and overdeliver then you will always have happy clients.

Let me give an example. If I’m asked to look at a poorly performing SQL Server instance for a client, and I know that it’s not had much love from an optimization point of view, then I can be pretty confident I can knock the CPU consumption in half and to be honest I’m going to be disappointed if I can’t reduce it by between 75 and 90%. The client however just wants it back running within acceptable parameters. If it’s running close to 100% CPU and causing timeouts in the client, they just want that under control. They will be happy if I get that back down to 70% and the timeouts go away – so that’s the expectation I will set as a minimum target. They will be very satisfied when I achieve that, very happy when I exceed that and get things down to 50% and ecstatic if it goes down to 25% or less – especially once I mention that they have the potential to reduce the size of kit the instance is running on and save a lot of money.

If I start out by saying I’ll reduce it by 75% but only reduce it by half then they will be satisfied, but I’ve given them a disappointing outcome compared to what I promised.

Another example is in timescales when promising delivery of work. It’s usually much better to say that a fix will take a day and deliver it early or on time, and to a high quality, than to say you can knock it out in a couple of hours and then having to keep pushing that back due to unforeseen problems. Worse, when you over-promise, you end up having to spend time on update calls and on resetting expectations, which means the task ends up taking you longer. You are also rushed and it’s much more likely that poor quality will creep in.

Related to timescales, another piece of advice I’ve come across was a rule of thumb for estimation. Basically, when trying to estimate how long a piece of work might take, give it your best estimate, then double it. You might be surprised how often that’s much more accurate than the original figure. Partly because we generally want to please and tend to be over-optimistic when trying to figure out how long something will take us.

We’ve all been on the other side of this. How frustrating is it when you have a problem with a service you use and contact support to be told it will be fixed in a given timeframe and that doesn’t happen. Or that you will be contacted by a certain time, but you don’t get a call or a message. More of your time gets wasted chasing for an update – which also wastes time for the organisation the promise came from. Much better to be given a realistic expectation at the outset – even if that’s disappointing. You can at least then look at how you’re going to work around the issue in the short-term.

It’s sometimes difficult, because you have to be firm about the expectation you set at the outset, but once you have delivered a solution, a fix, or other deliverable that exceeds that expectation, people are always happy and that it what they remember after the event. If you have over-promised, that it also likely to be what they remember.

So that’s the advice I was given and I pass it on to you – “Under promise and overdeliver”.

Got a problem and could do with some assistance? Get in touch today to see how I can help.

What Happens in the Background when Executing Queries with Always Encrypted

June 15, 2023June 15, 2023 ~ Matthew McGiffen ~ 1 Comment

In previous posts we’ve looked at a number of aspects of Always Encrypted and how it works. I think it’s very useful to understand how it actually works in practice, so in this post we’re going to look at what happens in the background when you execute a query.

I’m continuing on from the examples in the last few posts on this subject. So if you want to repeat the examples for yourself you’ll need to have followed steps from Setting Up Always Encrypted and Executing Queries Using Always Encrypted.

To capture what happens in the background we’re going to run an XEvent Profiler trace so we can see what actually gets sent to the database. We start an XEvent Profiler trace from SSMS. You generally find it as the bottom item for your server in the Object Explorer. We’ll use the TSQL option, so right-click on that and select Launch Session:

What Happens with an Insert Query

Once the session has opened, we re-run our insert query from the last post and see what is captured. Here is the SQL:

DECLARE @LastName nvarchar(50) = 'McGiffen'; DECLARE @FirstName nvarchar(50) = 'Matthew'; INSERT INTO dbo.EncryptedTable (LastName, FirstName) VALUES (@LastName, @FirstName);

You can see the results below

We can see there are two queries captured – normally we would just expect to see the one we executed. What we’re seeing here is what was described in What is Always Encrypted and how does it work? when we looked at the steps needed to execute a query over a connection with Always Encrypted enabled. Before executing the query itself, the client must issue a command to retrieve the encryption metadata. It needs to understand if any of the columns targeted by parameters are encrypted, and if so, it needs information about the keys. Let’s look at the query in full that is executed:

exec sp_describe_parameter_encryption N'DECLARE @LastName AS NVARCHAR (50) = @pacc87acf4618488b80bc61f6ac68114f; DECLARE @FirstName AS NVARCHAR (50) = @p4113aa748f2e4ff585556f8eaa618f0d; INSERT INTO dbo.EncryptedTable (LastName, FirstName) VALUES (@LastName, @FirstName); ',N'@pacc87acf4618488b80bc61f6ac68114f nvarchar(50),@p4113aa748f2e4ff585556f8eaa618f0d nvarchar(50)'

Retrieving the encryption metadata is achieved by this call to the sp_describe_parameter_encryption stored procedure. SSMS has parameterized our query and then sent that text over to be parsed by the engine. We can execute the same query call to sp_describe_parameter_encryption ourselves and see what is returned. We get two result sets. There’s a fair amount of information returned so I’ll display them as tables rather than screenshots to make things more readable. The first result set gives us information about any Always Encrypted keys we will need to execute our query. In this case there is only one key involved in our query so we just get one row returned, but if there were multiple keys required, we would get a row for each.

There are a few key columns here:

column_encryption_key_ordinal is an assigned ordinal value used to identify the key and is useful when multiple keys are involved in the query.
column_encryption_key_encrypted_value is the encrypted value of the CEK as stored in the database.
column_master_key_store_provider_name tells the client where the CMK is stored.
column_master_key_path identifies the path to the actual certificate used as the CMK.

Armed with this information, the client driver can obtain the local copy of the CMK and use that to decrypt the encrypted CEK value to obtain the actual encryption key.

The second result set gives details of which parameters target encrypted columns:

Here we have two parameters as we are inserting a value into each of the two columns. The results are fairly clear, but let’s just go over a few of the values and what they mean:

parameter_ordinal is an ordinal value for the parameter.
parameter_name is the name of the parameter as supplied to the sp_describe_parameter_encryption stored procedure.
column_encryption_type describes whether encryption is deterministic or randomized. 1 means randomized and 2 means deterministic. You might also see the value of 0 which means plaintext where a parameter targets an unencrypted column.
column_encryption_key_ordinal denotes which key in the first result set (where there are more than one) should be used.

Armed with all this information and with the CEK now decrypted, the client driver can encrypt the parameters as required, and finally issue the query to SQL Server. This is the second query we saw in our XEvent Profiler capture:

exec sp_executesql N'DECLARE @LastName AS NVARCHAR (50) = @pacc87acf4618488b80bc61f6ac68114f; DECLARE @FirstName AS NVARCHAR (50) = @p4113aa748f2e4ff585556f8eaa618f0d; INSERT INTO dbo.EncryptedTable (LastName, FirstName) VALUES (@LastName, @FirstName); ',N'@pacc87acf4618488b80bc61f6ac68114f nvarchar(50),@p4113aa748f2e4ff585556f8eaa618f0d nvarchar(50)',@pacc87acf4618488b80bc61f6ac68114f=0x01F82323F52B604A838ABC880ECDEB6CDD26ED47813F507A2EAA78FA1EE10FF47B2ED7C73C1A76580B6C0753A95DF5C944C5E590C2ED7E0AF59F1B4054317018584A9B8E3B4B0D9C4341B32DE2990E22C1,@p4113aa748f2e4ff585556f8eaa618f0d=0x012AC8899AACB8F1DDCEF4F6B2EB090F5E56687FDBB67D237E0E3D6D91C7F96C29F39396C633FB27DD92C7F2FABC18600D154FE1D426000CDB401ECD8BFD04AAC3

Here the parameterized version of our query has been issued to SQL Server with the plaintext parameters encrypted ready for insertion into the table. The long binary values at the end of the query represent the values “Matthew” and “McGiffen” encrypted using the correct CEK.

Before we move on to other queries, let’s pause and think about what this means for us in practice. The key thing to understand is that this call to sp_describe_parameter_encryption and the encryption of our parameters within the client driver is core to how Always Encrypted works. Plaintext values that target encrypted columns in the database must be sent via parameters. If your application only interacts through the database via parameterized queries – which is good practice – then you are in a good starting place in terms of implementing Always Encrypted. Otherwise, you are likely to have significant refactoring to do.

Another item worth discussing is the extra call that executing your query entails. It might seem this is going to add performance overhead. In reality the call is very lightweight with CPU consumption of less than a millisecond. On top of that, the results get cached locally so you don’t often see repeated calls to get the metadata for the same query. As a consequence of those two points, the performance overhead is generally negligible.

What Happens with a Select Query

Let’s repeat the exercise above and look at what happens when we execute our select query. We’ll start a new XE Profiler session and look at what is captured when we run the following select query:

SELECT * FROM dbo.EncryptedTable;

Here we see the XE Profiler output:

In this case we’re not seeing the extra call to get the metadata which can seem slightly confusing. Surely the client driver needs that information to be able to decrypt the encrypt columns that are returned. The reality is that that metadata does get sent back, but it doesn’t need to be asked for. The fact that our connection has encryption enabled means that SQL Server knows it needs to supply that information back to the client, so it does so automatically. In the former case with an insert, we needed the metadata before we could issue the query which is what necessitated the extra call.

For the select query, the results are returned to the client encrypted, along with the metadata. The client driver then decrypts any columns if required to do so before passing the results back to the client application.

Summary

We can see it is a fairly simple mechanism that supports executing queries against encrypted columns. A combination of automated actions in the client driver with calls to sp_describe_parameter_encryption to retrieve encryption metadata – such as which columns are encrypted, the encrypted Column Encryption Key (CEK) and the location of the Column Master Key (CMK).

It’s not strictly necessary to understand this – but I think it really helps in making clear why there are restrictions in place on how you can interact with encrypted columns, as well as to understand why certain things you may want to attempt will or wll not work.

Avoid “Constants” in T-SQL

June 14, 2023June 13, 2023 ~ Matthew McGiffen ~ 3 Comments

Unlike some other languages, T-SQL doesn’t have the concept of a constant.

As good coders, we’ve all at some point tried to use a SQL variable to hold a constant value within a stored procedure, in order to make our code both more readable and maintainable.

I’ll give you an example. Here I’m querying a Task table, and I want all the rows where the TaskStatus is 0 (zero) which means the Task is Open:

SELECT *
FROM dbo.Task 
WHERE TaskStatus = 0;

If someone else comes along to look at this code they don’t know what the value of zero means. My code’s not clear. Worse I might refer to the same value multiple times in my procedure, so if I need to change it later I have to change it in multiple places.

Good practice from other coding languages would say that I replace it with a meaningfully named constant. As mentioned, in T-SQL we don’t have constants so I’ll compromise and use a variable:

DECLARE @OpenTask tinyint = 0;

SELECT *
FROM dbo.Task 
WHERE  TaskStatus = @OpenTask;

Now that’s much more readable – right?

Unfortunately it’s also a bad idea in SQL . Let’s see why.

I’ll create the aforementioned Task table, add an index on TaskStatus. Then I’ll add million rows with status 1 (which we’ll call closed) and 1 row with the value 0 (zero) which is open:

CREATE TABLE dbo.Task
(
    Id INT IDENTITY(1,1) CONSTRAINT PK_Task PRIMARY KEY CLUSTERED,
    UserId INT,
    TaskType INT,
    Payload VARCHAR(255) NOT NULL,
    TaskStatus tinyint NOT NULL
);
GO

CREATE INDEX IX_Task_TaskStatus ON dbo.Task(TaskStatus);

INSERT INTO dbo.Task (UserId,TaskType,Payload,TaskStatus)
SELECT TOP 1000000 1,1,'This Shizzle Is Done',1
FROM sys.objects a, sys.objects b, sys.objects c;

INSERT INTO dbo.Task (UserId,TaskType,Payload,TaskStatus)
SELECT 1,1,'Do This Shizzle',0;

Once that’s completed I’m going to update the statistics just so we know SQL has the most up to date information to produce an optimal execution plan for our queries:

UPDATE STATISTICS dbo.Task WITH fullscan;

Now let’s go back to our original queries. Before I run them let’s think what we want them to do. We have an index on TaskStatus and we only have one row we are looking for, so we’d hope the query will use the index and go straight to the record. The index doesn’t contain all the columns, but that’s okay. We’re only going to have to output one record so if it has to look up the extra columns up in the clustered index that’ll be pretty damn quick.

Let’s run the first query, we’ll capture the execution plan and the STATISTICS output:

SET STATISTICS io ON;
SET STATISTICS time ON;

SELECT *
FROM dbo.Task 
WHERE TaskStatus = 0;

Here’s the execution plan:

That’s doing exactly what we hoped for, it’s looked up the record in our index using a seek. Then it’s grabbed the rest of the columns from the clustered index using a key lookup.

Here’s the statistics output:
Table ‘Task’. Scan count 1, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.

So that’s a nice small number of reads and less than a millisecond of CPU.

Now let’s run the “improved” version:

SET STATISTICS io ON;
SET STATISTICS time ON;

DECLARE @OpenTask tinyint = 0;

SELECT *
FROM dbo.Task 
WHERE TaskStatus = @OpenTask;

Here’s the execution plan this time:

That doesn’t look so good. Let’s check the statistics:
Table ‘Task’. Scan count 1, logical reads 5341, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

CPU time = 109 ms, elapsed time = 96 ms.

Those figures tell us the query has got between 100 and 1,000 times worse. So much for the improved version.

So why is this happening?

The answer is simply that the optimizer doesn’t/can’t look at the values inside variables when a piece of SQL is compiled. Therefore it can’t use the statistics against the indexes on the table to get an accurate idea of how many rows to expect back in the results.

We can see that if we compare the properties of the Index Seek Operator from the first query:

Against the properties for the Index Scan Operator from the second query:

In the first one we can see that the Actual Number of Rows (at the top) exactly matches the Estimated Number of rows (at the bottom). SQL has been able to use the statistics on the index to get an accurate estimate.

In the second this is not the case. We have 500,000 rows estimate, but only 1 actual. This has led SQL down the route of choosing a plan that would have been more effective for 500,000 rows – but is much less effective for 1. In this case it didn’t know what value to optimize for. Lacking that information it used the density value in the statistics and multiplied that by the total number of rows to get the estimate. Or in other words, the statistics tell it that there are two distinct values (0 and 1) in the table. Not knowing which one has been supplied the optimizer figures than on average half the rows will be returned.

So what should do you to make your code clearer?

The simple answer is to use comments, the following is totally clear to its meaning, and will perform optimally:

SELECT *
FROM dbo.Task 
WHERE TaskStatus = 0 -- Open Task;

But what about the maintainability issue, where you may have to refer to the same value multiple times in a given procedure?

Unfortunately you’re just going to have to put up with maintaining the value in multiple places, but in general within a well designed application these should be static values (and hopefully integers) so it shouldn’t be too big a deal.

Note this is not the same for parameters passed to a stored procedure. In that case the queries inside the stored procedure will compile using the values passed the first time the procedure was executed – that can be really useful, but it can also cause its own set of issues to be aware of! Just remember parameters and variables – similar but not the same!

Need Help?

Got a problem and could do with some assistance? Get in touch today to see how I can help.

Executing Queries Using Always Encrypted

June 13, 2023June 19, 2023 ~ Matthew McGiffen ~ 5 Comments

In this post we’ll look at how you interact with data that is encrypted using Always Encrypted. The examples here will show how you run queries from SSMS, in later posts we’ll look at stored procedures and application code – as well as what happens in the background when you execute a query.

For these examples we’re going to be using the database we created in the last post.

Performing a Basic Insert and Select

First we’ll look at how you should connect to a database when working with Always Encrypted. We’ll then look at the querying patterns you should use when working with encrypted columns.

Connecting to the Database

The first thing we need to do is to make sure we are using a connection to the database that has Always Encrypted enabled. You will want to be using a recent version of SSMS – at least version 18.

From SSMS you connect to the database engine as normal, and then in the Connect to Server dialog, you need to select the Options button. That then displays multiple tabs including one for Always Encrypted which we can see here:

You need to tick the Enable Always Encrypted box and then you can connect as normal. You will also notice the Enclave Attestation URL. We leave that blank for now, but we’ll use it when we look at Always Encrypted with Enclaves later.

Once you are connected, open a new query window. Right-click over the query window and select Query Options. Select the Advanced tab on the left – you should then see the GUI shown below. Right at the bottom of the available settings is Enable Parameterization for Always Encrypted; make sure this is ticked, and then click OK.

This setting tells SSMS that when you execute a query; the query should be converted to a parameterized query. Any variables used in the query will be converted to parameters. Those variables that target Always Encrypted columns will end up having their values encrypted before being sent to the database for execution.

If you forget to enable this, then when you attempt to issue a query on an encrypted connection, SSMS usually prompts to ask if you want to enable it.

Inserting Data

We’re now ready to try inserting some data. First, we’ll attempt something that will fail as it is an instructive example. Make sure your query window is set to the context of the TestAlwaysEncrypted database; then you can attempt to execute the following insert:

INSERT INTO dbo.EncryptedTable (LastName, FirstName) VALUES ('McGiffen', 'Matthew');

You will find you get an error similar to this:

Msg 206, Level 16, State 2, Line 1 Operand type clash: varchar is incompatible with varchar(8000) encrypted with (encryption_type = 'DETERMINISTIC', encryption_algorithm_name = 'AEAD_AES_256_CBC_HMAC_SHA_256', column_encryption_key_name = 'TestCEK', column_encryption_key_database_name = 'TestAlwaysEncrypted') collation_name = 'Latin1_General_CI_AS'

What the error message is telling you, in perhaps not very plain language, is that you are attempting to insert an unencrypted value into an encrypted column. The plaintext values must be encrypted before you attempt to insert them.

In order to make that happen, we have to use parameters. Try the following instead:

DECLARE @LastName nvarchar(50) = 'McGiffen'; DECLARE @FirstName nvarchar(50) = 'Matthew'; INSERT INTO dbo.EncryptedTable (LastName, FirstName) VALUES (@LastName, @FirstName);

Below is a screenshot of what should happen in your query window. You should see your variable declarations get a wavy blue underline. This denotes that the variable will be treated as a parameter. If you hover your mouse over one of the variables, you will see a pop-up message to that effect. SSMS will also parameterize your query before attempting to execute it. We’ll see what that means in the next post.

If you don’t see the wavy blue underline, then it is likely you haven’t enabled Parameterization for Always Encrypted. That will be confirmed if you execute the query and get the following error:

Encryption scheme mismatch for columns/variables '@LastName'. The encryption scheme for the columns/variables is (encryption_type = 'PLAINTEXT') and the expression near line '4' expects it to be DETERMINISTIC, or PLAINTEXT.

If all is configured correctly, when you execute the query, you will not receive an error and the row will be successfully inserted.

As a quick aside I want to show you something else that will fail as it demonstrates something that is important to understand. Let’s try a version of the insert query again with this SQL:

DECLARE @LastName nvarchar(100) = 'McGiffen'; DECLARE @FirstName nvarchar(100) = 'Matthew'; INSERT INTO dbo.EncryptedTable (LastName, FirstName) VALUES (@LastName, @FirstName);

The only difference here is that I’ve been a little bit lazy about the size of my variables. I’ve set them to nvarchar(100) where the column in the table is nvarchar(50). Normally this wouldn’t be a problem as long as I didn’t actually supply values longer than the column can hold. In the case of Always Encrypted, however, we get an error similar to the following when we execute this.

Msg 206, Level 16, State 2, Line 11 Operand type clash: nvarchar(100) encrypted with (encryption_type ='DETERMINISTIC', encryption_algorithm_name = 'AEAD_AES_256_CBC_HMAC_SHA_256', column_encryption_key_name = 'TestCEK', column_encryption_key_database_name = 'TestAlwaysEncrypted') is incompatible with nvarchar(50) encrypted with (encryption_type = 'DETERMINISTIC', encryption_algorithm_name ='AEAD_AES_256_CBC_HMAC_SHA_256', column_encryption_key_name = 'TestCEK', column_encryption_key_database_name = 'TestAlwaysEncrypted') Msg 8180, Level 16, State 1, Procedure sp_describe_parameter_encryption, Line 1 [Batch Start Line 6]

When working with Always Encrypted and parameters, your parameters must be strongly typed, the type and size used must match what is in the table. Implicit conversions are not allowed. You also cannot insert a previously encrypted value into the table; that will fail with the same sort of error. That prevents someone from attempting to replace an encrypted value with one from another record.

Modifying data with an UPDATE statement works exactly the same way. You must use a parameter for a new value that you want to be stored in the table. One interesting point to note is that if you update multiple rows at once to the same value from a parameter, then the rows will end up with the same encrypted value even when randomized encryption is being used. This is because the parameter only gets encrypted using randomized encryption once.

Reading Data

We can check the outcome of our insert with a simple select query as follows:

SELECT * FROM dbo.EncryptedTable;

Here are the results:

It’s logical at this stage to ask yourself how you know if the data actually got encrypted or not. As our connection has Always Encrypted enabled, the .NET library used by SSMS has automatically decrypted the results before sending them back to the application to be displayed. We can see the actual values stored in the database by opening a new query window and changing the connection. We keep the connection to the same server but go into the options and disable the “Enable Always Encrypted” option. Then we can execute the same select query again. This time we get results like these:

Here we see what is actually stored in the database and can see that our data has been encrypted.

Issuing a Query with a Predicate Against an Encrypted Column

So far we’ve just looked at a very simple select statement that returns all the results in the table. What if we want to search for a particular record? If we want to search based on the value in an unencrypted column – for instance, using the Id column our test table – this works exactly the same as if you didn’t have encryption. Decryption of the results happens automatically in the client driver, and functionality of our query is unaffected. What if we want to search for a particular value in an encrypted column – can we do that? The answer is that it depends; so let’s try it. First, let’s look at what is obviously not going to work. We’ll attempt to execute this SQL:

SELECT * FROM dbo.EncryptedTable WHERE LastName = 'McGiffen';

This fails with an error similar to the following:

What the error is telling us is that we are trying to compare the value in an encrypted column with a plaintext value which is a nonsense thing to try. Rather than just returning us zero results as there is no match, SQL returns us an error to point out that we are not going about things in the right way. If you’ve been following so far, it’s not going to surprise you that we need to use a parameter – as is done in this SQL:

DECLARE @LastName nvarchar(50) = 'McGiffen'; SELECT * FROM dbo.EncryptedTable WHERE LastName = @Lastname;

This time we get some results:

The LastName column in our table is encrypted with deterministic encryption. The FirstName column however is encrypted using randomized encryption. Let’s see what happens if we try to use the following SQL to query against the FirstName column using the same pattern:

DECLARE @FirstName nvarchar(50) = 'Matthew'; SELECT * FROM dbo.EncryptedTable WHERE FirstName = @FirstName;

In this case we receive an error:

Msg 33277, Level 16, State 2, Line 11 Encryption scheme mismatch for columns/variables 'FirstName', '@FirstName'. The encryption scheme for the columns/variables is (encryption_type = 'RANDOMIZED', encryption_algorithm_name = 'AEAD_AES_256_CBC_HMAC_SHA_256', column_encryption_key_name = 'TestCEK', column_encryption_key_database_name = 'TestAlwaysEncrypted') and the expression near line '5' expects it to be'DETERMINISTIC, or RANDOMIZED, a BIN2 col'ation for string data types, and an enclave-enabled column encryption key, or PLAINTEXT. Msg 8180, Level 16, State 1, Procedure sp_describe_parameter_encryption, Line 1 [Batch Start Line 6] Statement(s) could not be prepared. An error occurred while executing batch. Error message is: Internal error. Metadata for parameter '@pfe3eb6b759d54cdd8c38d6b5e8709633' in statement or procedure 'DECLARE @FirstName AS NVARCHAR (50) = @pfe3eb6b759d54cdd8c38d6b5e8709633; SELECT * FROM dbo.EncryptedTable WHERE FirstName = @FirstName; ' is missing in resultset returned by sp_describe_parameter_encryption.

It’s a long error, but the short translation is that we are trying to compare data in a parameter with data in a column that is using randomized encryption, and that is not possible. The error messages you receive when attempting to query Always Encrypted data are not always very clear, so it helps to know the sort of things that are likely to go wrong. Combining that knowledge with the information returned in the error – in this case it states that the encryption for the column is randomized – will help you to quickly understand what the issue is.

Let’s look at another query pattern you might attempt. This SQL specifies my surname in capitals:

DECLARE @Lastname nvarchar(50) = 'MCGIFFEN'; SELECT * FROM dbo.EncryptedTable WHERE LastName = @Lastname; GO

This is very similar to the first query that executed fine; however, in this case we get no results. Remember we saw that encrypted columns must use BIN2 collations which are case-sensitive. As my @LastName parameter doesn’t match the case of the values in the LastName column for the records in my table, SQL correctly returns no results. This can be problematic if you want to implement features like searching against encrypted columns where the search term may come from user input and users are used to not having to specify the correct case. There are ways you can work around that; I’ll mention one possibility – using a separate search catalog – in a later post.

Another pattern that will simply not work is the use of LIKE queries. The reason is that “Matthew” may be like “Matt” but the same cannot be said of their respective encrypted values. If we used a form of encryption where that was true, then it would end up being relatively weak and open to attack. If you attempt a LIKE query, you will simply get an error similar to the previous ones.

Summary

We can see that working with encrypted columns with Always Encrypted is reasonably straightforward, mainly we just have to understand what is possible and logical and what is not. It’s useful to be familiar with the sorts of errors you can see so you can quickly diagnose the problem when you have done something wrong.

Success vs. Scalability: Why your biggest success might also be your biggest disaster

June 12, 2023June 12, 2023 ~ Matthew McGiffen ~ Leave a comment

There is a scenario I have seen again and again. A company has developed a nice piece of software which many clients are running with happily. Then along comes a new client, the biggest client to date by far. Celebrations abound, plans are made to expand the business and the execs are wringing their hands with glee at the anticipated new revenue.

But then soon after go-live things start to go wrong. The application just doesn’t cope well with what are unprecedented loads. The database is under strain and even throwing more power at at – where that is an option – isn’t helping as much as would be hoped.

This is often where I get called in. Scalability of database applications is a real issue. Without proactive performance tuning and optimization, most systems will reach a tipping point once a certain load, or size of data is reached.

On the application side it’s often easy, you can scale-out and simply add more application servers. On the database end however, you often have no choice to rely on a single database instance. This is often a point that is used to criticize traditional OLTP databases and people will say you should use NoSQL or rely on things such sharding and eventual consistency. There is definitely a place for those things, but at the same time, speaking mainly from my experience working with SQL Server, there is no realistic load it can’t take as long as the application is designed well, and the database and server are properly tuned.

So how do you identify if you’re going to have a problem before you end up in the middle of a client emergency?

Load testing is one answer. I’ve worked with product teams in this exact circumstance, testing to see if the application and database will scale to the requirements of a new customer. It is hard though. It’s hard to accurately simulate a profile of usage that anticipates how a given client – or all clients – may use your system, and it’s a lot of effort.

The good news though is that you can identify most potential issues within database applications before they become critical ones. Often it’s clearcut; a query is not going to scale well unless it has better indexing to support it, a particular piece of code is going to create a lot of locking as data sizes grow, parallelism settings are not optimal and won’t support optimum performance at scale, the list goes on.

Even when it’s not clear, there’s usually enough information to raise suspicion. Then you can at least target load testing to those specific functions and see how they cope under strain.

This scenario isn’t just limited to new customers. Existing customers can grow over time and usually their databases certainly do. Unless you’re pro-active in identifying those potential issues early then you will reach performance tipping points where clients end up raising Sev 1 issues.

It’s often said that the best DBAs go un-noticed because this is their approach. They’ve identified and solved issues ahead of time before anyone else was even aware of them. The best performance tuning activities are invisible in terms of negative impact, and it’s always great to be able to tell clients you’ve made things better without being asked.

So, have you got a big new client just around the corner, or are your existing clients (or their databases) growing, and you haven’t done any tuning recently (or at all)?

Get in touch and I can perform a Comprehensive SQL Server Healthcheck for you. We’ll identify these sorts of issues, as well as any other hidden dangers that might be lurking around the corner ready to bite. We’ll discuss solutions and I’ll work with your team so they understand the issues in detail. Or if you’re have a need for any other SQL Server consultancy then check out my services page or give me a yell.

Parameter Sensitive Plan Optimization in SQL Server 2022. What is it and is it any good?

June 8, 2023June 15, 2023 ~ Matthew McGiffen ~ Leave a comment

What is it?

Parameter Sensitive Plan (PSP) optimization is a new feature in SQL Server 2022 that aims to improve the performance of parameterized queries. It is part of the Intelligent Query Processing family of features and addresses the scenario where a single cached plan for a parameterized query isn’t optimal for all possible incoming parameter values. This is the case with non-uniform data distributions.

PSP optimization automatically enables multiple, active cached plans for a single parameterized statement. Cached execution plans will accommodate different data sizes based on the actual provided runtime parameter value(s).

During the initial compilation, column statistics histograms identify non-uniform distributions and evaluate the most at-risk parameterized predicates, up to three out of all available predicates. In other words, if multiple predicates within the same query meet the criteria, PSP optimization chooses the top three.

For eligible plans, the initial compilation produces a dispatcher plan that contains the PSP optimization logic called a dispatcher expression. A dispatcher plan maps to query variants based on the cardinality range boundary values predicates.

Here’s an example to illustrate how PSP optimization works:

Let’s say we have a table Orders with columns OrderID, CustomerID, OrderDate, and TotalAmount. The table has a non-uniform distribution of data, with some customers having many orders while others have only a few.

We have a stored procedure that retrieves all orders for a given customer:

CREATE PROCEDURE dbo.GetCustomerOrders (@CustomerID int)
AS
BEGIN
   SELECT * FROM dbo.Orders WHERE CustomerID = @CustomerID;
END

Without PSP optimization, SQL Server would generate a single execution plan for this query, based on the parameters supplied the first time it is called, and cache it for reuse. However, this plan may not be optimal for all possible values of @CustomerID.

With PSP optimization enabled, SQL Server will generate multiple execution plans for this query – one for low cardinality ranges, one for medium cardinality ranges, and one for high cardinality ranges. At runtime, SQL Server will choose the optimal execution plan based on the value of @CustomerID.

For example, if @CustomerID corresponds to a customer with many orders (high cardinality), SQL Server will choose the high cardinality range execution plan. If @CustomerID corresponds to a customer with few orders (low cardinality), SQL Server will choose the low cardinality range execution plan.

This allows SQL Server to determine the optimal execution plan based on the parameter value(s), improving query performance.

PSP aims to reduce parameter sniffing, a situation where SQL Server generates an execution plan based on the first parameter value it encounters and then reuses that plan for all subsequent parameter values. This can lead to suboptimal performance for some parameter values.

In SQL Server 2022, PSP optimization is automatically enabled by default. However, it can be disabled at the database or query level using the following T-SQL statement:

ALTER DATABASE [database_name] SET PARAMETER_SENSITIVE_PLAN_OPTIMIZATION = OFF;

Is it any good?

The biggest problem in terms of fixing performance is that cardinality estimation is based on statistics objects which are effectively a histogram (with up to 200 steps) of data distribution across a certain column. In the example above, if we have statistics on CustomerID then we may have accurate figures if we have less than 200 customers, but if we have a lot more then they will be grouped in ranges and we may have a lot of data skew within a range, e.g. CustomerID 1 may have 10 orders where CustomerID 2 has 10,000. That means PSP optimization may still not always choose the correct plan.

Also, there are lots of parameter sniffing scenarios which can depend on a lot more than 3 parameters – such as “kitchen-sink” search procedures where you can search data based on one or more of many different parameters.

That said, there will be many scenarios where it does help – it’s just not a magic bullet to fix parameter sniffing.

Brent Ozar talks about another issue in his post How SQL Server 2022 Tries to Fix Parameter Sniffing. That is the issue around monitoring and performance troubleshooting. When you try to view the execution plan for a stored procedure where the optimizer has decided to use PSP optimization, you only see the dispatcher plan, not the actual plans that will be used in practice. Also, should your monitoring or performance troubleshooting tools flag up a query that is causing problems, where that query is using PSP optimization, it is difficult to link that query back to the stored procedure it belongs to. Not impossible, but it is a pain.

I attended a talk by Conor Cunningham, Principal Software Architect at Microsoft on SQL Server, a few years back at SQL Bits, where he was presenting the original set of features collectively known as Intelligent Query Processing. He pointed out that a lot of people running SQL Server, especially on Azure with Azure SQL Database and Azure SQL Managed Instance, don’t have SQL Server performance tuning expertise in house and these sorts of features were targeted at these people most of all. Features that will not solve all performance issues of a given type, but will certainly help in a number of circumstances.

From that point of view PSP optimization is a good addition to the set of features, even if it doesn’t do as good a job as someone with real expertise might, and even if it does come with it’s own problems.

Identify the (Top 20) most expensive queries across your SQL Server using Query Store

June 7, 2023June 15, 2023 ~ Matthew McGiffen ~ 2 Comments

I’m a big fan of using queries based on the dynamic management view sys.dm_exec_query_stats to identify the most resource hungry queries across a SQL instance. There are lots of versions of this sort of query around the place if you google for “Top 20 queries”.

That approach has some limitations though. First, it is cleared out every time an instance restarts, and second it only keeps figures for currently cached plans, so when a query recompiles, data is lost.

The DMVs provided by SQL Server Query Store solve these issues, as data is persisted in the database over time, so nothing is lost on restarts etc. And one extra thing you gain by using the Query Store DMVs is that you can slice by time interval, for instance if you want to look at the before and after states relating to a change that has been made – or you want to look at an interval where performance degradation has been reported.

Some time ago I wrote a query store version of the “Top 20 queries” query that will produce a ranked list of your most expensive queries – and I’ve ended up using this a lot.

The only downside of using the DMVs for Query Store is that they are per database whereas dm_exec_query_stats is a view across the whole instance. So I had to use a cursor and a temp table, populating the temp table for each database in turn.

Here’s the query:

--Gather and report on most resource hungry queries
DECLARE @Reportinginterval int;
DECLARE @Database sysname;
DECLARE @StartDateText varchar(30);
DECLARE @TotalExecutions decimal(20,3);
DECLARE @TotalDuration decimal(20,3);
DECLARE @TotalCPU decimal(20,3);
DECLARE @TotalLogicalReads decimal(20,3);
DECLARE @SQL varchar(MAX);

--Set Reporting interval in days
SET @Reportinginterval = 1;

SET @StartDateText = CAST(DATEADD(DAY, -@Reportinginterval, GETUTCDATE()) AS varchar(30));

--Cursor to step through the databases
DECLARE curDatabases CURSOR FAST_FORWARD FOR
SELECT [name]
FROM sys.databases 
WHERE is_query_store_on = 1;

--Temp table to store the results
DROP TABLE IF EXISTS #Stats;
CREATE TABLE #Stats (
   DatabaseName sysname,
   SchemaName sysname NULL,
   ObjectName sysname NULL,
   QueryText varchar(1000),
   TotalExecutions bigint,
   TotalDuration decimal(20,3),
   TotalCPU decimal(20,3),
   TotalLogicalReads bigint
);

OPEN curDatabases;
FETCH NEXT FROM curDatabases INTO @Database;

--Loop through the datbases and gather the stats
WHILE @@FETCH_STATUS = 0
BEGIN
    
    SET @SQL = '
	   USE [' + @Database + ']
	   INSERT intO #Stats
	   SELECT 
		  DB_NAME(),
		  s.name AS SchemaName,
		  o.name AS ObjectName,
		  SUBSTRING(t.query_sql_text,1,1000) AS QueryText,
		  SUM(rs.count_executions) AS TotalExecutions,
		  SUM(rs.avg_duration * rs.count_executions) AS TotalDuration,
		  SUM(rs.avg_cpu_time * rs.count_executions) AS TotalCPU,
		  SUM(rs.avg_logical_io_reads * rs.count_executions) AS TotalLogicalReads
	   FROM sys.query_store_query q
	   INNER JOIN sys.query_store_query_text t
		  ON q.query_text_id = t.query_text_id
	   INNER JOIN sys.query_store_plan p
		  ON q.query_id = p.query_id
	   INNER JOIN sys.query_store_runtime_stats rs
		  ON p.plan_id = rs.plan_id
	   INNER JOIN sys.query_store_runtime_stats_interval rsi
		  ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
	   LEFT JOIN sys.objects o
		  ON q.OBJECT_ID = o.OBJECT_ID
	   LEFT JOIN sys.schemas s
		  ON o.schema_id = s.schema_id     
	   WHERE rsi.start_time > ''' + @StartDateText + '''
	   GROUP BY s.name, o.name, SUBSTRING(t.query_sql_text,1,1000)
	   OPTION(RECOMPILE);';

    EXEC (@SQL);

    FETCH NEXT FROM curDatabases INTO @Database;
END;

CLOSE curDatabases;
DEALLOCATE curDatabases;

--Aggregate some totals
SELECT 
    @TotalExecutions = SUM(TotalExecutions),
    @TotalDuration = SUM (TotalDuration),
    @TotalCPU  = SUM(TotalCPU),
    @TotalLogicalReads = SUM(TotalLogicalReads)
FROM #Stats

--Produce output
SELECT TOP 20
    DatabaseName,
    SchemaName,
    ObjectName,
    QueryText,
    TotalExecutions,
    CAST((TotalExecutions/@TotalExecutions)*100 AS decimal(5,2)) AS [TotalExecutions %],
    CAST(TotalDuration/1000000 AS decimal(19,2)) AS [TotalDuration(s)],
    CAST((TotalDuration/@TotalDuration)*100 AS decimal(5,2)) AS [TotalDuration %],
    CAST((TotalDuration/TotalExecutions)/1000 AS decimal(19,2)) AS [AverageDuration(ms)],
    CAST(TotalCPU/1000000  AS decimal(19,2)) [TotalCPU(s)],
    CAST((TotalCPU/@TotalCPU)*100 AS decimal(5,2)) AS [TotalCPU %],
    CAST((TotalCPU/TotalExecutions)/1000 AS decimal(19,2)) AS [AverageCPU(ms)],   
    TotalLogicalReads,
    CAST((TotalLogicalReads/@TotalLogicalReads)*100 AS decimal(5,2)) AS [TotalLogicalReads %],
  CAST((TotalLogicalReads/TotalExecutions) AS decimal(19,2)) AS [AverageLogicalReads]   
FROM #Stats
--Order by the resource you're most interested in

--ORDER BY TotalExecutions DESC
--ORDER BY TotalDuration DESC
ORDER BY TotalCPU DESC
--ORDER BY TotalLogicalReads DESC

DROP TABLE #Stats;

The script limits itself to looking at databases where query store is enabled.

If you want to bring back more results you can just change the TOP statement, and if you want to look at the results ordered by a different resource (e.g. Reads) then just make sure the relevant ORDER BY clause is uncommented. With other small modifications I find this script useful in a myriad of scenarios. I hope you find it useful too.

How does Query Store capture cross database queries?

Introduction to SQL Server Query Store

Matthew McGiffen Data

Month: June 2023

When do Statistics get updated?

What are Statistics in SQL Server?

Cardinality Estimation

Statistics

Stats Header

Recordset Number 2: The Density Vector

Recordset Number 3: The Histogram

Conclusions

Indexes and Statistics with Always Encrypted

Under Promise and Overdeliver

What Happens in the Background when Executing Queries with Always Encrypted

What Happens with an Insert Query

What Happens with a Select Query

Summary

Avoid “Constants” in T-SQL

So why is this happening?

So what should do you to make your code clearer?

But what about the maintainability issue, where you may have to refer to the same value multiple times in a given procedure?

Need Help?

Executing Queries Using Always Encrypted

Performing a Basic Insert and Select

Connecting to the Database

Inserting Data

Reading Data

Issuing a Query with a Predicate Against an Encrypted Column

Summary

Success vs. Scalability: Why your biggest success might also be your biggest disaster

Parameter Sensitive Plan Optimization in SQL Server 2022. What is it and is it any good?

What is it?

Is it any good?

Identify the (Top 20) most expensive queries across your SQL Server using Query Store