Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deep understanding of how the SQL Server Optimizer interacts with Statistics really helps when you are performance tuning.
One thing that can be useful when looking at an execution plan is to understand what statistics objects the optimizer used to come up with the plan. In this post we look at how that can be achieved using the undocumented traceflag 8666, which can be used to save internal debugging information into the plan XML – including details of the Statistics objects used.
There are actually a few other trace flags that do similar things but this seems the simplest and the one that works across the most versions of SQL Server. It looks like it was introduced in SQL 2008 and works on all versions up to and including SQL 2016 – but it doesn’t work on SQL 2005.
Note that this technique only works for plans generated once the trace flag is on, so you can’t view the additional information for existing plans in your cache:
Here’s a quick example using it against the AdventureWorks2012 database. You need to make sure you’ve selected the option to show the actual plan, then run the query as below:
--Turn the trace flag on
DBCC traceon (8666);
--Run the query
SELECT *
FROM Person.Person
WHERE LastName LIKE 'Smith%';
--Turn the trace flag off again
DBCC traceoff (8666);
This query produces the following plan:
If you right-click over the SELECT operator and select properties you see the following:
You’ll notice an extra field “Internal Debugging Information” which contains a bunch of unformatted XML. Rather than grabbing that string and formatting it to make it readable, you can right-click back on the plan itself and select “Show Execution Plan XML…” to view the same information in a more friendly format.
If you scroll down through the XML generated you will get to the following section:
Each ModTrackingInfo node displays the information about one statistics object that has been referenced. The wszStatName shows us the name of the Statistics object. We have two in this case, _WA_Sys_00000007_693CA210 and IX_Person_LastName_FirstName_MiddleName. The first is an auto-generated one on the LastName column (I happen to know this was generated before I created the index referenced). Interestingly SQL has looked at both – even though they contain pretty much the same info and one isn’t really required.
Why is this useful?
Well, apart from simply the interesting aspect of seeing some of what is going on in the background when SQL decides what plan to use for your query, it is often the case with poorly performing queries that there is bad cardinality estimation going on. Sometimes this might be because the statistics aren’t accurate and would benefit from being updated. You can use this technique to see what statistics objects are being used for the estimation (where it is not just obvious) and then you can look at the objects themselves and see whether they might want refreshing, or whether there may be other issues with your data – such as a skewed distribution – that make estimates difficult.
I see growing opinion in the SQL world that the refreshing of statistics is often more important than rebuilding indexes to reduce fragmentation – the latter operation will also refresh the statistics but is a lot more resource intensive. You also will commonly have statistics on columns that are not indexed, and the threshold for them to be refreshed can be pretty large when you have a lot of data. You can see more information about manually refreshing statistics in my previous posts on the subject:
Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deep understanding of how Statistics work really helps when you are performance tuning.
I mentioned in my previous post about manually updating statistics that you can specify whether they’re updated using a full scan, or you can specify an amount of data to sample, either a percentage of the table size, or a fixed number of rows. You can also choose not to specify this, and SQL Server will decide for you whether to do a full scan, or to sample a certain amount of data.
I thought it would be interesting to look at what the sample sizes are that SQL will choose to use, depending on the amount of data in your table. Note that this occurs if you update statistics without specifying how they should be sampled as below:
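For example, a bare update like this leaves the sampling decision entirely up to SQL Server (I’m using the dbo.Test table and index from the test script shown further down):
UPDATE STATISTICS dbo.Test IX_Test_TextValue;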
This is also the behaviour you will get when SQL updates statistics through the auto-stats mechanism. The fact that auto-stats may sample at a lower rate than is optimal for a given table and the queries against it is another reason you may choose to perform manual statistics updates.
To test this, I created a table and progressively pumped data in. Then after I inserted each batch of rows:
Ran a stats update, capturing the CPU time taken
Checked the statistics to see what sample size was used in the update
Checked the size of the index
Here’s some of the code I used for the test:
--Drop table if exists
IF (OBJECT_ID('dbo.Test')) IS NOT NULL DROP TABLE dbo.Test;
--Create table for Testing
CREATE TABLE dbo.Test(
Id INT IDENTITY(1,1) CONSTRAINT PK_Test PRIMARY KEY CLUSTERED,
TextValue VARCHAR(20) NULL
);

--Create index on TextValue
CREATE INDEX IX_Test_TextValue ON dbo.Test(TextValue);
--Insert a bunch of rows
INSERT INTO dbo.Test(TextValue)
SELECT TOP 100000 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d;
--Update statistics without specifying how many rows to sample
SET STATISTICS TIME ON;
UPDATE STATISTICS dbo.Test IX_Test_TextValue;
SET STATISTICS TIME OFF;
--View the statistics
DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER;
--Check the size of the index
SELECT
i.name AS IndexName,
SUM(s.used_page_count) AS Pages
FROM sys.dm_db_partition_stats AS s
JOIN sys.indexes AS i
ON s.[object_id] = i.[object_id] AND s.index_id = i.index_id
WHERE i.name = 'IX_Test_TextValue'
GROUP BY i.name;
The results of my testing are shown in the below table:
You can see that we have a full sample being taken for the statistics updates up to 400,000 records (896 pages), but that once the table size hits 500,000 records sampling kicks in. If you look at the number of pages you will see we now have over 1,000 pages; 1,000 pages is about 8MB of data, which is the threshold at which sampling starts to happen.
I wasn’t able to find a nice neat formula to determine the sampling rate based on table size, but if we look at the above figures there are still some insights to be gained. The main one is that you’ll notice that even as we double the table size, the number of rows sampled doesn’t go up by much. For instance from 500,000 to a million rows, only 10,000 more rows are sampled. This also means that even for pretty large tables, the update isn’t taking long – another reason why it’s worth leaving auto stats updates enabled and running synchronously with queries – they’re generally not going to take that long.
Another insight is that the percentage of rows sampled drops off very quickly. As the sample size doesn’t really increase that much even when the table size doubles – the percentage sampled has almost halved each time.
Having up to date statistics is vital for getting the best performance out of your queries. Even though SQL Server automatically updates statistics in the background for you (When do statistics get updated?), you may find there are times when you want to manage updating them yourself.
You may have large tables and find that the interval between the automatic updates is too big and is resulting in sub-optimal query plans.
You might need timely updates to a specific object – maybe as part of an ETL process to make sure that statistics are up to date after a particular part of the process, perhaps after a daily load into a table.
You may find that the automatic updates look at too small a sample size and you need to scan more of the table to enable accurate estimates for your queries.
My previous post on the Ascending Key problem demonstrated a common issue where the first of these scenarios could be affecting you. We’ll look at examples of the other scenarios in subsequent posts.
For now though, let’s just look at how you go about updating statistics.
The other thing you may be likely to want to specify is whether the statistics should be updated using a full scan of the table, or just be looking at a sample of the rows. In the above examples we didn’t specify this so SQL Server will decide for us. In general sampling (as opposed to full scans) kicks in when we have about 8MB of data in the table (or about 1000 pages).
If you want to specify a full scan the syntax is as follows:
UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH FULLSCAN;
If you want the statistics update to use sampling (more on how this works in subsequent posts) then you can choose to specify a percentage of the total table to be sampled:
UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH SAMPLE 10 PERCENT;
Or you can specify a fixed number of rows:
UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH SAMPLE 10000 ROWS;
You might want to use a sample as once your tables get large full scans can take a little time. Equally though, if you’re updating statistics in a quieter time (e.g. overnight) you may feel you can afford the extra time for the full scans. Here are some comparison figures I produced on my local machine showing how long full scans take. Obviously this will change depending on your hardware and other factors, including how much of the table is already in memory:
You can see however that the time taken pretty much scales linearly as the table size increases.
One thing to be aware of is parallelism. A full scan can run as a parallel operation on your server which can speed it up considerably. When you update statistics using a sample however this can only run single-threaded unless you’re on SQL Server 2016 (or higher). Sampling data to build statistics in parallel was one of the many excellent little improvements in 2016.
This is another method you might use for manually updating statistics (perhaps as part of a scheduled maintenance job). This system stored procedure can be used for updating all of the statistics objects in a database:
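This looks to be the sp_updatestats system procedure – the output shown below matches what it produces. A minimal call, run in the context of the database you want to maintain, is simply:
EXEC sys.sp_updatestats;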
This stored procedure iterates through your database using a WHILE loop and executes the UPDATE STATISTICS command as it goes. One nifty thing about using this procedure is that it only updates statistics objects where rows have changed, so you don’t have any overhead for refreshing statistics where the underlying data hasn’t been modified. You can see this from this extract of the output of the stored procedure:
[PK__TestMemo__3214EC070D799003], update is not necessary…
0 index(es)/statistic(s) have been updated, 1 did not require update.
[PK__TestMemo__3214EC07D3DC52DE], update is not necessary…
0 index(es)/statistic(s) have been updated, 1 did not require update.
[PK_Test] has been updated…
[IX_Test_TextValue] has been updated…
2 index(es)/statistic(s) have been updated, 0 did not require update.
Of course, if you’re looking to implement statistics update as part of regular maintenance, then you should definitely be considering using Ola Hallengren’s maintenance solution. Ola maintains a great solution for managing database backups and integrity checks as well index and statistics maintenance, and he shares it with the SQL community for free.
If you work in the world of SQL Server you’ve almost certainly heard of dbatools. For those who haven’t, it’s an open source PowerShell module for automating literally hundreds of tasks on your database instances.
What the rest of you may or may not know is that the creators of dbatools have been working on a book to make it easy to get started with the tool – Learn dbatools in a Month of Lunches.
If you’re saying to yourself, it all sounds good, but I don’t really know PowerShell that well, I’m not confident how I work with modules, open source and otherwise, it all seems like a lot to learn for things I can do already another way… then I highly recommend you check out this book.
Written by Chrissy LeMaire and Rob Sewell this book takes you through the steps to get going before leading you through many of the tasks that dbatools can help you with.
Speaking for myself, I’m not a PowerShell afficionado, I’d struggle to write more than a few basic commands without googling the syntax. Following this book though made it really easy to get going, and I learnt a lot of PowerShell fundamentals along the way. It was certainly a lot easier than if I’d had to figure out things for myself, and it felt like I was aquiring a more complete knowledge that will stand me in good stead. Within a very short time of accessing the live book in the browser, I was set up and running commands with dbatools.
The book isn’t complete yet, but is being published through the Manning Early Access Programme, which means you can access chapters as they are published. Each chapter is a nice bite sized chunk you can consume fairly quickly and painlessly. Everything is explained really clearly – and there are comprehensive code samples you can work through.
If you’re a DBA who is aware of all this stuff, not sure you really need it, but have a niggling doubt that you’re getting left behind and probably should get around to learning it at some point, grab the book and start working through a few chapters. You’ll be operating with confidence in no time.
You can find out more about the book and access some preview content through the Manning website:
For most of us in the Northern Hemisphere the clocks go forward next month and we get to look forward to longer, sunnier evenings. This post is a reminder that if you have jobs scheduled in SQL Server Agent for the night that happens you may want to check they will not be impacted by the change.
In the UK our clocks change at 1 AM and jump forward by an hour. That means the hour usually between 1 AM and 2 AM doesn’t exist. Any jobs I have scheduled in that time will not happen.
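If you want to check whether you’re affected, a sketch like the following against the msdb scheduling tables will list enabled job schedules that start between 1 AM and 2 AM (adjust the time window to match when the clocks change where you are):
SELECT j.name AS JobName, s.name AS ScheduleName, s.active_start_time
FROM msdb.dbo.sysjobs j
INNER JOIN msdb.dbo.sysjobschedules js ON j.job_id = js.job_id
INNER JOIN msdb.dbo.sysschedules s ON js.schedule_id = s.schedule_id
WHERE s.enabled = 1
AND s.active_start_time >= 10000 AND s.active_start_time < 20000; --active_start_time is stored as an integer in HHMMSS format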
In other countries the change may happen at a slightly different time – in the US I believe it happens at 2 AM – so you may need to check what time it occurs where you are if you’re not already aware of it.
If you have jobs that need to happen that would normally be scheduled in the “missing” hour then you need to re-schedule them. You may also want to consider permanently moving them to a different time window so you don’t have to worry about it again.
Or even better, you might want to consider having your servers on UTC so you don’t have this problem.
For what will happen in the Autumn when they go back again you can check out this post:
I’ve mentioned previously how not having up to date statistics can cause problems in query performance. This post looks at something called the Ascending Key Problem which can badly affect your cardinality estimation in some cases and therefore your execution plans.
The Ascending Key Problem relates to the most recently inserted data in your table which is therefore also the data that may not have been sampled and included in the statistics histograms. This sort of issue is one of the reasons it can be critical to update your statistics more regularly than the built-in automatic thresholds.
We’ll look at the problem itself, but also some of the mitigations that you can take to deal with it within SQL Server.
Imagine you have a table that stores a set of events. As new records are inserted they get stamped with the current date and time. You regularly query that table based on that EventDate looking to find recent events, let’s say just for the current day.
Even if you haven’t indexed the EventDate column (though why haven’t you?!), as long as you have AUTO CREATE STATISTICS and AUTO UPDATE STATISTICS on for your database you’ll have “reasonably” up to date statistics for that column.
But “reasonably” may not be good enough. When you’re querying against the most recent data it may not yet have been sampled by a statistics update and the range you are looking for may fall beyond the top of the histogram captured in the statistics object for EventDate. Imagine that statistics were last updated yesterday. When the Optimizer checks the statistics to estimate a rowcount for today’s date it finds that is above the top bound. So what should it guess?
Historically it would guess that there were zero rows, though cardinality estimates always get rounded up to a minimum of 1. If the real answer is a lot larger you might end up with a bad plan.
Let’s look at that in practice.
Staying true to the example above, I create a table called Events and I index the EventDate column:
CREATE TABLE dbo.Events
(
Id INT IDENTITY(1,1) CONSTRAINT PK_Events PRIMARY KEY CLUSTERED,
EventName VARCHAR(255) NOT NULL,
EventDate DATETIME CONSTRAINT DF_Events_EventDate DEFAULT (GETDATE())
);

CREATE INDEX IX_Events_EventDate ON dbo.Events(EventDate) INCLUDE (EventName);
Then I insert records to represent events at one minute intervals for 100 days:
--Insert data for 100 days at minute intervals from the start of this year
DECLARE @StartDate DATETIME = '20170101 00:00:00';

INSERT INTO dbo.Events(EventName, EventDate)
SELECT
'Event' + CAST(num.n AS VARCHAR(10)),
DATEADD(MINUTE, num.n, @StartDate)
FROM
(
SELECT TOP 144000 ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n
FROM sys.objects a, sys.objects b, sys.objects c
) num;
I’m going to query to check what date range was inserted. That should have the additional advantage of triggering a statistics update:
SELECT MIN(EventDate), MAX(EventDate)
FROM dbo.Events;
As a slight digression, it’s interesting to look at the execution plan here:
You can see two index scans. That sounds horrendous, scan the index twice to find the MIN and MAX? If you look at the properties though you can see it only read one row in each case:
An index scan doesn’t have to read all the records in the index, it can bail out once it is satisfied. For a MIN or MAX type query it makes perfect sense just to jump to one end of the index and start scanning.
The side lesson is that Scans aren’t always bad for performance.
Anyway, back to the topic in hand. Now let’s look at the statistics:
You can see they’re up to date. They show 144,000 rows in total which is correct. Interestingly the Histogram (bottom result-set) only has a couple of steps. SQL has determined that the data is uniformly distributed so has bunched it altogether. Clever stuff!
Let’s insert data for another day:
--Insert one more day's data
DECLARE @StartDate DATETIME;
SELECT @StartDate = MAX(EventDate) FROM dbo.Events;

INSERT INTO dbo.Events(EventName, EventDate)
SELECT
'Event' + CAST(num.n AS VARCHAR(10)),
DATEADD(MINUTE, num.n, @StartDate)
FROM
(
SELECT TOP 1440 ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n
FROM sys.objects a, sys.objects b, sys.objects c
) num;
Now I query to see the new events. I captured the MAX(EventDate) earlier so let’s use that to find the new records:
SELECT *
FROM dbo.Events
WHERE EventDate > '20170411'
OPTION (RECOMPILE);
(Notice I’ve added the option to recompile so I get a fresh plan each time I run this, that will be important for testing)
Let’s not bother with the results, we all know there will 1,440 records that are basically the same. Here’s the execution plan:
The interesting bit comes when I look at the properties for the Index Seek:
Estimated number of rows = 1, Actual = 1,440. That’s quite a long way out. Of course here we have a trivial query so the massive underestimate isn’t affecting our plan. If we started joining to other tables though it would likely result in a massively inefficient plan – perhaps choosing a Nested Loops join over a Hash or Merge.
Note I’m using SQL Server 2012 for this test and I’m not using the Traceflag (2371) which reduces the threshold for statistics updates ( When do Statistics Get Updated? ):
So I’ve got nearly another 30,000 rows to insert before statistics get automatically updated and my estimates come into line. If I’m always querying for the current day then it’s going to be very rare that statistics are going to be able to help me with a good estimate.
So what’s the fix?
Before we get on to the methods that have been introduced to try and ameliorate this problem, if you face this sort of scenario you might want to consider whether you need to update your statistics objects more often than the auto-stats threshold. If you have a regular job to rebuild fragmented indexes then those indexes that get rebuilt will have their statistics refreshed – however that won’t cover the auto-created statistics, and it won’t cover statistics for tables that don’t get rebuilt.
So, if you don’t have a specific scheduled job to regularly update statistics that is definitely worth considering.
In terms of how SQL has changed to help us, from SQL Server 2005 SP1, the nature of columns began to be tracked, monitoring updates of statistics to understand how the data changes. This additional information can be seen if you enable traceflag 2388, then view the statistics. Let’s have a look at what’s gathered. First I’m going to add a couple more days of data, updating the statistics between each insert, then I run the following:
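Something like the following is what I’m running (a sketch only – the statistics object here is the index created earlier in the post):
DBCC TRACEON(2388);
DBCC SHOW_STATISTICS('dbo.Events', IX_Events_EventDate);
DBCC TRACEOFF(2388);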
What you see here is historical information about the updates to the statistics. This is undocumented stuff, but some of what we see we can work out the meaning for. In particular we can see how many rows were inserted since the last statistics update, and how many of those values were above the top of the old histogram. We also see a column “Leading Column Type” which has a value of “Unknown”.
Now I’m going to insert another day’s date and update the statistics once more, then we’ll look at this again:
You can see that now we have a Leading Column Type of “Ascending”. After three updates to the statistics where the Leading Value was only increasing each time, SQL Server will identify that it is an ascending column. It must be at least three updates before SQL will recognise this, and if that stops being the case (i.e. some lower values are inserted) then the next statistics update will reset this until we again get three consecutive updates with only increasing values.
This happens in the background and you don’t need the traceflag 2388 to make it happen – that is just so you can see what is going on.
The obvious question is, now SQL knows my column is ascending, has that affected the estimation for my query? Before we look I’ll insert another day of data so there is some data beyond the histogram, and then I’ll query again:
SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE);
And the properties from the execution plan:
So nope. Nothing has changed.
To tell the query optimizer to take advantage of this extra information for ascending keys we have traceflag 2389. Let’s enable that and run the query again:
DBCC TRACEON(2389);

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE);
Voila! SQL Server has now estimated my rowcount perfectly.
Now, be warned. This is a rather contrived example with a perfectly and artificially smooth distribution of data. The estimate is made by checking the current maximum value in the table, and using that combined with the information existing in the statistics and the value of your predicate to extrapolate a guess. If your data is evenly distributed as it is here then the guess will be pretty good, if it is fairly skewed then it may be bad.
In any case though it will probably be better than the fixed value of 1 that would have been used historically.
One thing to note is that traceflag 2389 is only going to have any effect if the leading column of the relevant statistics object has been marked as ascending. There is also traceflag 2390, which adopts a similar behaviour even if your column hasn’t been identified as ascending, i.e. it will check the maximum value in the table and if it is higher than the max value in the histogram, it will extrapolate to work out the cardinality estimate.
So should you turn on the traceflag(s) globally?
The Microsoft recommendation is not to enable traceflags such as these unless you are suffering from the specific problem they are aiming to resolve, and even then to make sure you test carefully to ensure they are achieving what you desire.
One issue can be that in more complex queries there are a number of cardinality estimates being made. It can be that two bad estimates within the same plan might cancel each other out and the query overall performs fine. If you then implement something that fixes one of them, you risk such queries going bad – a scenario known as plan regression.
This sort of scenario is one of the reasons why Microsoft have made very few core changes to the cardinality estimator since it came out.
So, use 2389 if you are specifically encountering this sort of ascending key problem, but also, if you are in a position to change the code then you might want to consider adding it as a query hint so it only affects the specific query you are targeting. For our example query above, that would simply look like:
SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE, QUERYTRACEON 2389);
Welcome to SQL Server 2014 (and later)
In 2014 we received a substantial revamp of the Cardinality Estimator, the first since SQL Server 7.0. A bunch of assumptions and algorithms have been re-jigged based on the real-world data that Microsoft have seen in supporting their wide customer base.
Key to having a new version was the concept that, henceforth, optimizer fixes would be tied to the database compatibility version. This means that customers can upgrade their SQL Server version but if they find performance problems related to the upgrade they can downgrade their database’s compatibility level while the issues are resolved within their codebase.
One of the items specifically looked at in the new version was this Ascending Key problem. To see how things work in the latest versions I’m going to repeat many of the steps above using a database deployed on SQL Server 2016.
Create my table again
Populate with the 100 days data
Run a query to check the dates, which has the added benefit of updating statistics
Add one more day’s data
Then I’m ready to run my test query again:
SELECT *
FROM dbo.Events
WHERE EventDate > '20170411'
OPTION (RECOMPILE);
I get the same execution plan as ever so again I jump to the properties of the Index Seek operator to look at the estimates:
Now, this is interesting. I might have expected I would get either 1 row estimated (the old model) or 1,440 (the model with traceflag 2389). Instead I get 432 rows. It seems the new CE (Cardinality Estimator) uses a different algorithm.
Sometimes numbers stick in your head. I happen to know that where no statistics are available and you are querying with an inequality predicate (<, > , <=, >=) that the CE will estimate the number of rows to be 30% of the total number of rows in the table. This assumption seems to have originated in a 1979 research paper from IBM suggesting 1/3 was a good guess.
With 30% in my head I noticed that 432 is 30% of 1440. So it seems that the optimizer is recognising that we are querying for values above the histogram (where no statistics exist) with an inequality, it knows from somewhere that there have been 1440 rows inserted since the last statistics update, so it takes 30% of 1440 to produce the estimate (432).
To try to validate that theory I thought I’d query with a later datetime in the predicate. Sure enough, if I add 12 hours I still get 432 rows estimated. If I add 23 hours, 432 rows. In fact if I query for any date in the future, even outside of the maximum value in the table, guess what – I get an estimate of 432 rows.
I have a fascination for the algorithms involved in distribution statistics. It satisfies the maths geek in me. As such it’s difficult to end a post like this, there’s always more things to test, to try and work out. For instance what happens if you query across an interval that starts within the current histogram, but then extends above it? I’ll admit I’ve had a play, but will leave that for another post.
As a very final point in this post, I thought I’d just check whether the 2389 traceflag makes any difference to this estimation with the 2014 CE. I’ll change my query to look way into the future, enable the traceflag and look at the estimate:
SELECT *
FROM dbo.Events
WHERE EventDate > '99991231 23:59:59'
OPTION (RECOMPILE, QUERYTRACEON 2389);
Guess what? Still 432 rows… so no, the traceflag doesn’t appear to still give us any extra benefit.
Though when we get to that date it’ll be someone else’s problem to sort out!
Statistics are vitally important in allowing SQL Server to find the most efficient way to execute your queries. In this post we learn more about them, what they are and how they are used.
Cardinality is a term originally from Mathematics, generally defined as “The number of objects in a given set or grouping”. In SQL we’re continually dealing with sets so this becomes a very relevant topic, which in our context is just the “number of rows”.
When you have a query across multiple tables there are many ways in which SQL Server could decide to physically go about getting you the results you want. It could query and join the tables in any order and it could use different methods for matching records from the various tables you have joined together. It also needs to know how much memory to allocate for the operation – and to do that it needs to have an idea of the amount of data generated at each stage of processing.
A lot of this requires cardinality estimation, and SQL Server uses something called Statistics objects to perform that calculation.
Let’s look at a simple example: SELECT * FROM Person.Person p INNER JOIN Person.[Address] a ON p.AddressId = a.AddressId WHERE p.LastName = 'Smith' AND a.City = 'Bristol'
When it comes to gathering the results for this query there are a number of ways the database engine could go about it. For instance:
a) It could find all the records in the Person table with a LastName of Smith, look each of their addresses up and return only the ones who live in Bristol.
b) It could find all the Addresses in Bristol, look up the people associated with each address and return only the ones called Smith.
c) It could grab the set of people called Smith from the People table, grab all the addresses in Bristol, and only finally match up the records between those two sets.
Which of those operations is going to be most efficient depends very much on the number of records returned from each table. Let’s say that we have a million people called Smith, but there’s only one address in our whole database that is in Bristol (and let’s say that address does actually belong to someone called Smith).
In the first method above I would grab the million Smiths and then look their address up one by one in the address table until I found the one that lived in Bristol.
If I used method b) though, I would find the one matching address first, and then I would simply look up the owner of that address. Clearly in this rather contrived example, that’s going to be a lot quicker. So if SQL knows ahead of time roughly how many records to expect from each part of the query, hopefully it can make a good decision about how to get the data.
But how can it work out how many rows will be returned without actually running the query?
That’s where statistics objects come in. SQL Server maintains in the background data that equates to a histogram showing the distribution of the data in certain columns within a table. It does this any time you create an index – statistics will be generated on the columns the index is defined against, but it also does it any time it determines that it would be useful. So if SQL encounters a Where clause on Person.LastName – and that column isn’t involved in a useful index, SQL is likely to generate a statistics object to tell it about the distribution of data in that column.
I say “likely to” because it actually depends on the settings of your SQL instance. Server configuration is beyond the scope of this post but suffice to say you can let SQL automatically create Statistics objects – or not. You can let it automatically update them when the data has changed by more than a given threshold – or not. And you can specify whether updates to statistics should happen asynchronously or synchronously – i.e. in the latter case if your query determines that statistics needs updating then it will kick that off and wait until the update is complete before processing the query.
It’s generally recommended that auto creation and updating is on, and async updating is off.
Viewing Statistics Objects Let’s have a look at some actual statistics and see what they hold. There are a couple of ways of doing this, the first is through SSMS. If you look under a table in the object browser you will see a Statistics folder which holds any statistics objects relating to that table:
In the above example you can see some that have friendly names, these are Statistics that are related to an actual index that has been defined on the table and they have the same name as the Index – e.g. IX_Address_StateProvinceId.
You’ll also see we have some prefixed _WA_Sys and then with some random numbers following. These are statistics objects that SQL has created automatically on columns that aren’t indexed – or at least they weren’t indexed at the time the Statistics objects were created.
You can open these up with a double-click and see what’s inside:
This is the General tab you open up to. You’ll see it tells you what table the Statistics are for and what column(s). There are options for adding columns and changing the order – but you never really need to do this – as well as information to tell you when the statistics were last updated, and a check box if you want to update them now.
In the details tab there’s a lot more info:
I don’t find this the easiest display format to work with though, so rather than delving into what everything means here let’s look at the other way you can view statistics, which is by running the following command:
The format is straightforward, you just specify the table you are interested in the Statistics for, and the actual name of the Statistics object you want. You can see the same information as if you double-clicked on it, but the results are output in the results pane like any other query and are (I think) a lot easier to read. Allegedly there will soon be a third way in SQL Server to view Statistics as DBCC commands are considered a bit “clunky” – but we don’t know what that will look like yet.
The command outputs three resultsets:
This post is just an introduction to statistics – and generally you don’t need to know that much, it’s just handy to understand the basics. So let’s just run over the key bits of information you can see above:
First of all, the first recordset – otherwise known as the Stat Header:
Rows – is the number of rows in your table
Rows Sampled – this is how many rows were sampled to generate the statistics. SQL can generate or update statistics using sampling rather than reading all the rows. In this case you’ll see it did actually read the whole table.
Steps – If you imagine the statistics as a bar chart – this is the number of bars on the chart. Statistics objects have a maximum of 200 steps so if you have more distinct values in your column than that they will be grouped into steps.
Density – This is supposed to be the probability of a row having a particular value (calculated as 1 / Number of Distinct values in column). According to books online “This Density value is not used by the query optimizer and is displayed for backward compatibility with versions before SQL Server 2008.” I am using SQL 2012, and this number is just plain incorrect so don’t use it…
Recordset Number 2: The Density Vector
All Density – this is the accurate version of the Density statistic described above. So your probability of a given row having a specific value is about 0.0017. That’s a little less than one in 500. I happen to know there are 575 different Cities in the table so that makes sense. Sometimes SQL will use this value to form a plan – if it knows you’re going to search this table for a specific City and it doesn’t know that City when it makes the plan, then it could guess that about 1/500th of the rows will match your criteria.
Average Length – Is what it says on the can. The average length of data in this column.
Columns – The names of any columns measured in this statistics object. You can have statistics across multiple columns but I’m not going to cover that in this post. In this case it tells us these statistics are based on the “City” column.
Recordset Number 3: The Histogram
This last recordset shows the distribution of the data, and is what you could effectively use to draw a graph of the relative frequencies of different groups of values. Each row represents a step – or bar on the bar chart – and as mentioned above there can be a maximum of 200 steps. So you can see, a statistics object is quite lightweight, even for a massive table.
RANGE_HI_KEY – The upper limit of each step, so each step contains all the values bigger than the RANGE_HI_KEY of the last step, right up to and including this value.
RANGE_ROWS – This is how many rows in the table fall in this range – not including the number that match the RANGE_HI_KEY itself.
EQ_ROWS – The number of rows equal to the HI_KEY
DISTINCT_RANGE_ROWS – The number of different values in the range that there is data for (excluding the HI_KEY).
AVERAGE_RANGE_ROWS – The average number of rows for a given value within the range.
That’s a whistle-stop tour of the Statistics SQL Server holds on your data.
The algorithms that SQL then uses to calculate the number of rows for a given part of your query are pretty transparent when there’s just one column involved. If we look at the above example and let’s say you wanted to look up the rows where the City is “Abingdon” – the statistics tell us there is 1 matching row and that’s the figure SQL will use for cardinality estimation. Where a value is within a range then it will use a calculation based on the AVERAGE_RANGE_ROWS.
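For instance, a lookup like this one (just an illustrative query) would pick up that estimate of 1 row directly from the EQ_ROWS value for ‘Abingdon’:
SELECT *
FROM Person.[Address]
WHERE City = 'Abingdon';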
The main takeaway from this should just be to understand the extent – and limitations – of the information about the distribution of your data that SQL holds in the background.
If, when you’re tuning queries, you notice that the estimated row counts don’t match the actual, then this could be encouraging SQL to form a bad plan for the query. In these cases you might want to investigate what’s going on with the statistics.
Maybe your query is written in a way that it can’t use statistics effectively, one example of this can be where you store constant values in variables, then query using that variable in a WHERE clause. SQL will then optimise based on the average, rather than on your actual value.
Maybe the plan is based on one data value that has a very different cardinality to the one currently being queried. For instance when you first run a stored procedure, the plan is formed based on the parameters passed. Those parameters could have a cardinality that is quite different to those used in later executions.
Maybe the statistics are out of date and need refreshing. Stats get updated when approximately 20% of the data in the table changes, for a large table this can be a big threshold, so current statistics may not always hold good information about the data you are querying for.
Or maybe SQL is doing as good a job as it can with the information it has at its disposal. You might need to work out how you can give it a little extra help.
Here are some other recent posts about Statistics you may find useful:
I find this is a question that comes up again and again. What estimate for the number of rows returned does SQL Server use if you’re selecting from a column where there are no statistics available?
There are a few different algorithms used depending on how you’re querying the table. In this post we’ll look at where we have a predicate looking for a fixed value.
(If you want the short answer, it’s the fourth root of n cubed before SQL 2014 and the square root of n after that – where n is the number of rows)
This scenario can occur if you have AUTO CREATE STATISTICS turned off for your database, which we don’t recommend you do, but which you might choose to do anyway, and if you query a table with a predicate against a column with no index defined against it.
Let’s look at example querying against the AdventureWorks2012 database. I’ll start by looking at SQL Server 2012 then we’ll see how it behaves in later versions.
I’ve taken the following preparatory steps:
Set AUTO CREATE STATISTICS OFF for the database
Removed the index on the LastName column from the Person.Person table
Removed any ad-hoc statistics that existed against the table
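Something along these lines covers those steps (a sketch only – the index name is the one referenced earlier, and the auto-created statistics name will differ in your copy of the database):
ALTER DATABASE AdventureWorks2012 SET AUTO_CREATE_STATISTICS OFF;
DROP INDEX IX_Person_LastName_FirstName_MiddleName ON Person.Person;
--Drop any auto-created statistics on the LastName column, e.g.
DROP STATISTICS Person.Person._WA_Sys_00000007_693CA210;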
Then I run a simple query, with the Actual Execution Plan turned on:
SELECT *
FROM Person.Person p
WHERE p.LastName = 'Fox';
I only get one result out as there is only one Fox (Ms. Dorothy J.). Let’s look at the execution plan:
A clustered index scan as we might expect as I’ve removed any useful indexes from the table. You’ll notice there is a warning. If we view the tooltip you’ll see SQL warns us about the lack of statistics:
If we look at the estimated and actual row-counts we’ll see how that has affected us:
In the absence of any useful information – it knows the number of rows in the table but that is about it – SQL has estimated that there will be 1680 Foxes in the table. A bit of playing shows that we get the same estimate whatever value we search for.
If I turn AUTO CREATE STATISTICS on and run the query again then SQL generate a Statistics object against the LastName column and comes up with an estimate of 2.3 rows – which is a lot closer.
This matters a lot once you start running more complicated queries. The incorrect estimate is likely to affect the choice of plan that the optimizer makes, and may also affect the amount of memory it requests in order to run the query. Let’s look at a quick example of how the plan changes if we join the above query to another table.
First, without statistics (so I have to turn AUTO CREATE off again, and remove the statistics that got created):
SET STATISTICS IO ON;
SELECT *
FROM Person.Person p
INNER JOIN Person.EmailAddress e
ON p.BusinessEntityID = e.BusinessEntityID
WHERE LastName = 'Fox';
Here’s the execution plan:
You can see I’ve got a Merge Join as SQL thinks I’m expecting 1680 rows from the top table. A Nested Loops join would generally be better when I only expect one or two rows from that table.
I’ve also captured the IO so I can see how expensive the query was:
There’s not much difference in the overall IO, but you can see the Reads for the EmailAddress table have dropped from 10 to 2 due to the change from the Scan to the Seek. If the table was a lot bigger then we could see a large difference here.
So where does that estimate come from?
I thought I’d have a play and see if I could work out how SQL decided on that estimate of 1680 rows. I did some googling and found a suggestion that it might be a straight 9% of the total number of rows in the table, but that doesn’t quite add up and when I compared the same query pattern against a few tables I found I got a different ratio depending on the amount of rows in the table.
So I pumped rows incrementally into a fresh table and looked at the estimate and what the ratio was as the number of rows increased. Here’s my SQL for that:
--Create a Horrible Heap for my testing
CREATE TABLE dbo.TestStats(TestVal INT NOT NULL, TestText VARCHAR(255) NULL);
--Insert a bunch of rows using dodgy old-style cross joins
INSERT INTO TestStats
SELECT TOP 1 --Amount of rows I'm going to insert
1 , 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d, sys.objects e;
--Clear the plan cache so SQL generates a new estimate
DBCC FREEPROCCACHE;
--Query the table, then I go check the execution plan generated
--for the estimated number of rows
SELECT * FROM dbo.TestStats
WHERE TestVal = 1;
(One thing to note was that I got the same answers whether I was querying the text column or the integer column – SQL seems to use the same algorithm for both.)
I started to notice a pattern quite quickly, that the ratio halved when the number of rows went up by a factor of 16. I then restarted my test, targeting my row-counts to be where the estimated number of rows would be a nice round number. You can see that in the table below:
I then attempted to work out a formula for that. Rather than take you through the shoddy process of mathematics that led me to an answer, I’ll just tell you that the formula came out as:
e = n^(3/4)
Where e is the estimated number of rows for a given predicate value, and n is the total number of rows in the table – in other words, the fourth root of n cubed. I checked that against the full set of results I’d gathered and it held true across all values of n I’d tested.
To check it finally against my original query – the Person.Person table had 19,972 rows. I put that through the calculator with the formula and get 1680.027. If we look back at the original estimate you’ll see that SQL stated 1680.03 – so that is all good.
As I mentioned earlier I was using SQL Server 2012 for this test, and a new Cardinality Estimator came into effect in SQL 2014. So I thought I’d run the test again with SQL 2016 and see if the results changed:
We can see the estimated rows drop off a lot quicker here. Clearly Microsoft have decided to lower the estimate. Actually it is now just the square root of the total number of rows.
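Taking my original example, the Person.Person table with its 19,972 rows would get an estimate of roughly √19,972 ≈ 141 rows under that rule, rather than the 1680 we saw on SQL 2012.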
Hopefully you’re not in the scenario where you regularly have queries running without the appropriate statistics to support them. The above comparison though shows us that if you have such a query its behaviour could dramatically change if you are on on older version of SQL Server version and you upgrade. It could become better or it could become a lot worse.
There are a lot of changes like this that came in with the new version of the Cardinality Estimator in 2014. Places where underlying assumptions have been adjusted to make better guesses about the number of rows that will be returned by an operator. But they are still guesses based on the same information – there is no new data being captured in the Statistics to better inform the process. Of course Microsoft has made these changes to try and better model data out in the wild – but they are still fixed assumptions, which means sometimes they will be better and sometimes they will be worse.
One thing I should re-iterate is that these formulae we’ve discovered above are for a fairly specific querying pattern. There’s no guarantee that the calculation will be the same for a similar – but different query. It might be interesting to explore that further in a later post.
Also there may be other information in your database – such as constraints – that SQL can use to educate its guesses.
The main takeaway from all of this though, should of course be:
MAKE SURE AUTO CREATE STATISTICS IS TURNED ON FOR YOUR DATABASES!
Statistics objects are important to us for allowing SQL to make good estimates of the row-counts involved in different parts of a given query and to allow the SQL Optimiser to form efficient execution plans to deliver those query results.
Statistics get updated automatically when you rebuild (or re-organise) an index they are based on – but we only tend to rebuild indexes that are fragmented, and we don’t need fragmentation for statistics to be stale. We also may have many auto-created statistics objects that are not related to an index at all.
It’s generally recommended to have the database level setting AUTO_UPDATE_STATISTICS turned on, so that SQL can manage the process of keeping statistics up to date for us. The only excuse to turn it off is that you are managing the updates to stats yourself in a different manner. And you can always turn the auto update off at an individual table or statistics level if you need to, rather than for the whole database.
SQL Server has had the ability to automatically update statistics since version 7.0. Nonetheless for a long part of my career working with SQL Server, whenever a performance issue raised its head everyone’s knee-jerk response would be “Update Statistics!” In most cases though the people shouting that didn’t really understand what the “Statistics” were, or what mechanisms might already be in place for keeping them up to date.
Of course SQL Server isn’t perfect and sometimes it is helpful for human intelligence to intervene. But to provide intelligent intervention one has to understand how things work.
So how does the automatic updating of statistics work?
In the background SQL maintains a count of changes to tables that might affect statistics. This can be updates, inserts or deletes. So if I inserted 100 records, updated 100 records and then deleted 100 records, I would have made 300 changes.
When SQL forms an execution plan for a query it references various distribution statistics objects to estimate row-counts and to use that to try find the best plan. The statistics objects it looks at are referred to as being “interesting” in the context of the query.
Before using values from the statistics, the Optimizer will check to see if the statistics are “stale”, i.e. the modification counter exceeds a given threshold. If it does, SQL will trigger a resampling of the statistics before going on to form an execution plan. This means that the plan will be formed against up to date statistics for the table.
For subsequent executions of the query, the existing plan will be loaded from the plan cache. Within the plan, the Optimiser can see a list of the statistics objects that were deemed “interesting” in the first place. Once again it will check each of them to see if they are “stale”. If they are, an auto-update of the statistics object(s) will be triggered and once that is complete the plan will be recompiled, in case the updated statistics might suggest a better way of executing the query. Equally, if any of the statistics objects have been updated since the last execution then the plan will also be recompiled.
One important caveat to this is the database level setting AUTO_UPDATE_STATISTICS_ASYNC (Auto Update Statistics Asynchronously). Generally it is best to have this turned off, in which case the above behaviour is observed. If you turn it on however, in the case of stale stats the query execution will not wait for the stats to be updated, but will start them updating in the background while the query executes. The plan will only recompile to be based on the new stats at the next execution.
From SQL Server 2008 R2 SP2 and SQL Server 2012 SP1 we have a new DMF (Dynamic Management Function), sys.dm_db_stats_properties, that allows us to see how many row modifications have been captured against a given statistics object as well as when it was last refreshed, how many rows were sampled etc. Modifications are captured on a per column basis (though when statistics were originally introduced in SQL Server it was per table) so the counter will only be affected if the leading column for the statistics object has been affected by a given operation.
SELECT s.name AS StatsName, sp.*
FROM sys.stats s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE s.name = 'IX_Test_TextValue';
So what are the thresholds?
For a long time the thresholds were as follows. Statistics were considered stale if one of the following was true:
The table size has gone from 0 rows to more than 0 rows
The table had 500 rows or less when the statistics were last sampled and has since had more than 500 modifications
The table had more than 500 rows when the statistics were last sampled and the number of modifications is more than 500 + 20% of the row-count when the statistics were last sampled (when talking about tables with larger row-counts a lot of the documentation just describes this as 20% as the additional 500 becomes less and less relevant the larger the number you are dealing with).
Those thresholds did mean that when a table had a large number of rows, Statistics might not get updated that often. A table with a million rows would only have stats updated if about 200,000 rows changed. Depending on the distribution of the data and how it is being queried this could be a problem.
So, in SQL 2008 R2 SP2 Microsoft introduced Traceflag 2371 which when set would reduce the stale statistics threshold for larger tables. From SQL 2016 this is the default functionality.
That adds the following test for statistics being stale:
If the number of rows (R) when the statistics were last sampled is 25,000 or more and the number of modifications is more than the square root of (1,000 x R):
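As a quick worked example for a million row table: the old rule needs 500 + (20% x 1,000,000) = 200,500 modifications before statistics are considered stale, whereas SQRT(1,000 x 1,000,000) is only about 31,623 modifications.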
Now, I’m just going to correct myself here, the documentation I’ve found SAYS the threshold is 25,000 but when I started to have a play that didn’t seem to be the case at all.
What actually seems to happen is that whichever of the two thresholds comes out smaller gets used, i.e. either:
T = 500 + (20% x R)
or
T = SQRT(1,000 x R)
Whichever is smaller.
I don’t know if this means that both get evaluated and the smaller is used, or if the threshold between the two rules is simply defined at the point where the second formula gives the smaller result – which is after 19,682 rows. I discovered that threshold by solving where the two equations above would give the same result – then by experimenting to prove it in practice.
I think this incorrect stating of 25,000 as the threshold probably comes from confusion, taking an approximation (20%) as the actual figure. Remember I mentioned that people often generalise to say that statistics are stale after 20% of the rows change, and forget about the extra 500 rows. If that was true and it was exactly 20%, then the threshold would be 25,000 as that would be the point that both equations are equal.
Anyway it’s not even vaguely important to know that. I just found it interesting! Note that the tests above were carried out on SQL Server 2012 SP3 so could well be different on later versions.
To more visually understand the above rules, here’s a table showing the thresholds for some example table sizes under both the Old algorithm (without the traceflag) and the New algorithm (with the traceflag or on SQL 2016 or later).
R is the number of rows when the statistics were last sampled and T is the number of modifications for statistics to be considered stale:
You can see for the larger table sizes there is a massive difference. If you’ve got large tables you’re querying against and are having to update the statistics manually to keep them fresh then you may find implementing the traceflag is a help.
For large tables statistics are sampled when being updated, rather than the whole table necessarily being read. I have details on that in this post:
Query Store is a configuration that is enabled per database, and the plans and stats for queries executed in that database are stored in the database itself. So what does query store do when a query spans more than one database?
Does it record the execution stats in all databases involved or does it store them in one based on some criteria (e.g. the one where the most work occurs)? Or does it somehow proportion them out between the databases?
This was relevant as it crossed my mind that if it records them in multiple databases then my query in the above post could be double counting.
Time to test and find out.
I created three databases, Fred, Bert and Ernie. Then a table called Fred in database Fred, and a table called Bert in database Bert. In table Fred I created a bunch of records, then in table Bert I created a much bigger bunch of records:
DROP DATABASE IF EXISTS Fred;
DROP DATABASE IF EXISTS Bert;
DROP DATABASE IF EXISTS Ernie;
CREATE DATABASE Fred;
CREATE DATABASE Bert;
CREATE DATABASE Ernie;
USE Fred;
CREATE TABLE dbo.Fred(Id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, FredText NVARCHAR(500));
INSERT INTO dbo.Fred(FredText)
SELECT a.name + b.name
FROM sys.objects a, sys.objects b;
USE Bert;
CREATE TABLE dbo.Bert(Id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, BertText NVARCHAR(500));
INSERT INTO dbo.Bert(BertText)
SELECT a.name + b.name + c.name
FROM sys.objects a, sys.objects b, sys.objects c;
Then I turned on Query Store for all three databases:
ALTER DATABASE Fred SET query_store = ON;
ALTER DATABASE Bert SET query_store = ON;
ALTER DATABASE Ernie SET query_store = ON;
Once that was done I concocted a horrible query that was bound to be horrendously slow – so I knew it would be easy to find when I queried the Query Store runtime stats:
SET STATISTICS IO ON
SELECT TOP 100000 *
FROM Fred.dbo.Fred f
INNER JOIN Bert.dbo.Bert b
ON b.BertText LIKE '%' + f.FredText + '%';
I turned STATISTICS IO on so I could see how much work was happening in each database.
So you can see most of the work occurs in the Bert database, a little in Fred, and none in Ernie.
Now let’s see what query store captured when I ran the query pointing at database Fred:
And pointing at database Bert:
And pointing at database Ernie:
You can see that the figures get recorded against whichever database you are pointing at – regardless of where the data being accessed resides. I left the “TotalLogicalReads %” in the above screen shots so you can see I’m not hiding anything.
This has a few implications. First, I’m happy because it means my “Expensive queries” script isn’t double counting.
Second though, as you can’t turn on query store on in any of the system databases, you won’t be able to capture details for any queries executed with those as the context. That includes ad-hoc queries where the user may connect to master, but execute queries against your other databases.
Fortunately (because Query Store would be pretty pointless if it did) this doesn’t apply to stored procedures.
I’m going to wrap my horrible query into a stored procedure, and deploy it into database Ernie:
USE Ernie;
GO

CREATE PROCEDURE dbo.Horrible
AS
SELECT TOP 100000 *
FROM Fred.dbo.Fred f
INNER JOIN Bert.dbo.Bert b
ON b.BertText LIKE '%' + f.FredText + '%';
Now I clear my Query Stores for the three databases one last time. Then I call the stored procedure from database Fred:
Here’s what I get from query store now:
So, Query Store logs the execution against database Ernie – where the stored procedure resides, rather than Fred – where it was called from, or Bert – where most of the work was done.
I hope you’ll trust me enough on that that I don’t have to demonstrate all the other combinations!