An updated version of this post can be found here.
Statistics objects are important to us for allowing SQL to make good estimates of the row-counts involved in different parts of a given query and to allow the SQL Optimiser to form efficient execution plans to delivery those query results.
Statistics get updated automatically when you rebuild (or re-organise) an index they are based on – but we only tend to rebuild indexes that are fragmented, and we don’t need fragmentation for statistics to be stale. We also may have many auto-created statistics objects that are not related to an index at all.
It’s generally recommended to have the database level setting AUTO_UPDATE_STATISTICS turned on, so that SQL can manage the process of keeping statistics up to date for us. The only excuse to turn it off is that you are managing the updates to stats yourself in a different manner. And you can always turn the auto update off at an individual table or statistics level if you need to, rather than for the whole database.
SQL Server has had the ability to automatically update statistics since version 7.0. Nonetheless for a long part of my career working with SQL Server, whenever a performance issue raised its head everyone’s knee-jerk response would be “Update Statistics!” In most cases though the people shouting that didn’t really understand what the “Statistics” were, or what mechanisms might already be in place for keeping them up to date.
Of course SQL Server isn’t perfect and sometimes it is helpful for human intelligence to intervene. But to provide intelligent intervention one has to understand how things work.
So how does the automatic updating of statistics work?
In the background SQL maintains a count of changes to tables that might affect statistics. This can be updates, inserts or deletes. So if I inserted 100 records, updated 100 records and then deleted 100 records, I would have made 300 changes.
When SQL forms an execution plan for a query it references various distribution statistics objects to estimate row-counts and to use that to try find the best plan. The statistics objects it looks at are referred to as being “interesting” in the context of the query.
Before using values from the statistics, the Optimizer will check to see if the statistics are “stale”, i.e. the modification counter exceeds a given threshold. If it does, SQL will trigger a resampling of the statistics before going on to form an execution plan. This means that the plan will be formed against up to date statistics for the table.
For subsequent executions of the query, the existing plan will be loaded from the plan cache. Within the plan, the Optimiser can see a list of the statistics objects that were deemed “interesting” in the first place. Once again it will check each of them to see if they are “stale”. If they are, an auto-update of the statistics object(s) will be triggered and once that is complete the plan will be recompiled, in case the updated statistics might suggest a better way of executing the query. Equally, if any of the statistics objects have been updated since the last execution then the plan will also be recompiled.
One important caveat to this is the database level setting AUTO_UPDATE_STATS_ASYNC (Asynchronously). Generally it is best to have this turned off, in which case the above behaviour is observed. If you turn it on however, in the case of stale stats the query execution will not wait for the stats to be updated, but will start them updating in the background while the query executes. The plan will only recompile to be based on the new stats at the next execution.
From SQL Server2008 R2 SP2 and SQL Server 2012 SP1 we have a new DMF (Dynamic Management Function) sys.dm_db_stats_properties that allows us to see how many row modifications have been captured against a given statistics object as well as when it was last refreshed, how many rows were sampled etc. Modifications are captured on a per column basis (though when statistics were originally introduced in SQL Server it was per table) so the counter will only be affected if the leading column for the statistics object has been affected by a given operation.
s.name AS StatsName, sp.*
FROM sys.stats s
CROSS apply sys.dm_db_stats_properties(s.OBJECT_ID, s.stats_id) sp
WHERE s.name = 'IX_Test_TextValue'
So what are the thresholds?
For a long time the thresholds were as follows. Statistics were considered stale if one of the following was true:
- The table size has gone from 0 rows to more than 0 rows
- The table had 500 rows or less when the statistics were last sampled and has since had more than 500 modifications
- The table had more than 500 rows when the statistics were last sampled and the number of modifications is more than 500 + 20% of the row-count when the statistics were last sampled (when talking about tables with larger row-counts a lot of the documentation just describes this as 20% as the additional 500 becomes less and less relevant the larger the number you are dealing with).
Those thresholds did mean that when a table had a large number of rows, Statistics might not get updated that often. A table with a million rows would only have stats updated if about 200,000 rows changed. Depending on the distribution of the data and how it is being queried this could be a problem.
So, in SQL 2008 R2 SP2 Microsoft introduced Traceflag 2371 which when set would reduce the stale statistics threshold for larger tables. From SQL 2016 this is the default functionality.
That adds the following test for statistics being stale:
- If the number of rows (R) when the statistics were last sampled is 25,000 or more and the number of modifications is more than the square root of R x 1000:
Now, I’m just going to correct myself here, the documentation I’ve found SAYS the threshold is 25,000 but when I started to have a play that didn’t seem to be the case at all.
What actually seems to happen is that whichever of the two estimates is smaller gets used i.e
Whichever is smaller.
I don’t know if this means that both get evaluated and the smaller is used, or if the threshold between the two rules is simply defined at the point where the second formula gives the smaller result – which is after 19,682 rows. I discovered that threshold by solving where the two equations above would give the same result – then by experimenting to prove it in practice.
I think this incorrect stating of 25,000 as the threshold probably comes from confusion, taking an approximation (20%) as the actual figure. Remember I mentioned that people often generalise to say that statistics are stale after 20% of the rows change, and forget about the extra 500 rows. If that was true and it was exactly 20%, then the threshold would be 25,000 as that would be the point that both equations are equal.
Anyway it’s not even vaguely important to know that. I just found it interesting! Note that the tests above were carried out on SQL Server 2012 SP3 so could well be different on later versions.
To more visually understand the above rules, here’s a table showing the thresholds for some example table sizes under both the Old algorithm (without the traceflag) and the New algorithm (with the traceflag or on SQL 2016 or later).
R is the number of rows when the statistics were last sampled and T is the number of modifications for statistics to be considered stale:
You can see for the larger table sizes there is a massive difference. If you’ve got large tables you’re querying against and are having to update the statistics manually to keep them fresh then you may find implementing the traceflag is a help.
For large tables statistics are sampled when being updated rather than the whole table being necessarily being read. I have details on that in this post: