Statistics and the Ascending Key Problem

July 25, 2017 ~ Matthew McGiffen ~ 2 Comments

I’ve mentioned previously how not having up to date statistics can cause problems in query performance. This post looks at something called the Ascending Key Problem which can badly affect your cardinality estimation in some cases and therefore your execution plans.

The Ascending Key Problem relates to the most recently inserted data in your table which is therefore also the data that may not have been sampled and included in the statistics histograms. This sort of issue is one of the reasons it can be critical to update your statistics more regularly than the built-in automatic thresholds.

We’ll look at the problem itself, but also some of the mitigations that you can take to deal with it within SQL Server.

The Problem

Imagine you have a table that stores a set of events. As new records are inserted they get stamped with the current date and time. You regularly query that table based on that EventDate looking to find recent events, let’s say just for the current day.

Even if you haven’t indexed the EventDate column (though why haven’t you?!), as long as you have AUTO CREATE STATISTICS and AUTO UPDATE STATISTICS on for your database you’ll have “reasonably” up to date statistics for that column.

But “reasonably” may not be good enough. When you’re querying against the most recent data it may not yet have been sampled by a statistics update and the range you are looking for may fall beyond the top of the histogram captured in the statistics object for EventDate. Imagine that statistics were last updated yesterday. When the Optimizer checks the statistics to estimate a rowcount for today’s date it finds that is above the top bound. So what should it guess?

Historically it would guess that there were zero rows, but as always the cardinality estimation gets set to the minimum of 1. If the real answer is a lot larger you might end up with a bad plan.

Let’s look at that in practice.

Staying true to the example above, I create a table called Events and I index the EventDate column:

CREATE TABLE dbo.Events
(
Id INT IDENTITY(1,1) CONSTRAINT PK_Events PRIMARY KEY CLUSTERED,
EventName VARCHAR(255) NOT NULL,
EventDate DATETIME CONSTRAINT DF_Events_EventDate DEFAULT (GETDATE())
);

CREATE INDEX IX_Events_EventDate ON dbo.Events(EventDate) include (EventName);

Then I insert records to represent events at one minute intervals for 100 days:

--Insert data for 100 days at minute intervals from the start of this year
DECLARE @StartDate DATETIME = '20170101 00:00.00';

INSERT INTO dbo.Events(EventName, EventDate)
SELECT
    'Event' + CAST(num.n AS VARCHAR(10)),
    DATEADD(minute,num.n,@StartDate)
FROM 
(
    SELECT TOP 144000 row_number() OVER(ORDER BY (SELECT NULL)) AS n 
    FROM sys.objects a, sys.objects b, sys.objects c
) num;

I’m going to query to check what date range was inserted. That should have the additional advantage of triggering a statistics update:

SELECT MIN(EventDate), MAX(EventDate)
FROM dbo.Events;

As a slight digression, it’s interesting to look at the execution plan here:

AscendingKey1

You can see two index scans. That sounds horrendous, scan the index twice to find the MIN and MAX? If you look at the properties though you can see it only read one row in each case:

AscendingKey2

An index scan doesn’t have to read all the records in the index, it can bail out once it is satisfied. For a MIN or MAX type query it makes perfect sense just to jump to one end of the index and start scanning.

The side lesson is that Scans aren’t always bad for performance.

Anyway, back to the topic in hand. Now let’s look at the statistics:

DBCC SHOW_STATISTICS('dbo.Events',IX_Events_EventDate);

AscendingKey25

You can see they’re up to date. They show 144,000 rows in total which is correct. Interestingly the Histogram (bottom result-set) only has a couple of steps. SQL has determined that the data is uniformly distributed so has bunched it altogether. Clever stuff!

Let’s insert data for another day:

--Insert one more day's data
DECLARE @StartDate DATETIME;

SELECT @StartDate = MAX(EventDate) FROM dbo.Events;

INSERT INTO dbo.Events(EventName, EventDate)
SELECT
    'Event' + CAST(num.n AS VARCHAR(10)),
    DATEADD(minute,num.n,@StartDate)
FROM 
(
    SELECT TOP 1440 row_number() OVER(ORDER BY (SELECT NULL)) AS n 
    FROM sys.objects a, sys.objects b, sys.objects c
) num;

Now I query to see the new events. I captured the MAX(EventDate) earlier so let’s use that to find the new records:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170411'
OPTION (RECOMPILE);

(Notice I’ve added the option to recompile so I get a fresh plan each time I run this, that will be important for testing)

Let’s not bother with the results, we all know there will 1,440 records that are basically the same. Here’s the execution plan:

AscendingKey3

The interesting bit comes when I look at the properties for the Index Seek:

AscendingKey4

Estimated number of rows = 1, Actual = 1,440. That’s quite a long way out. Of course here we have a trivial query so the massive underestimate isn’t affecting our plan. If we started joining to other tables though it would likely result in a massively inefficient plan – perhaps choosing a Nested Loops join over a Hash or Merge.

Note I’m using SQL Server 2012 for this test and I’m not using the Traceflag (2371) which reduces the threshold for statistics updates ( When do Distribution Statistics Get Updated? ):

So I’ve got nearly another 30,000 rows to insert before statistics get automatically updated and my estimates come into line. If I’m always querying for the current day then it’s going to be very rare that statistics are going to be able to help me with a good estimate.

So what’s the fix?

Before we get on to the methods that have been introduced to try and ameliorate this problem, if you face this sort of scenario you might want to consider whether you need to update your statistics objects more often than the auto-stats threshold. If you have a regular job to rebuild fragmented indexes then those indexes that get rebuilt will have their statistics refreshed – however that won’t cover the auto created statistics, and it won’t cover statistics for tables that get don’t get rebuilt.

So, if you don’t have a specific scheduled job to regularly update statistics that is definitely worth considering.

In terms of how SQL has changed to help us, from SQL Server 2005 SP1, the nature of columns began to be tracked, monitoring updates of statistics to understand how the data changes. This additional information can be seen if you enable traceflag 2388, then view the statistics. Let’s have a look at what’s gathered. First I’m going to add a couple more days of data, updating the statistics between each insert, then I run the following:

DBCC TRACEON(2388);
DBCC SHOW_STATISTICS('dbo.Events',IX_Events_EventDate);
DBCC TRACEOFF(2388);

What you see here is historical information about the updates to the statistics. This is undocumented stuff, but some of what we see we can work out the meaning for. In particular we can see how many rows were inserted since the last statistics update, and how many of those values were above the top of the old histogram. We also see a column “Leading Column Type” which has a value of “Unknown”.

Now I’m going to insert another day’s date and update the statistics once more, then we’ll look at this again:

AscendingKey6

You can see that now we have a Leading Column Type of “Ascending”. After three updates to the statistics where the Leading Value was only increasing each time, SQL Server will identify that it is an ascending column. It must be at least three updates before SQL will recognise this, and if that stops being the case (i.e. some lower values are inserted) then the next statistics update will reset this until we again get three consecutive updates with only increasing values.

This happens in the background and you don’t need the traceflag 2388 to make it happen –that is just so you can see what is going on.

The obvious question is, now SQL knows my column is ascending, has that affected the estimation for my query? Before we look I’ll insert another day of data so there is some data beyond the histogram, and then I’ll query again:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE);

And the properties from the execution plan:

AscendingKey7

So nope. Nothing has changed.

Yet…

To tell the query optimizer to take advantage of this extra information for ascending keys we have traceflag 2389. Let’s enable that and run the query again:

DBCC TRACEON(2389);

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE);

The properties:

AscendingKey8

Voila! SQL Server has now estimated my rowcount perfectly.

Now, be warned. This is a rather contrived example with a perfectly and artificially smooth distribution of data. The estimate is made by checking the current maximum value in the table, and using that combined with the information existing in the statistics and the value of your predicate to extrapolate a guess. If you’re data is evenly distributed as it is here then the guess will be pretty good, if it is fairly skewed then it may be bad.

In any case though it will probably be better that the fixed value of 1 that would have been used historically.

One thing to note is that traceflag 2389 is only going to have any affect if the leading column of the relevant statistics object has been marked as ascending. There is also traceflag 2390, and this will adopt a similar behaviour even if your column hasn’t been identified as ascending, i.e. it will check the maximum value in the table and if it is higher than the max value in the histogram, it will extrapolate to work out the cardinality estimate.

So should you turn on the traceflag(s) globally?

The Microsoft recommendation is not to enable traceflags such as these unless you are suffering from the specific problem they are aiming to resolve, and even then to make sure you test carefully to ensure they are achieving what you desire.

One issue can be that in more complex queries there are a number of cardinality estimates being made. It can be that two bad estimates within the same plan might cancel each other out and the query overall performs fine. If you then implement something that fixes one of them, you risk such queries going bad – a scenario known as plan regression.

This sort of scenario is one of the reasons why Microsoft have made very few core changes to the cardinality estimator since it came out.

So, use 2389 is you are specifically encountering this sort of ascending key problem, but also, if you are in the position to change the code then you might want to consider adding it as a query hint so it only affects the specific query you are targeting. For our example query above, that would simply look like:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE, QUERYTRACEON 2389);

Welcome to SQL Server 2014 (and later)

In 2014 we received a substantial revamp of the Cardinality Estimator, the first since SQL Server 7.0. A bunch of assumptions and algorithms have been re-jigged based on the real-world data that Microsoft have seen in supporting their wide customer base.

Key to having a new version was the concept that, henceforth, optimizer fixes would be tied to the database compatibility version. This means that customers can upgrade their SQL Server version but if they find performance problems related to the upgrade they can downgrade their database’s compatibility level while the issues are resolved within their codebase.

One of the items specifically looked at in the new version was this Ascending Key problem. To see how things work in the latest versions I’m going to repeat many of the steps above using a database deployed on SQL Server 2016.

So I:

Create my table again
Populate with the 100 days data
Run a query to check the dates, which has the added benefit of updating statistics
Add one more day’s data

Then I’m ready to run my test query again:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170411'
OPTION (RECOMPILE);

I get the same execution plan as ever so again I jump to the properties of the Index Seek operator to look at the estimates:

AscendingKey9

Now, this is interesting. I might have expected I would get either 1 row estimated (the old model) or 1,440 (the model with traceflag 2389). Instead I get 432 rows. It seems the new CE (Cardinality Estimator) uses a different algorithm.

Sometimes numbers stick in your head. I happen to know that where no statistics are available and you are querying with an inequality predicate (<, > , <=, >=) that the CE will estimate the number of rows to be 30% of the total number of rows in the table. This assumption seems to have originated in a 1979 research paper from IBM suggesting 1/3 was a good guess.

With 30% in my head I noticed that 432 is 30% of 1440. So it seems that the optimizer is recognising that we are querying for values above the histogram (where no statistics exist) with an inequality, it knows from somewhere that there have been 1440 rows inserted since the last statistics update, so it takes 30% of 1440 to produce the estimate (432).

To try validate that theory I thought I’d query with a later datetime in the predicate. Sure enough, if I add 12 hours I still get 432 rows estimated. If I add 23 hours, 432 rows. In fact if I query for any date in the future, even outside of the maximum value in the table, guess what – I get an estimate of 432 rows.

I have a fascination for the algorithms involved in distribution statistics. It satisfies the maths geek in me. As such it’s difficult to end a post like this, there’s always more things to test, to try and work out. For instance what happens if you query across an interval that starts within the current histogram, but then extends above it? I’ll admit I’ve had a play, but will leave that for another post.

As a very final point in this post, I thought I’d just check whether the 2389 traceflag makes any difference to this estimation with the 2014 CE. I’ll change my query to look way into the future, enable the traceflag and look at the estimate:

SELECT *
FROM dbo.Events
WHERE EventDate > '99991231 23:59:59'
OPTION (RECOMPILE, QUERYTRACEON 2389);

Guess what? Still 432 rows… so no, the traceflag doesn’t appear to still give us any extra benefit.
Though when we get to that date it’ll be someone else’s problem to sort out!

SQL Server Agent and Daylight Saving time

July 19, 2017August 7, 2017 ~ Matthew McGiffen ~ 7 Comments

There was a bit of chat today on the SQL Community slack about UTC and Daylight Saving Time. It reminded me I have a post in draft somewhere about the complications of dealing with timezones in SQL that I should get around to finishing.

It also reminded me of a fact I found in an old Paul Randall post from 2007 that not a lot of people seem to know so I thought it was worth sharing again.

I discovered it when I was working on a kind of hand-cranked version of log shipping with a twist for SQL Server. Before you ask me why… don’t! I promise it was for a good reason. At least it seemed like a good idea at the time…

Anyway, I was aware that I needed to work out what I was going to do when the clocks go back in October of each year.

I decided to try and find out what the out of the box log shipping does to cater for this. I started with Google and came across this 2007 article by Paul Randall:

How does daylight savings time affect disaster recovery? – Paul S. Randal

Apparently SQL Server Agent pauses for an hour after the clocks go back. In the UK the clocks go back at 2 AM on the last Sunday in October. So all the times from 1 AM to 2 AM effectively happen twice. SQL Server Agent will run jobs from 1 AM to 2 AM as usual, then when at 2 AM the clocks go back to 1 AM, SQL Agent will pause for an hour (i.e. no new jobs will be started) and only resume the second time it gets to 2 AM.

Sounds a bit crazy in some ways, but I tend to believe most things Paul Randall says and it does seem to have been proved since in practice.

It means that you don’t get a confusing sequence of times in your job history and other events that don’t really make sense, and it also means that jobs won’t happen twice e.g. if you had an index rebuild in that time.

It could have issues for DR though, imagine that this period is actually an active one for your application and that you have an RPO (recovery point objective) of a maximum of (let’s say) 5 minutes data loss. During that hour, no LOG backups are being taken. Suddenly you’re vulnerable to not being able to meet your DR service agreement. Could be something to watch out for.

For me though (at the time I learnt this) it was great as I didn’t need to worry about coding a workaround in my log restore code!

SQL Puzzle 2: Eight Queens

July 19, 2017 ~ Matthew McGiffen ~ 7 Comments

This puzzle was first proposed in 1848 by a composer of chess puzzles called Max Bezzel and has since spawned much analysis and many variants. Simply phrased, it goes as follows:

“Can you place 8 queen’s on a standard (8×8) chessboard so that no two queen’s threaten each other?”

(Just in case, I’ll remind you a queen can move and attack any number of squares in a straight line – horizontally, vertically, or diagonally)

Here’s one example solution:

8queens

By all means have some fun trying to find extra solutions to this with pencil and paper or chessboard, however we do have computers (and SQL!) these days, so this month’s challenge is to try find ALL the possible solutions to this puzzle using T-SQL.

I’m a bit of a closet chess enthusiast, so expect more in this theme!

Enjoy 🙂

Think twice before using table variables

July 11, 2017July 21, 2017 ~ Matthew McGiffen ~ 9 Comments

T-SQL Tuesday

tsql2sday150x150

For T-SQL Tuesday this month Raul Gonzalez has asked us all to blog about lessons learnt the hard way:

http://www.sqldoubleg.com/2017/07/03/tsql2sday-92-lessons-learned-the-hard-way/

My biggest sins have been executing code against production environments when I thought I was pointing at my local machine:

DELETE FROM dbo.Blah;

DROP DATABASE Blah;

I’ve learned from those experiences – mostly that I prefer it when I don’t have access to production!

As I’ve been doing a few posts about Statistics and Cardinality estimation recently I thought I’d cover instead the subject of Table Variables. This has maybe already been blogged to death, but it’s still one of the most common anti-patterns I see when performance tuning and a lesson I have seen so many people forced learn the hard way.

Table Variables vs Temp Tables

Those experienced with programming SQL will be familiar with both Temp Tables, and Table Variables as ways of storing intermediate resultsets within (for instance) a stored procedure.

Table variables came along with SQL Server 2000 and there seems to have been a wave of belief arriving with them that they were the latest and greatest thing. That Microsoft wanted us to use them from now on and stop using Temp tables. The day of the Temp table was over, long live the Table Variable!

Many developers reworked their code to replace all their temp tables with variables.

Uh-oh…

Initial guidance from Microsoft recommended table variables as the default except “where there is a significant volume of data and there is repeated use of the table”

https://support.microsoft.com/en-us/help/305977/inf-frequently-asked-questions—sql-server-2000—table-variables

They did however stress that you should test for any given scenario.

More recently they’ve updated their documentation to suggest “you should be cautious about using a table variable if you expect a larger number of rows (greater than 100).”

https://docs.microsoft.com/en-us/sql/t-sql/data-types/table-transact-sql

I often hear even more cautious advice:

Only use a variable if there will be less than 30 rows
Only use a variable where there will only be 1 row
Never use table variables! (unless you have to)

I tend to go with one of the last two, there any many more examples where a table variable can screw up your performance than there are with temp tables.

There are times though where you still might want to use a table variable. They can be a little easier to work with, and sometime they might offer functionality you need (such as returning results from a table value function).

Just to clear up one misconception, both types of object are stored in TempDB and both of them will be mostly operative in memory rather than writing their data to disk (unless there are memory pressures, or bad estimates cause them to spill to disk). Even this is a bit more complicated than that, but in general they are managed the same way in the background.

What’s the big difference?

The big difference, and the reason you often don’t want to use a table variable is about statistics and cardinality estimation (which we’ve been looking at in the last few posts). In brief recap – statistics are what SQL uses to estimate how many rows (the cardinality) will be generated during part of a query operation. If it gets this estimation about right it can use that to form a good plan of attack for executing the query. If it gets it wrong it can make a dogs dinner of it.

And here’s the thing. Temp tables have statistics generated on them, but Table Variables don’t. As far as SQL’s concerned your table variable has exactly one row in it – regardless of reality.

Caveat: The way Temp tables manage statistics is a little bit weird, so you can get unpredictable result sometimes – but at least they try! See this if you want a really deep dive on the subject:

Page Free Space : Temporary Tables in Stored Procedures

Anyway, let’s look at a quick couple of examples of this and how it can make things go wrong.

Example 1: Getting that estimation badly wrong

First of all just to prove the point. Let’s create a table variable and a temp table and dump a million rows into each so we can look at how SQL manages estimates from them.

Table variable first:

DECLARE @a TABLE(i INT);
 
--Create a million rows
WITH Nums(i) AS
(
       SELECT 1
       FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n(i)
)
INSERT INTO @a(i)
SELECT n1.i
FROM Nums n1 
CROSS JOIN Nums n2 
CROSS JOIN Nums n3 
CROSS JOIN Nums n4 
CROSS JOIN Nums n5 
CROSS JOIN Nums n6;
 
SELECT i FROM @a;

I clicked the button in SSMS to capture the actual execution plan for this so we can see what is going on under the covers. I’m only interested in the final SELECT statement, which produces the pretty basic plan:

Variables1

What I’m really interested in though is the properties of the Table Scan, which you can get by right-clicking and selecting Properties – or just hover over to get the Tool Tip you see below:

Variables2

Look at where it says “Number of Rows Read” and “Actual Number of Rows” and you’ll see 1,000,000. Then look down to the Estimated Number of Rows and you’ll see 1. That’s quite a big difference.

Just to prove this is because we’re using a table variable, let’s quickly look at the same example with a temp table:

CREATE TABLE #a(i INT);
 
-- Create a million rows
WITH Nums(i) AS
(
       SELECT 1
       FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n(i)
)
INSERT INTO #a(i)
SELECT n1.i
FROM Nums n1 
CROSS JOIN Nums n2 
CROSS JOIN Nums n3 
CROSS JOIN Nums n4 
CROSS JOIN Nums n5 
CROSS JOIN Nums n6;
 
SELECT i FROM #a;
 
DROP TABLE #a;

Here again’s the plan for the select, you’ll see it’s exactly the same:

Variables3

But look at the values in the Tool Tip this time:

Variables4

Now you’ll see all the values for number of rows are 1,000,000 – including the estimated number or rows which is very important as that’s the one SQL uses to work out what execution plan to use.

So that example was just to show how badly wrong the estimation is, but it probably didn’t hurt us that much as SQL selected the same plan in both cases. Let’s look at an example where it does bring some pain.

Example 2: Selecting a dodgy execution plan

The following query uses the AdventureWorks2012 database:

DECLARE @BusinessEntityId TABLE (BusinessEntityID INT NOT NULL PRIMARY KEY CLUSTERED);
 
INSERT INTO @BusinessEntityId
SELECT BusinessEntityID FROM Person.Person;
 
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
 
SELECT COUNT(*)
FROM @BusinessEntityId b
INNER JOIN Person.Person p
       ON b.BusinessEntityID = p.BusinessEntityID;
 
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

Now let’s look at the execution plan for that final select:

Variables5

Nothing that’s too obviously controversial. It’s doing a Clustered Index scan on our Table Variable which is fine as we want all the rows that match a value in there. Then it’s doing an index seek on the Person table so it’s going straight to the results we want. Great!

Or Not…

Let’s look at the Tool Tips again, first for the Clustered Index Scan:

Variables6

We focus on the same entries of before. Actual number of rows 19,772. Estimated number of Rows 1. Bit of a discrepancy. Is that affecting our plan though? All we can be sure of at this stage is that SQL thinks that’s a pretty good plan if we only had one row in the table variable. But we have a lot more than that.

Let’s look at the Tool Tip for the Clustered Index Seek:

Variables7

We can look at the same entries as before and we see the same discrepancy. But also look at the Number of Executions – SQL estimates it would have to execute this operator once as it thinks there’s exactly one row in the table variable. In actuality it executed it 19,772 times. That’s 19,772 seeks on the same table. I happen to know that’s the same as the number of records in the table. Surely there’s a better way of doing this.

In general the Nested Loops operator you see in the execution plan is good when the first( top) table is small compared to the second (bottom) table. Maybe it’s not the best choice in this case.

Now let’s look at the same example with a temp table:

CREATE TABLE  #BusinessEntityId (BusinessEntityID INT NOT NULL PRIMARY KEY CLUSTERED);
 
INSERT INTO #BusinessEntityId
SELECT BusinessEntityID FROM Person.Person;
 
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
 
SELECT COUNT(*)
FROM #BusinessEntityId b
INNER JOIN Person.Person p
       ON b.BusinessEntityID = p.BusinessEntityID;
 
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
 
DROP TABLE #BusinessEntityId;

Here’s the plan:

Variables8

You’ll hopefully see that this plan is different. It’s using something called a Hash Match operator – this creates a hashed value based on each row from the top input (the index scan on our temp table), and then feed into that an index scan from our bottom input – the Person table.

Rather than going into too much the implications of that, let’s jump straight to looking at the Tool Tips.

From the top operator first, the Clustered Index Scan:

Variables9

We can see here that the actual and estimated number of rows are both the same. Good so far.

Now to look at the bottom operator, the index scan on the Person table:

Variables10

Now we can see that the estimated and actual number of rows are the same again, but also the estimated and actual number of executions are the same – 1 in both cases.

So we have the query with the table variable generating an execution plan that results in nearly 20,000 seeks against an index vs. the query with a temp table generating 1 scan against the same index. In general seeks are quicker than scans, but probably not 20,000 times quicker, even with the effort of the Hashing part of the operation. To quantify the difference we can look at the output from the STATISTICS IO command for each query:

Table variable:

Table ‘Person’. Scan count 0, logical reads 59916, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘#B182BEEB’. Scan count 1, logical reads 35, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 47 ms, elapsed time = 39 ms.

Temp table:

Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 67, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘#BusinessEntityId___________________________________________________________________________________________________00000000013D’. Scan count 1, logical reads 35, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 15 ms, elapsed time = 13 ms.

The output from the temp table query looks more complicated as it mentions a Workfile and Worktable that don’t get used, but it’s easy enough to see that Temp Table runs 3-4 times quicker, and more significantly generates only about 100 logical reads as opposed to about 60,000 for the table variable query. That’s a VERY big difference.

So like I said at the top, if you’re thinking about using a table variable, think twice, it might not be what you want to do.

When do Distribution Statistics Get Updated?

July 5, 2017January 18, 2021 ~ Matthew McGiffen ~ 5 Comments

An updated version of this post can be found here.

Statistics objects are important to us for allowing SQL to make good estimates of the row-counts involved in different parts of a given query and to allow the SQL Optimiser to form efficient execution plans to delivery those query results.

Statistics get updated automatically when you rebuild (or re-organise) an index they are based on – but we only tend to rebuild indexes that are fragmented, and we don’t need fragmentation for statistics to be stale. We also may have many auto-created statistics objects that are not related to an index at all.

It’s generally recommended to have the database level setting AUTO_UPDATE_STATISTICS turned on, so that SQL can manage the process of keeping statistics up to date for us. The only excuse to turn it off is that you are managing the updates to stats yourself in a different manner. And you can always turn the auto update off at an individual table or statistics level if you need to, rather than for the whole database.

SQL Server has had the ability to automatically update statistics since version 7.0. Nonetheless for a long part of my career working with SQL Server, whenever a performance issue raised its head everyone’s knee-jerk response would be “Update Statistics!” In most cases though the people shouting that didn’t really understand what the “Statistics” were, or what mechanisms might already be in place for keeping them up to date.

Of course SQL Server isn’t perfect and sometimes it is helpful for human intelligence to intervene. But to provide intelligent intervention one has to understand how things work.

So how does the automatic updating of statistics work?

In the background SQL maintains a count of changes to tables that might affect statistics. This can be updates, inserts or deletes. So if I inserted 100 records, updated 100 records and then deleted 100 records, I would have made 300 changes.

When SQL forms an execution plan for a query it references various distribution statistics objects to estimate row-counts and to use that to try find the best plan. The statistics objects it looks at are referred to as being “interesting” in the context of the query.

Before using values from the statistics, the Optimizer will check to see if the statistics are “stale”, i.e. the modification counter exceeds a given threshold. If it does, SQL will trigger a resampling of the statistics before going on to form an execution plan. This means that the plan will be formed against up to date statistics for the table.

For subsequent executions of the query, the existing plan will be loaded from the plan cache. Within the plan, the Optimiser can see a list of the statistics objects that were deemed “interesting” in the first place. Once again it will check each of them to see if they are “stale”. If they are, an auto-update of the statistics object(s) will be triggered and once that is complete the plan will be recompiled, in case the updated statistics might suggest a better way of executing the query. Equally, if any of the statistics objects have been updated since the last execution then the plan will also be recompiled.

One important caveat to this is the database level setting AUTO_UPDATE_STATS_ASYNC (Asynchronously). Generally it is best to have this turned off, in which case the above behaviour is observed. If you turn it on however, in the case of stale stats the query execution will not wait for the stats to be updated, but will start them updating in the background while the query executes. The plan will only recompile to be based on the new stats at the next execution.

From SQL Server2008 R2 SP2 and SQL Server 2012 SP1 we have a new DMF (Dynamic Management Function) sys.dm_db_stats_properties that allows us to see how many row modifications have been captured against a given statistics object as well as when it was last refreshed, how many rows were sampled etc. Modifications are captured on a per column basis (though when statistics were originally introduced in SQL Server it was per table) so the counter will only be affected if the leading column for the statistics object has been affected by a given operation.

SELECT s.name AS StatsName, sp.* FROM sys.stats s CROSS apply sys.dm_db_stats_properties(s.OBJECT_ID, s.stats_id) sp WHERE s.name = 'IX_Test_TextValue'

Results:

So what are the thresholds?

For a long time the thresholds were as follows. Statistics were considered stale if one of the following was true:

The table size has gone from 0 rows to more than 0 rows
The table had 500 rows or less when the statistics were last sampled and has since had more than 500 modifications
The table had more than 500 rows when the statistics were last sampled and the number of modifications is more than 500 + 20% of the row-count when the statistics were last sampled (when talking about tables with larger row-counts a lot of the documentation just describes this as 20% as the additional 500 becomes less and less relevant the larger the number you are dealing with).

Those thresholds did mean that when a table had a large number of rows, Statistics might not get updated that often. A table with a million rows would only have stats updated if about 200,000 rows changed. Depending on the distribution of the data and how it is being queried this could be a problem.

So, in SQL 2008 R2 SP2 Microsoft introduced Traceflag 2371 which when set would reduce the stale statistics threshold for larger tables. From SQL 2016 this is the default functionality.

That adds the following test for statistics being stale:

If the number of rows (R) when the statistics were last sampled is 25,000 or more and the number of modifications is more than the square root of R x 1000:

Statistics_1000R

Now, I’m just going to correct myself here, the documentation I’ve found SAYS the threshold is 25,000 but when I started to have a play that didn’t seem to be the case at all.

What actually seems to happen is that whichever of the two estimates is smaller gets used i.e

Either:

Statistics_20pcnt

Or:

Statistics_1000R

Whichever is smaller.

I don’t know if this means that both get evaluated and the smaller is used, or if the threshold between the two rules is simply defined at the point where the second formula gives the smaller result – which is after 19,682 rows. I discovered that threshold by solving where the two equations above would give the same result – then by experimenting to prove it in practice.

I think this incorrect stating of 25,000 as the threshold probably comes from confusion, taking an approximation (20%) as the actual figure. Remember I mentioned that people often generalise to say that statistics are stale after 20% of the rows change, and forget about the extra 500 rows. If that was true and it was exactly 20%, then the threshold would be 25,000 as that would be the point that both equations are equal.

Anyway it’s not even vaguely important to know that. I just found it interesting! Note that the tests above were carried out on SQL Server 2012 SP3 so could well be different on later versions.

To more visually understand the above rules, here’s a table showing the thresholds for some example table sizes under both the Old algorithm (without the traceflag) and the New algorithm (with the traceflag or on SQL 2016 or later).

R is the number of rows when the statistics were last sampled and T is the number of modifications for statistics to be considered stale:

Statistics_Thresholds

You can see for the larger table sizes there is a massive difference. If you’ve got large tables you’re querying against and are having to update the statistics manually to keep them fresh then you may find implementing the traceflag is a help.

For large tables statistics are sampled when being updated rather than the whole table being necessarily being read. I have details on that in this post:

Automatic Sample Sizes for Statistics Updates

Matthew McGiffen Data

Month: July 2017