Automatic Sample Sizes for Statistics Updates

August 15, 2017 ~ Matthew McGiffen ~ 8 Comments

I mentioned in my previous post about manually updating statistics that you can specify whether they’re updated using a full scan, or you can specify an amount of data to sample, either a percentage of the table size, or a fixed number of rows. You can also choose not to specify this, and SQL Server will decide for you whether to do a full scan, or to sample a certain amount of data.

I thought it would be interesting to look at what the sample sizes are that SQL will choose to use, depending on the amount of data in your table. Note that this occurs if you update statistics without specifying how they should be sampled as below:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2;

This is also the behaviour you will get when SQL updates statistics through the auto-stats mechanism. The fact that auto-stats may sample at a lower rate than is optimal for a given table and the queries against it is another reason you may choose to perform manual statistics updates.

To test this, I created a table and progressively pumped data in. Then after I inserted each batch of rows:

I Ran a stats update capturing the CPU time taken
Checked the statistics to see what sample size was used in the update
Checked the size of the index

Here’s some of the code I used for the test:

--Drop table if exists
IF (OBJECT_ID('dbo.Test')) IS NOT NULL DROP TABLE dbo.Test;

--Create table for Testing
CREATE TABLE dbo.Test(
   Id INT IDENTITY(1,1) CONSTRAINT PK_Test PRIMARY KEY CLUSTERED, 
   TextValue VARCHAR(20) NULL
);

--Create index on TextValue
CREATE INDEX IX_Test_TextValue ON dbo.Test(TextValue);

--Insert a bunch of rows
INSERT INTO dbo.Test(TEXTValue) 
SELECT TOP 100000 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d;  

--Update statistics without specifying how many rows to sample
SET STATISTICS TIME ON;
UPDATE STATISTICS dbo.Test IX_Test_TextValue;
SET STATISTICS TIME OFF;

--View the statistics
DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER;

--Check the size of the index
SELECT
    i.name                  AS IndexName,
    SUM(s.used_page_count)   AS Pages
FROM sys.dm_db_partition_stats  AS s 
JOIN sys.indexes                AS i
ON s.[object_id] = i.[object_id] AND s.index_id = i.index_id
WHERE i.name = 'IX_Test_TextValue'
GROUP BY i.name

The results of my testing are shown in the below table:

StatisticsSamplingRates

You can see that we have a full sample being taken for the statistics updates up to 4000,000 records (896 pages) but that once the table size hits 500,000 sampling is happening. If you look at the number of pages you will see we now have over 1,000 pages, 1000 pages being about 8MB of data, which is the threshold that sampling kicks in.

I wasn’t able to find a nice neat formula to determine the sampling rate based on table size, but if we look at the above figures there are still some insights to be gained. The main one is that you’ll notice that even as we double the table size, the number of rows sampled doesn’t go up by much. For instance from 500,000 to a million rows, only 10,000 more rows are sampled. This also means that even for pretty large tables, the update isn’t taking long – another reason why it’s worth leaving auto stats updates enabled and running synchronously with queries – they’re generally not going to take that long.

Another insight is that the percentage of rows sampled drops off very quickly. As the sample size doesn’t really increase that much even when the table size doubles – the percentage sampled has almost halved each time.

Statistics and the Ascending Key Problem

July 25, 2017 ~ Matthew McGiffen ~ 2 Comments

I’ve mentioned previously how not having up to date statistics can cause problems in query performance. This post looks at something called the Ascending Key Problem which can badly affect your cardinality estimation in some cases and therefore your execution plans.

The Ascending Key Problem relates to the most recently inserted data in your table which is therefore also the data that may not have been sampled and included in the statistics histograms. This sort of issue is one of the reasons it can be critical to update your statistics more regularly than the built-in automatic thresholds.

We’ll look at the problem itself, but also some of the mitigations that you can take to deal with it within SQL Server.

The Problem

Imagine you have a table that stores a set of events. As new records are inserted they get stamped with the current date and time. You regularly query that table based on that EventDate looking to find recent events, let’s say just for the current day.

Even if you haven’t indexed the EventDate column (though why haven’t you?!), as long as you have AUTO CREATE STATISTICS and AUTO UPDATE STATISTICS on for your database you’ll have “reasonably” up to date statistics for that column.

But “reasonably” may not be good enough. When you’re querying against the most recent data it may not yet have been sampled by a statistics update and the range you are looking for may fall beyond the top of the histogram captured in the statistics object for EventDate. Imagine that statistics were last updated yesterday. When the Optimizer checks the statistics to estimate a rowcount for today’s date it finds that is above the top bound. So what should it guess?

Historically it would guess that there were zero rows, but as always the cardinality estimation gets set to the minimum of 1. If the real answer is a lot larger you might end up with a bad plan.

Let’s look at that in practice.

Staying true to the example above, I create a table called Events and I index the EventDate column:

CREATE TABLE dbo.Events
(
Id INT IDENTITY(1,1) CONSTRAINT PK_Events PRIMARY KEY CLUSTERED,
EventName VARCHAR(255) NOT NULL,
EventDate DATETIME CONSTRAINT DF_Events_EventDate DEFAULT (GETDATE())
);

CREATE INDEX IX_Events_EventDate ON dbo.Events(EventDate) include (EventName);

Then I insert records to represent events at one minute intervals for 100 days:

--Insert data for 100 days at minute intervals from the start of this year
DECLARE @StartDate DATETIME = '20170101 00:00.00';

INSERT INTO dbo.Events(EventName, EventDate)
SELECT
    'Event' + CAST(num.n AS VARCHAR(10)),
    DATEADD(minute,num.n,@StartDate)
FROM 
(
    SELECT TOP 144000 row_number() OVER(ORDER BY (SELECT NULL)) AS n 
    FROM sys.objects a, sys.objects b, sys.objects c
) num;

I’m going to query to check what date range was inserted. That should have the additional advantage of triggering a statistics update:

SELECT MIN(EventDate), MAX(EventDate)
FROM dbo.Events;

As a slight digression, it’s interesting to look at the execution plan here:

AscendingKey1

You can see two index scans. That sounds horrendous, scan the index twice to find the MIN and MAX? If you look at the properties though you can see it only read one row in each case:

AscendingKey2

An index scan doesn’t have to read all the records in the index, it can bail out once it is satisfied. For a MIN or MAX type query it makes perfect sense just to jump to one end of the index and start scanning.

The side lesson is that Scans aren’t always bad for performance.

Anyway, back to the topic in hand. Now let’s look at the statistics:

DBCC SHOW_STATISTICS('dbo.Events',IX_Events_EventDate);

AscendingKey25

You can see they’re up to date. They show 144,000 rows in total which is correct. Interestingly the Histogram (bottom result-set) only has a couple of steps. SQL has determined that the data is uniformly distributed so has bunched it altogether. Clever stuff!

Let’s insert data for another day:

--Insert one more day's data
DECLARE @StartDate DATETIME;

SELECT @StartDate = MAX(EventDate) FROM dbo.Events;

INSERT INTO dbo.Events(EventName, EventDate)
SELECT
    'Event' + CAST(num.n AS VARCHAR(10)),
    DATEADD(minute,num.n,@StartDate)
FROM 
(
    SELECT TOP 1440 row_number() OVER(ORDER BY (SELECT NULL)) AS n 
    FROM sys.objects a, sys.objects b, sys.objects c
) num;

Now I query to see the new events. I captured the MAX(EventDate) earlier so let’s use that to find the new records:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170411'
OPTION (RECOMPILE);

(Notice I’ve added the option to recompile so I get a fresh plan each time I run this, that will be important for testing)

Let’s not bother with the results, we all know there will 1,440 records that are basically the same. Here’s the execution plan:

AscendingKey3

The interesting bit comes when I look at the properties for the Index Seek:

AscendingKey4

Estimated number of rows = 1, Actual = 1,440. That’s quite a long way out. Of course here we have a trivial query so the massive underestimate isn’t affecting our plan. If we started joining to other tables though it would likely result in a massively inefficient plan – perhaps choosing a Nested Loops join over a Hash or Merge.

Note I’m using SQL Server 2012 for this test and I’m not using the Traceflag (2371) which reduces the threshold for statistics updates ( When do Distribution Statistics Get Updated? ):

So I’ve got nearly another 30,000 rows to insert before statistics get automatically updated and my estimates come into line. If I’m always querying for the current day then it’s going to be very rare that statistics are going to be able to help me with a good estimate.

So what’s the fix?

Before we get on to the methods that have been introduced to try and ameliorate this problem, if you face this sort of scenario you might want to consider whether you need to update your statistics objects more often than the auto-stats threshold. If you have a regular job to rebuild fragmented indexes then those indexes that get rebuilt will have their statistics refreshed – however that won’t cover the auto created statistics, and it won’t cover statistics for tables that get don’t get rebuilt.

So, if you don’t have a specific scheduled job to regularly update statistics that is definitely worth considering.

In terms of how SQL has changed to help us, from SQL Server 2005 SP1, the nature of columns began to be tracked, monitoring updates of statistics to understand how the data changes. This additional information can be seen if you enable traceflag 2388, then view the statistics. Let’s have a look at what’s gathered. First I’m going to add a couple more days of data, updating the statistics between each insert, then I run the following:

DBCC TRACEON(2388);
DBCC SHOW_STATISTICS('dbo.Events',IX_Events_EventDate);
DBCC TRACEOFF(2388);

What you see here is historical information about the updates to the statistics. This is undocumented stuff, but some of what we see we can work out the meaning for. In particular we can see how many rows were inserted since the last statistics update, and how many of those values were above the top of the old histogram. We also see a column “Leading Column Type” which has a value of “Unknown”.

Now I’m going to insert another day’s date and update the statistics once more, then we’ll look at this again:

AscendingKey6

You can see that now we have a Leading Column Type of “Ascending”. After three updates to the statistics where the Leading Value was only increasing each time, SQL Server will identify that it is an ascending column. It must be at least three updates before SQL will recognise this, and if that stops being the case (i.e. some lower values are inserted) then the next statistics update will reset this until we again get three consecutive updates with only increasing values.

This happens in the background and you don’t need the traceflag 2388 to make it happen –that is just so you can see what is going on.

The obvious question is, now SQL knows my column is ascending, has that affected the estimation for my query? Before we look I’ll insert another day of data so there is some data beyond the histogram, and then I’ll query again:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE);

And the properties from the execution plan:

AscendingKey7

So nope. Nothing has changed.

Yet…

To tell the query optimizer to take advantage of this extra information for ascending keys we have traceflag 2389. Let’s enable that and run the query again:

DBCC TRACEON(2389);

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE);

The properties:

AscendingKey8

Voila! SQL Server has now estimated my rowcount perfectly.

Now, be warned. This is a rather contrived example with a perfectly and artificially smooth distribution of data. The estimate is made by checking the current maximum value in the table, and using that combined with the information existing in the statistics and the value of your predicate to extrapolate a guess. If you’re data is evenly distributed as it is here then the guess will be pretty good, if it is fairly skewed then it may be bad.

In any case though it will probably be better that the fixed value of 1 that would have been used historically.

One thing to note is that traceflag 2389 is only going to have any affect if the leading column of the relevant statistics object has been marked as ascending. There is also traceflag 2390, and this will adopt a similar behaviour even if your column hasn’t been identified as ascending, i.e. it will check the maximum value in the table and if it is higher than the max value in the histogram, it will extrapolate to work out the cardinality estimate.

So should you turn on the traceflag(s) globally?

The Microsoft recommendation is not to enable traceflags such as these unless you are suffering from the specific problem they are aiming to resolve, and even then to make sure you test carefully to ensure they are achieving what you desire.

One issue can be that in more complex queries there are a number of cardinality estimates being made. It can be that two bad estimates within the same plan might cancel each other out and the query overall performs fine. If you then implement something that fixes one of them, you risk such queries going bad – a scenario known as plan regression.

This sort of scenario is one of the reasons why Microsoft have made very few core changes to the cardinality estimator since it came out.

So, use 2389 is you are specifically encountering this sort of ascending key problem, but also, if you are in the position to change the code then you might want to consider adding it as a query hint so it only affects the specific query you are targeting. For our example query above, that would simply look like:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE, QUERYTRACEON 2389);

Welcome to SQL Server 2014 (and later)

In 2014 we received a substantial revamp of the Cardinality Estimator, the first since SQL Server 7.0. A bunch of assumptions and algorithms have been re-jigged based on the real-world data that Microsoft have seen in supporting their wide customer base.

Key to having a new version was the concept that, henceforth, optimizer fixes would be tied to the database compatibility version. This means that customers can upgrade their SQL Server version but if they find performance problems related to the upgrade they can downgrade their database’s compatibility level while the issues are resolved within their codebase.

One of the items specifically looked at in the new version was this Ascending Key problem. To see how things work in the latest versions I’m going to repeat many of the steps above using a database deployed on SQL Server 2016.

So I:

Create my table again
Populate with the 100 days data
Run a query to check the dates, which has the added benefit of updating statistics
Add one more day’s data

Then I’m ready to run my test query again:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170411'
OPTION (RECOMPILE);

I get the same execution plan as ever so again I jump to the properties of the Index Seek operator to look at the estimates:

AscendingKey9

Now, this is interesting. I might have expected I would get either 1 row estimated (the old model) or 1,440 (the model with traceflag 2389). Instead I get 432 rows. It seems the new CE (Cardinality Estimator) uses a different algorithm.

Sometimes numbers stick in your head. I happen to know that where no statistics are available and you are querying with an inequality predicate (<, > , <=, >=) that the CE will estimate the number of rows to be 30% of the total number of rows in the table. This assumption seems to have originated in a 1979 research paper from IBM suggesting 1/3 was a good guess.

With 30% in my head I noticed that 432 is 30% of 1440. So it seems that the optimizer is recognising that we are querying for values above the histogram (where no statistics exist) with an inequality, it knows from somewhere that there have been 1440 rows inserted since the last statistics update, so it takes 30% of 1440 to produce the estimate (432).

To try validate that theory I thought I’d query with a later datetime in the predicate. Sure enough, if I add 12 hours I still get 432 rows estimated. If I add 23 hours, 432 rows. In fact if I query for any date in the future, even outside of the maximum value in the table, guess what – I get an estimate of 432 rows.

I have a fascination for the algorithms involved in distribution statistics. It satisfies the maths geek in me. As such it’s difficult to end a post like this, there’s always more things to test, to try and work out. For instance what happens if you query across an interval that starts within the current histogram, but then extends above it? I’ll admit I’ve had a play, but will leave that for another post.

As a very final point in this post, I thought I’d just check whether the 2389 traceflag makes any difference to this estimation with the 2014 CE. I’ll change my query to look way into the future, enable the traceflag and look at the estimate:

SELECT *
FROM dbo.Events
WHERE EventDate > '99991231 23:59:59'
OPTION (RECOMPILE, QUERYTRACEON 2389);

Guess what? Still 432 rows… so no, the traceflag doesn’t appear to still give us any extra benefit.
Though when we get to that date it’ll be someone else’s problem to sort out!

Think twice before using table variables

July 11, 2017July 21, 2017 ~ Matthew McGiffen ~ 9 Comments

T-SQL Tuesday

tsql2sday150x150

For T-SQL Tuesday this month Raul Gonzalez has asked us all to blog about lessons learnt the hard way:

http://www.sqldoubleg.com/2017/07/03/tsql2sday-92-lessons-learned-the-hard-way/

My biggest sins have been executing code against production environments when I thought I was pointing at my local machine:

DELETE FROM dbo.Blah;

DROP DATABASE Blah;

I’ve learned from those experiences – mostly that I prefer it when I don’t have access to production!

As I’ve been doing a few posts about Statistics and Cardinality estimation recently I thought I’d cover instead the subject of Table Variables. This has maybe already been blogged to death, but it’s still one of the most common anti-patterns I see when performance tuning and a lesson I have seen so many people forced learn the hard way.

Table Variables vs Temp Tables

Those experienced with programming SQL will be familiar with both Temp Tables, and Table Variables as ways of storing intermediate resultsets within (for instance) a stored procedure.

Table variables came along with SQL Server 2000 and there seems to have been a wave of belief arriving with them that they were the latest and greatest thing. That Microsoft wanted us to use them from now on and stop using Temp tables. The day of the Temp table was over, long live the Table Variable!

Many developers reworked their code to replace all their temp tables with variables.

Uh-oh…

Initial guidance from Microsoft recommended table variables as the default except “where there is a significant volume of data and there is repeated use of the table”

https://support.microsoft.com/en-us/help/305977/inf-frequently-asked-questions—sql-server-2000—table-variables

They did however stress that you should test for any given scenario.

More recently they’ve updated their documentation to suggest “you should be cautious about using a table variable if you expect a larger number of rows (greater than 100).”

https://docs.microsoft.com/en-us/sql/t-sql/data-types/table-transact-sql

I often hear even more cautious advice:

Only use a variable if there will be less than 30 rows
Only use a variable where there will only be 1 row
Never use table variables! (unless you have to)

I tend to go with one of the last two, there any many more examples where a table variable can screw up your performance than there are with temp tables.

There are times though where you still might want to use a table variable. They can be a little easier to work with, and sometime they might offer functionality you need (such as returning results from a table value function).

Just to clear up one misconception, both types of object are stored in TempDB and both of them will be mostly operative in memory rather than writing their data to disk (unless there are memory pressures, or bad estimates cause them to spill to disk). Even this is a bit more complicated than that, but in general they are managed the same way in the background.

What’s the big difference?

The big difference, and the reason you often don’t want to use a table variable is about statistics and cardinality estimation (which we’ve been looking at in the last few posts). In brief recap – statistics are what SQL uses to estimate how many rows (the cardinality) will be generated during part of a query operation. If it gets this estimation about right it can use that to form a good plan of attack for executing the query. If it gets it wrong it can make a dogs dinner of it.

And here’s the thing. Temp tables have statistics generated on them, but Table Variables don’t. As far as SQL’s concerned your table variable has exactly one row in it – regardless of reality.

Caveat: The way Temp tables manage statistics is a little bit weird, so you can get unpredictable result sometimes – but at least they try! See this if you want a really deep dive on the subject:

Page Free Space : Temporary Tables in Stored Procedures

Anyway, let’s look at a quick couple of examples of this and how it can make things go wrong.

Example 1: Getting that estimation badly wrong

First of all just to prove the point. Let’s create a table variable and a temp table and dump a million rows into each so we can look at how SQL manages estimates from them.

Table variable first:

DECLARE @a TABLE(i INT);
 
--Create a million rows
WITH Nums(i) AS
(
       SELECT 1
       FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n(i)
)
INSERT INTO @a(i)
SELECT n1.i
FROM Nums n1 
CROSS JOIN Nums n2 
CROSS JOIN Nums n3 
CROSS JOIN Nums n4 
CROSS JOIN Nums n5 
CROSS JOIN Nums n6;
 
SELECT i FROM @a;

I clicked the button in SSMS to capture the actual execution plan for this so we can see what is going on under the covers. I’m only interested in the final SELECT statement, which produces the pretty basic plan:

Variables1

What I’m really interested in though is the properties of the Table Scan, which you can get by right-clicking and selecting Properties – or just hover over to get the Tool Tip you see below:

Variables2

Look at where it says “Number of Rows Read” and “Actual Number of Rows” and you’ll see 1,000,000. Then look down to the Estimated Number of Rows and you’ll see 1. That’s quite a big difference.

Just to prove this is because we’re using a table variable, let’s quickly look at the same example with a temp table:

CREATE TABLE #a(i INT);
 
-- Create a million rows
WITH Nums(i) AS
(
       SELECT 1
       FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n(i)
)
INSERT INTO #a(i)
SELECT n1.i
FROM Nums n1 
CROSS JOIN Nums n2 
CROSS JOIN Nums n3 
CROSS JOIN Nums n4 
CROSS JOIN Nums n5 
CROSS JOIN Nums n6;
 
SELECT i FROM #a;
 
DROP TABLE #a;

Here again’s the plan for the select, you’ll see it’s exactly the same:

Variables3

But look at the values in the Tool Tip this time:

Variables4

Now you’ll see all the values for number of rows are 1,000,000 – including the estimated number or rows which is very important as that’s the one SQL uses to work out what execution plan to use.

So that example was just to show how badly wrong the estimation is, but it probably didn’t hurt us that much as SQL selected the same plan in both cases. Let’s look at an example where it does bring some pain.

Example 2: Selecting a dodgy execution plan

The following query uses the AdventureWorks2012 database:

DECLARE @BusinessEntityId TABLE (BusinessEntityID INT NOT NULL PRIMARY KEY CLUSTERED);
 
INSERT INTO @BusinessEntityId
SELECT BusinessEntityID FROM Person.Person;
 
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
 
SELECT COUNT(*)
FROM @BusinessEntityId b
INNER JOIN Person.Person p
       ON b.BusinessEntityID = p.BusinessEntityID;
 
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

Now let’s look at the execution plan for that final select:

Variables5

Nothing that’s too obviously controversial. It’s doing a Clustered Index scan on our Table Variable which is fine as we want all the rows that match a value in there. Then it’s doing an index seek on the Person table so it’s going straight to the results we want. Great!

Or Not…

Let’s look at the Tool Tips again, first for the Clustered Index Scan:

Variables6

We focus on the same entries of before. Actual number of rows 19,772. Estimated number of Rows 1. Bit of a discrepancy. Is that affecting our plan though? All we can be sure of at this stage is that SQL thinks that’s a pretty good plan if we only had one row in the table variable. But we have a lot more than that.

Let’s look at the Tool Tip for the Clustered Index Seek:

Variables7

We can look at the same entries as before and we see the same discrepancy. But also look at the Number of Executions – SQL estimates it would have to execute this operator once as it thinks there’s exactly one row in the table variable. In actuality it executed it 19,772 times. That’s 19,772 seeks on the same table. I happen to know that’s the same as the number of records in the table. Surely there’s a better way of doing this.

In general the Nested Loops operator you see in the execution plan is good when the first( top) table is small compared to the second (bottom) table. Maybe it’s not the best choice in this case.

Now let’s look at the same example with a temp table:

CREATE TABLE  #BusinessEntityId (BusinessEntityID INT NOT NULL PRIMARY KEY CLUSTERED);
 
INSERT INTO #BusinessEntityId
SELECT BusinessEntityID FROM Person.Person;
 
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
 
SELECT COUNT(*)
FROM #BusinessEntityId b
INNER JOIN Person.Person p
       ON b.BusinessEntityID = p.BusinessEntityID;
 
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
 
DROP TABLE #BusinessEntityId;

Here’s the plan:

Variables8

You’ll hopefully see that this plan is different. It’s using something called a Hash Match operator – this creates a hashed value based on each row from the top input (the index scan on our temp table), and then feed into that an index scan from our bottom input – the Person table.

Rather than going into too much the implications of that, let’s jump straight to looking at the Tool Tips.

From the top operator first, the Clustered Index Scan:

Variables9

We can see here that the actual and estimated number of rows are both the same. Good so far.

Now to look at the bottom operator, the index scan on the Person table:

Variables10

Now we can see that the estimated and actual number of rows are the same again, but also the estimated and actual number of executions are the same – 1 in both cases.

So we have the query with the table variable generating an execution plan that results in nearly 20,000 seeks against an index vs. the query with a temp table generating 1 scan against the same index. In general seeks are quicker than scans, but probably not 20,000 times quicker, even with the effort of the Hashing part of the operation. To quantify the difference we can look at the output from the STATISTICS IO command for each query:

Table variable:

Table ‘Person’. Scan count 0, logical reads 59916, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘#B182BEEB’. Scan count 1, logical reads 35, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 47 ms, elapsed time = 39 ms.

Temp table:

Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 67, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘#BusinessEntityId___________________________________________________________________________________________________00000000013D’. Scan count 1, logical reads 35, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 15 ms, elapsed time = 13 ms.

The output from the temp table query looks more complicated as it mentions a Workfile and Worktable that don’t get used, but it’s easy enough to see that Temp Table runs 3-4 times quicker, and more significantly generates only about 100 logical reads as opposed to about 60,000 for the table variable query. That’s a VERY big difference.

So like I said at the top, if you’re thinking about using a table variable, think twice, it might not be what you want to do.

When do Distribution Statistics Get Updated?

July 5, 2017January 18, 2021 ~ Matthew McGiffen ~ 5 Comments

An updated version of this post can be found here.

Statistics objects are important to us for allowing SQL to make good estimates of the row-counts involved in different parts of a given query and to allow the SQL Optimiser to form efficient execution plans to delivery those query results.

Statistics get updated automatically when you rebuild (or re-organise) an index they are based on – but we only tend to rebuild indexes that are fragmented, and we don’t need fragmentation for statistics to be stale. We also may have many auto-created statistics objects that are not related to an index at all.

It’s generally recommended to have the database level setting AUTO_UPDATE_STATISTICS turned on, so that SQL can manage the process of keeping statistics up to date for us. The only excuse to turn it off is that you are managing the updates to stats yourself in a different manner. And you can always turn the auto update off at an individual table or statistics level if you need to, rather than for the whole database.

SQL Server has had the ability to automatically update statistics since version 7.0. Nonetheless for a long part of my career working with SQL Server, whenever a performance issue raised its head everyone’s knee-jerk response would be “Update Statistics!” In most cases though the people shouting that didn’t really understand what the “Statistics” were, or what mechanisms might already be in place for keeping them up to date.

Of course SQL Server isn’t perfect and sometimes it is helpful for human intelligence to intervene. But to provide intelligent intervention one has to understand how things work.

So how does the automatic updating of statistics work?

In the background SQL maintains a count of changes to tables that might affect statistics. This can be updates, inserts or deletes. So if I inserted 100 records, updated 100 records and then deleted 100 records, I would have made 300 changes.

When SQL forms an execution plan for a query it references various distribution statistics objects to estimate row-counts and to use that to try find the best plan. The statistics objects it looks at are referred to as being “interesting” in the context of the query.

Before using values from the statistics, the Optimizer will check to see if the statistics are “stale”, i.e. the modification counter exceeds a given threshold. If it does, SQL will trigger a resampling of the statistics before going on to form an execution plan. This means that the plan will be formed against up to date statistics for the table.

For subsequent executions of the query, the existing plan will be loaded from the plan cache. Within the plan, the Optimiser can see a list of the statistics objects that were deemed “interesting” in the first place. Once again it will check each of them to see if they are “stale”. If they are, an auto-update of the statistics object(s) will be triggered and once that is complete the plan will be recompiled, in case the updated statistics might suggest a better way of executing the query. Equally, if any of the statistics objects have been updated since the last execution then the plan will also be recompiled.

One important caveat to this is the database level setting AUTO_UPDATE_STATS_ASYNC (Asynchronously). Generally it is best to have this turned off, in which case the above behaviour is observed. If you turn it on however, in the case of stale stats the query execution will not wait for the stats to be updated, but will start them updating in the background while the query executes. The plan will only recompile to be based on the new stats at the next execution.

From SQL Server2008 R2 SP2 and SQL Server 2012 SP1 we have a new DMF (Dynamic Management Function) sys.dm_db_stats_properties that allows us to see how many row modifications have been captured against a given statistics object as well as when it was last refreshed, how many rows were sampled etc. Modifications are captured on a per column basis (though when statistics were originally introduced in SQL Server it was per table) so the counter will only be affected if the leading column for the statistics object has been affected by a given operation.

SELECT s.name AS StatsName, sp.* FROM sys.stats s CROSS apply sys.dm_db_stats_properties(s.OBJECT_ID, s.stats_id) sp WHERE s.name = 'IX_Test_TextValue'

Results:

So what are the thresholds?

For a long time the thresholds were as follows. Statistics were considered stale if one of the following was true:

The table size has gone from 0 rows to more than 0 rows
The table had 500 rows or less when the statistics were last sampled and has since had more than 500 modifications
The table had more than 500 rows when the statistics were last sampled and the number of modifications is more than 500 + 20% of the row-count when the statistics were last sampled (when talking about tables with larger row-counts a lot of the documentation just describes this as 20% as the additional 500 becomes less and less relevant the larger the number you are dealing with).

Those thresholds did mean that when a table had a large number of rows, Statistics might not get updated that often. A table with a million rows would only have stats updated if about 200,000 rows changed. Depending on the distribution of the data and how it is being queried this could be a problem.

So, in SQL 2008 R2 SP2 Microsoft introduced Traceflag 2371 which when set would reduce the stale statistics threshold for larger tables. From SQL 2016 this is the default functionality.

That adds the following test for statistics being stale:

If the number of rows (R) when the statistics were last sampled is 25,000 or more and the number of modifications is more than the square root of R x 1000:

Statistics_1000R

Now, I’m just going to correct myself here, the documentation I’ve found SAYS the threshold is 25,000 but when I started to have a play that didn’t seem to be the case at all.

What actually seems to happen is that whichever of the two estimates is smaller gets used i.e

Either:

Statistics_20pcnt

Or:

Statistics_1000R

Whichever is smaller.

I don’t know if this means that both get evaluated and the smaller is used, or if the threshold between the two rules is simply defined at the point where the second formula gives the smaller result – which is after 19,682 rows. I discovered that threshold by solving where the two equations above would give the same result – then by experimenting to prove it in practice.

I think this incorrect stating of 25,000 as the threshold probably comes from confusion, taking an approximation (20%) as the actual figure. Remember I mentioned that people often generalise to say that statistics are stale after 20% of the rows change, and forget about the extra 500 rows. If that was true and it was exactly 20%, then the threshold would be 25,000 as that would be the point that both equations are equal.

Anyway it’s not even vaguely important to know that. I just found it interesting! Note that the tests above were carried out on SQL Server 2012 SP3 so could well be different on later versions.

To more visually understand the above rules, here’s a table showing the thresholds for some example table sizes under both the Old algorithm (without the traceflag) and the New algorithm (with the traceflag or on SQL 2016 or later).

R is the number of rows when the statistics were last sampled and T is the number of modifications for statistics to be considered stale:

Statistics_Thresholds

You can see for the larger table sizes there is a massive difference. If you’ve got large tables you’re querying against and are having to update the statistics manually to keep them fresh then you may find implementing the traceflag is a help.

For large tables statistics are sampled when being updated rather than the whole table being necessarily being read. I have details on that in this post:

Automatic Sample Sizes for Statistics Updates

Row-count Estimates when there are no Statistics

June 28, 2017 ~ Matthew McGiffen ~ Leave a comment

I was discussing Cardinality Estimation with a colleague recently and the question came up, what cardinality does SQL Server use if you’re selecting from a column where there are no statistics available? I’ve discovered there are a few algorithms in play depending on how you’re querying the table. In this post we’ll look at where we have a predicate looking for a fixed value.

(If you want the short answer, it’s the fourth root of n cubed before SQL 2014 and the square root of n after that.)

This scenario can occur if you have AUTO CREATE STATISTICS turned off for your database, which we don’t recommend you do, but which you might choose to do anyway, and if you query a table with a predicate against a column with no index defined against it.
Let’s look at example. I’m using SQL Server 2012 for this investigation, and to start with am querying the AdventureWorks2012 database.
I’ve taken the following preparatory steps:

Set AUTO CREATE STATISTICS OFF for the database
Remove the Index on the LastName column for the Person.Person table
Removed any ad-hoc statistics that existed against the table

Then I run a simple query, with the Actual Execution Plan turned on:

SELECT *
FROM Person.Person p
WHERE p.LastName = 'Fox';

I only get one result out as there is only one Fox (Ms. Dorothy J.). Let’s look at the execution plan:

RowCountNoStatistics1

A clustered index scan as we might expect as I’ve removed any useful indexes from the table. You’ll notice there is a warning. If we view the tooltip you’ll see SQL warns us about the lack of statistics:

RowCountNoStatistics2

If we look at the estimated and actual row-counts we’ll see how that has affected us:

RowCountNoStatistics3

In the absence of any useful information – it knows the number of rows in the table but that is about it – SQL has estimated that there will be 1680 Foxes in the table. A bit of playing shows that we get the same estimate whatever value we search for.

If I turn AUTO CREATE STATISTICS on and run the query again then SQL generate a Statistics object against the LastName column and comes up with an estimate of 2.3 rows – which is a lot closer.

This matters a lot once you start running more complicated queries. The incorrect estimate is likely to affect the choice of plan that the optimizer makes, and may also affect the amount of memory it requests in order to run the query. Let’s look at a quick example of how the plan changes if we join the above query to another table.

First, without statistics (so I have to turn AUTO CREATE off again, and remove the statistics that got created):

SET STATISTICS IO ON;

SELECT e.EmailAddress
FROM Person.Person p
INNER JOIN Person.EmailAddress e
    ON p.BusinessEntityID = e.BusinessEntityID
WHERE LastName = 'Fox';

Here’s the execution plan:

RowCountNoStatistics4

You can see I’ve got a Merge Join as SQL thinks I’m expecting 1680 rows from the top table. A Nested Loops join would generally be better when I only expect one or two rows from that table.

I’ve also captured the IO so I can see how expensive the query was:

Table ‘EmailAddress’. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 3818, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Let’s look at the behaviour of the same query with statistics creation enabled:

RowCountNoStatistics5

You can see we now have the desired Nested Loops join and the Clustered Index Scan on the EmailAddress table has been changed to a Seek.

The IO output is below:

Table ‘EmailAddress’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 3818, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

There’s not much difference in the overall IO, but you can see the Reads for the EmailAddress table have dropped from 10 to 2 due to the change from the Scan to the Seek. If the table was a lot bigger then we could see a large difference here.

So where does that estimate come from?

I thought I’d have a play and see if I could work out how SQL decided on that estimate of 1680 rows. I did some googling and found a suggestion that it might be a straight 9% of the total number of rows in the table, but that doesn’t quite add up and when I compared the same query pattern against a few tables I found I got a different ratio depending on the amount of rows in the table.

So I pumped rows incrementally into a fresh table and looked at the estimate and what the ratio was as the number of rows increased. Here’s my SQL for that:

--Create a Horrible Heap for my testing  
CREATE TABLE dbo.TestStats(TestVal INT NOT NULL, TestText VARCHAR(255) NULL);

--Insert a bunch of rows using dodgy old-style cross joins
INSERT INTO TestStats
SELECT TOP 1 --Amount of rows I'm going to insert
    1 , 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d, sys.objects e;

--Clear the plan cache so SQL generated a new estimate  
DBCC freeproccache;

--Query the table, then I go check the execution plan generated 
--for the estimated number of rows
SELECT * FROM dbo.TestStats
WHERE TestVal = 1;

(One thing to note was that I got the same answers whether I was querying the text column or the integer column – SQL seems to use the same algorithm for both.)

I started to notice a pattern quite quickly, that the ratio halved when the number of rows went up by a factor of 16. I then restarted my test, targeting my row-counts to be where the estimated number of rows would be a nice round number. You can see that in the table below:

RowCountNoStatistics6

I then attempted to work out a formula for that. Rather than take you through the shoddy process of mathematics that led me to an answer, I’ll just tell you that the formula came out as:

Where e is the estimated number of rows for a given predicate value, and n is the total number of rows in the table. I checked that against the full set of results I’d gathered and it held true across all values of n I’d tested.

To check it finally against my original query – the Person.Person table had 19,972 rows. I put that through the calculator with the formula and get 1680.027. If we look back at the original estimate you’ll see that SQL stated 1680.03 – so that is all good.

As I mentioned earlier I was using SQL Server 2012 for this test, and a new Cardinality Estimator came into effect in SQL 2014. So I thought I’d run the test again with SQL 2016 and see if the results changed:

RowCountNoStatistics7

We can see the estimated rows drop off a lot quicker here. Clearly Microsoft have decided to lower the estimate. Actually it is now just the square root of the total number of rows.

RowCountNoStatistics8

Hopefully you’re not in the scenario where you regularly have queries running without the appropriate statistics to support them. The above comparison though shows us that if you have such a query its behaviour could dramatically change if you upgrade your SQL Server version to 2104 or higher. It could become better or it could become a lot worse.

There are a lot of changes like this that came in with the new version of the Cardinality Estimator in 2014. Places where underlying assumptions have been adjusted to make better guesses about the number of rows that will be returned by an operator. But they are still guesses based on the same information – there is no new data being captured in the Statistics to better inform the process. Of course Microsoft has made these changes to try and better model data out in the wild – but they are still fixed assumptions, which means sometimes they will be better and sometimes they will be worse.

One thing I should re-iterate is that these formulae we’ve discovered above are for a fairly specific querying pattern. There’s no guarantee that the calculation will be the same for a similar – but different query. It might be interesting to explore that further in a later post.

Also there may be other information in your database – such as constraints – that SQL can use to educate its guesses.

The main takeaway from all of this though, should of course be:

MAKE SURE AUTO CREATE STATISTICS IS TURNED ON FOR YOUR DATABASES!

Statistics and Cardinality Estimation

June 20, 2017February 1, 2021 ~ Matthew McGiffen ~ 1 Comment

There is an updated version of this post here.

Cardinality

This is a term originally from Mathematics, generally defined as “The number of objects in a given set or grouping”. In SQL we’re continually dealing with sets so this becomes a very relevant topic, which in our context is just the “number of rows”.

When you have a query across multiple tables there any many ways in which SQL Server could decide to physically go about getting you the results you want. It could query and join the tables in any order and it could use different methods for matching records from the various tables you have joined together. It also needs to know how much memory to allocate for the operation – and to do that it needs to have an idea of the amount of data generated at each stage of processing.

A lot of this requires cardinality estimation, and SQL Server uses something called Statistics objects to perform that calculation.

Let’s look at a simple example:SELECT * FROM Person.Person p INNER JOIN Person.[Address] a ON p.AddressId = a.AddressId WHERE p.LastName = 'Smith' AND a.City = 'Bristol'

When it comes to gathering the results for this query there are a number of ways the database engine could go about it. For instance:

a) It could find all the records in the Person table with a LastName of Smith, look each of their addresses up and return only the ones who live in Bristol.
b) It could find all the Addresses in Bristol, look up the people associated with each address and return only the ones called Smith.
c) It could grab the set of people called Smith from the People table, grab all the addresses in Bristol, and only finally match up the records between those two sets.

Which of those operations is going to be most efficient depends very much on the number of records returned from each table. Let’s say that we have a million people called Smith, but there’s only one address in our whole database that is in Bristol (and let’s say that address does actually belong to someone called Smith).

In the first method above I would grab the million Smiths and then look their address up one by one in the address table until I found the one that lived in Bristol.

If I used method b) though, I would find the one matching address first, and then I would simply look up the owner of that address. Clearly in this rather contrived example, that’s going to be a lot quicker. So if SQL knows ahead of time roughly how many records to expect from each part of the query, hopefully it can make a good decision about how to get the data.

But how can it work out how many rows will be returned without actually running the query?

Statistics

That’s where statistics objects come in. SQL Server maintains in the background data that equates to a histogram showing the distribution of the data in certain columns within a table. It does this any time you create an index – statistics will be generated on the columns the index is defined against, but it also does it any time it determines that it would be useful. So if SQL encounters a Where clause on Person.LastName – and that column isn’t involved in a useful index, SQL is likely to generate a statistics object to tell it about the distribution of data in that column.

I say “likely to” because it actually depends on the settings of your SQL instance. Server configuration is beyond the scope of this post but suffice to say you can let SQL automatically create Statistics objects – or not. You can let it automatically update them when the data has changed by more than a given threshold – or not. And you can specify whether updates to statistics should happen asynchronously or synchronously – i.e. in the latter case if your query determines that statistics needs updating then it will kick that off and wait until the update is complete before processing the query.

It’s generally recommended that auto creation and updating is on, and async updating is off.

Viewing Statistics Objects
Let’s have a look at some actual statistics and see what they hold. There are a couple of ways of doing this, the first is through SSMS. If you look under a table in the object browser you will see a Statistics folder which holds any statistics objects relating to that table:

Statistics1

In the above example you can see some that have friendly names, these are Statistics that are related to an actual index that has been defined on the table and they have the same name as the Index – e.g. IX_Address_StateProvinceId.

You’ll also see we have some prefixed _WA_Sys and then with some random numbers following. These are statistics objects that SQL has created automatically on columns that aren’t indexed – or at least they weren’t indexed at the time the Statistics objects were created.

You can open these up with a double-click and see what’s inside:

Statistics2

This is the General tab you open up to. You’ll see it tells you what table the Statistics are for and what column(s). There are options for adding columns and changing the order – but you never really need to do this – as well as information to tell you when the statistics were last updated, and a check box if you want to update them now.

In the details tab there’s a lot more info:

Statistics3

I don’t find this the easiest display format to work with though, so rather than delving into what everything means here let’s look at the other way you can view statistics, which is by running the following command:

DBCC SHOW_STATISTICS('Person.Address', '_WA_Sys_00000004_164452B1')

The format is straightforward, you just specify the table you are interested in the Statistics for, and the actual name of the Statistics object you want. You can see the same information as if you double-clicked on it, but the results are output in the results pane like any other query and are (I think) a lot easier to read. Allegedly there will soon be a third way in SQL Server to view Statistics as DBCC commands are considered a bit “clunky” – but we don’t know what that will look like yet.

The command outputs three resultsets:

Statistics4

This post is just an introduction to statistics – and generally you don’t need to know that much, it’s just handy to understand the basics. So let’s just run over the key bits of information you can see above:

First of all in the first recordset – otherwise know as the…

Stats Header

Rows – is the number of rows in your table

Rows Sampled – this is how many rows were sampled to generate the statistics. SQL can generate or update statsitics using sampling rather than reading all the rows. In this case you’ll see it did actually read the whole table.

Steps – If you imagine the statistics as a bar chart – this is the number of bars on the chart. Statistics objects have a maximum of 200 steps so if you have more distinct values in your column than that they will be grouped into steps.

Density – This is supposed to be the probability of a row having a particular value (calculated as 1 / Number of Distinct values in column). According to books online “This Density value is not used by the query optimizer and is displayed for backward compatibility with versions before SQL Server 2008.” I am using SQL 2012, and this number is just plain incorrect so don’t use it…

Recordset Number 2: The Density Vector

All Density – this is the accurate version of the Density statistic described above. So your probability of a given row having a specific value is about 0.0017. That’s a little less than one in 500. I happen to know there are 575 different Cities in the table so that makes sense. Sometimes SQL will use this value to form a plan – if it knows you’re going to search this table for a specific City and it doesn’t know that City when it makes the plan, then it could guess that about 1/500th of the rows will match your criteria.

Average Length – Is what it says on the can. The average length of data in this column.

Columns – The names of any column measured in this statistics objects. You can have statistics across multiple columns but I’m not going to cover that in this post. In this case it tells us these statistics are based on the “City” column.

Recordset Number 3: The Histogram

This last recordset shows the distribution of the data, and is what you could effectively use to to draw a graph of the relative frequencies of different groups of values. Each row represents a step – or bar on the bar chart – and as mentioned above there can be a maximum of 200 steps. So you can see, a statistics object is quite lightweight, even for a massive table.

RANGE_HI_KEY – This upper limit of each step, so each step contains all the values bigger than the RANGE_HI_KEY of the last step, right up to and including this value.

RANGE_ROWS – This is how many rows in the table fall in this range – not including the number that match the RANGE_HI_KEY itself.

EQ_ROWS – The number of rows equal to the HI_KEY

DISTINCT_RANGE_ROWS – The number of different values in the range that there is data for (excluding the HI_KEY).

AVERAGE_RANGE_ROWS – The average number of rows for a given value within the range.

That’s a whistle-stop tour of the Statistics SQL Server holds on your data.

The algorithms that SQL then uses to calculate the number of rows for a given part of your query are pretty transparent when there’s just one column involved. If we look at the above example and let’s say you wanted to look up the rows where the City is “Abingdon” – the statistics tell us there is 1 matching row and that’s the figure SQL will use for cardinality estimation. Where a value is within a range then it will use a calculation based on the AVERAGE_RANGE_ROWS.

When there’s multiple columns involved it’s more complicated there are various algorithms and assumptions that come into play. If you’re interested in digging deeper, one very good resource is the Whitepaper on the 2014 Cardinality Estimator written by Joe Sack: https://www.sqlskills.com/blogs/joe/optimizing-your-query-plans-with-the-sql-server-2014-cardinality-estimator/

Conclusions

The main takeaway from this should just be the understand the extent – and limitations – of the information about the distribution of your data that SQL holds in the background.

If, when you’re tuning queries, you notice that the estimated row counts don’t match the actual, then this could be encouraging SQL to form a bad plan for the query. In these cases you might want to investigate what’s going on with the statistics.

Maybe your query is written in a way that it can’t use statistics effectively, one example of this can be where you store constant values in variables, then query using that variable in a WHERE clause. SQL will then optimise based on the average, rather than on your actual value.

Maybe the plan is based on one data value that has a very different cardinality to the one currently being queried. For instance when you first run a stored procedure, the plan is formed based on the parameters passed. Those parameters could have a cardinality that is quite different to those used in later executions.

Maybe the statistics are out of date and need refreshing. Stats get updated when approximately 20% of the data in the table changes, for a large table this can be a big threshold, so current statistics may not always hold good information about the data you are querying for.

Or maybe SQL is doing as good a job as it can with the information it has at its disposal. You might need to work out how you can give it a little extra help.

This is in only intended to be an introduction to Statistics objects. If you want to find out more, your favourite search engine is your friend. For the moment, I hope this makes it clear what Statistics objects are, what they are used for, and why they are important.

What’s an “Index” and how do they work?

June 12, 2017 ~ Matthew McGiffen ~ 2 Comments

What is an Index?

We often hear indexes explained using the analogy of an index in the back of a book. You want to find the information about “rabbits” for instance – and so you look that up in the back and find the list of pages that talk about rabbits.

That’s kind of a handy description but I find it’s not the best for understanding what’s really going on. I often describe an index as a subset or different representation of the data in your table, ordered to be of particular benefit to querying the data in a certain way. Maybe you want to query the data by Created Date – let’s have an index then in the order of that date to make it easy to find stuff.

But important to understand is…

Your table is itself an index
At least I hope so. When you create a table, hopefully you define a primary key. When you do so SQL Server will automatically define what’s called a Clustered Index based on whatever you’ve specified as the primary key. The Clustered Index (and you can only have one on a table for obvious reasons) defines the order that the data in the table will be stored on disk. The data itself is stored in a structure that is the same as any other index you define on the table (we call these non-clustered indexes) just with a few differences in exactly what is stored.

So there is no separate “table” stored on the disk in some manner, there are just indexes – the Clustered one and any non-Clustered indexes you create.

I say I hope though – because you might have decided – or have neglected – to define a primary key. Or you might have deliberately stopped SQL from creating a Clustered index. Then you are saying you don’t want to specify in what order the data is stored – “Just chuck it down there, I don’t care!” and then you have what is known as a Heap. A Heap is just a pile of data in no specified order, and whenever you want to read from it you’re going to have to scan the whole thing to find whatever you want.

We don’t like Heaps!

You might think “well, that sounds like a quick way of writing data to disk, and I’m hardly ever going to read the data from this table but I write to it a lot, so a heap sounds great”.

True, there are some edge cases where a heap is appropriate, but they really are edge cases. If you ever think a heap is the way to go, then set up a scalability test to prove it achieves what you want.

In general when you’re creating a new table, you want to think carefully about what the primary key and clustered index should be (maybe they should, but maybe they shouldn’t be the same), based on how the data is going to be written and read.

What does an index look like?
So you’ve got a table, and its physical representation on disk is as a clustered index, the picture below shows how that actually looks in logical terms:

Clustered_Index

This something called a B-tree, the B is generally agreed to stand for “balanced”, the idea being that you have to traverse the same amount of steps to find any piece of data. I don’t want to go into process this in ultimate depth, but let’s cover what’s useful.

We have a Root Node and an Intermediate Level – these (like all data in SQL Server) are made up of 8KB pages. At the root and intermediate level these just contain the Clustered Key ranges (as noted above this defaults to the Primary Key you defined) and pointers where to find those Key values on the next level down. Then we have the leaf level which contains the data itself, 8KB pages containing the complete rows of data. So if we’re looking for a particular record e.g. Id = 12, we can route quickly through the top levels to find exactly the page we want. This is much quicker than having to scan all the Leaf pages.

A non-clustered index is basically the same structure:

NonClustered_Index

You’ll see that the Root and Intermediate levels contain both the Index Key and the Clustered key. The Index key is whatever you decided to define an index on – in the above example this might be Created Date, and assuming we’re searching based on that SQL will use this index to quickly find the leaf page containing the data we want.

The main difference is at the leaf level. Here we don’t hold the complete row of data. We hold the Index key and the clustered key. If your query just wants to grab the clustered key value (often the Id) then that is fine. If it wants additional data then it can use that key value to look up the data you require in the Clustered Index. So effectively your query uses this index to gather the Clustered Index value (or values) for the rows it wants and then uses the Clustered index to get any extra data it wants.

You’ll see that can mean that we’re actually using two indexes to satisfy a query. That’s why we also have at the leaf level what you’ll see referred to as “Included Columns”. When you define an index you can choose extra columns to be stored in the leaf level pages (using the INCLUDE statement). Let’s say I’m looking at a table called Users and I want to be able to grab the email address for a user called “mcgiffen”. I can create an non-clustered index on Username and tell SQL to include the column EmailAddress. Then the query will find all the data it needs in the non-clustered index and doesn’t need to refer back to the clustered one.

When we create a non-clustered index that will serve all the needs of a particular query in this way (or set of queries) we call that a covering index for that query. A query is always going to be faster if it has a covering index.

But doesn’t that mean I’m writing the same data in multiple places?
You’ll see from the above why I refer to an index as a subset/different representation of the table itself (which hopefully is also an index). And yes – that does mean that when you’re writing to a table with multiple indexes you are actually writing some of the same data to multiple separate structures. I know… that sounds horribly inefficient.

Let’s think about a table with 100,000 rows containing information about people. Let’s say I sometimes access the data by Id value and that is the primary and clustered key. But I also access it based on the person’s name and that’s equally often so I’ve created an index on their name.

I also read from the table a lot more than I write to it, I want to access the data about a person far more often than I want to change it. Sure, when I write to it, I have to write some of the data twice, but if I didn’t have that extra index on their name, then every time I wanted to query by a person’s name I’d have to read all 100,000 rows. So I have to do one more write, but I save all those reads. That’s a pretty good trade off.

It does mean though that you want to be effective with your indexing, you don’t just want to throw every index you can think of at a table and you want to try and make indexes that can satisfy the needs of multiple queries.

From SQL Server 2008 you can have up to a thousand indexes on a table. Generally though, up to 4 seems reasonable – and if you’ve got more than 15 then that’s raising a red flag.

However there’s no hard and fast answer about how many indexes you should have, it depends on the balance of reads and writes, and it depends on the different types of access you have to the table. In general, it is good to have an index on any column that acts as a Foreign Key, and columns that will be used as the main predicates when querying a table are also good candidates to be indexed.

But be careful when adding a new index and monitor to make sure it hasn’t actually slowed things down – not just the query you’re working on but any other existing queries that use the same table as your index may change the way they work too.

Statistics Parser

May 31, 2017 ~ Matthew McGiffen ~ 1 Comment

In the last post we looked at using the STATISTICS IO and STATISTICS TIME commands to measure query performance.

If you’ve started using these, you may notice that once you start to troubleshoot longer scripts or stored procedures that reference lots of tables or have multiple queries, you start getting lots of output. Sometimes it can sprawl across several screens worth. That can make it a a bit of a pain to total up the numbers, or to quickly see if a change you have made has had a positive effect.

Here’s the STATISTICS output from a troublesome query I was looking at recently:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 140 ms, elapsed time = 140 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.
(188 row(s) affected)

Table ‘ParameterValue’. Scan count 14, logical reads 2350, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘ParameterDefinition’. Scan count 8, logical reads 456, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 8, logical reads 105600, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘ParameterDataType’. Scan count 0, logical reads 376, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 2371 ms, elapsed time = 655 ms.

SQL Server Execution Times:
CPU time = 2511 ms, elapsed time = 797 ms.

Now that’s not too bad, and it’s fairly simple for me to add up the numbers of reads. But you also can find yourself copying and pasting the output from several iterations of changes into notepad documents or query windows so you can compare them later, and it can become a bit of a pain to keep track of.

That’s what I used to do, until I discovered the STATISTICS parser tool.

This a free web-based tool to help you keep track of your STATISTICS output, which will also format the results nicely for you and produce totals. It’s very simple and easy to use:

http://statisticsparser.com/
StatisticsParser1
You simply paste the STATISTICS output into the big box and hit the “Parse” button. You can also enter a Page Title that will become the name of the Tab in your browser to make it easier to keep track of different sets of output.

The parser will turn the set of output for each query into an easy to read table and then will also provide a Totals table that shows you (not surprisingly) the totals across all your queries. Here’s the output above formatted via the tool (I am just showing the final totals):

StatisticsParser2

Personally I find that significantly easier to read!

Watch out though, the total CPU time can be double-counted. This is because the STATISTICS TIME output can sometimes include a total itself. If I look at just the TIME sections from the output above I have:

SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 140 ms, elapsed time = 140 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 2371 ms, elapsed time = 655 ms.

SQL Server Execution Times:
CPU time = 2511 ms, elapsed time = 797 ms.

If you look carefully at that you’ll see that the last set of entries (CPU time = 2511ms, elapsed time = 797ms) are totals of the other numbers. If you then look at the Statistics Parser output above you’ll see that it has added those to the previous outputs to give a total CPU time of 5.022 seconds (5,022ms).

Still we’re mainly interested in getting the IO down so I ignore that for now. The main thing I’m interested in is the total number of reads:
108,782

I make my changes to the stored procedure, execute it and get the STATISTICS output again:

SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:

CPU time = 16 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.

(189 row(s) affected)

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

(188 row(s) affected)
Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘ParameterDefinition’. Scan count 2, logical reads 396, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘ParameterValue’. Scan count 3, logical reads 708, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘#ParameterNames_____________________________________________________________________________________________________0000000000CB’. Scan count 2, logical reads 718, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘ParameterDataType’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 78 ms, elapsed time = 80 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 94 ms, elapsed time = 83 ms.

You can see my output is getting longer and starting to become more challenging to read – I broke the original query down into a few simpler steps, but each produces its own output. No worries, I open up a new tab in my browser and go to a second instance of the Statistics Parser. I give this one the Page Title “After”. I paste in my output and I hit parse:

StatisticsParser3

You can now see I have a “Before” and an “After” tab so it’s easy to flick back and forth if I’ve forgotten my totals.

I can now easily see that my total number of reads – previously 108,782 – is down to 1,824. That’s about 50 times better. The CPU time is down by a factor of about 25, not as good as the reads, but I call that good enough and proclaim that tuning task complete!

Measuring SQL Query Performance

May 24, 2017 ~ Matthew McGiffen ~ 1 Comment

When running workshops on the subject of SQL Query Performance the first question I always ask is how do we measure the performance of a query? Or to look at it another way – how do we measure the resources it consumes?

Attendees have come up with a range of answers from “With a stopwatch” (which I like a lot) to the slightly more technical “Using Profiler”. I like the first answer just because that’s where we all start. We run something from an application and we literally time how long it takes, or we run something in SSMS and we use the counter near the bottom right of our query window to tell us how long it took. We may even run something 100 times in a loop and capture the overall time so we can take an average.

SQL Server Profiler is also good when trying to capture what is actually happening in a given environment.

But there is a better way…

Using STATISTICS commands

When we’re working on a query in SSMS there is an easier, more accurate, and more detailed approach than any of the others available, which is to use the following two commands:
SET STATISTICS TIME ON;

This command tells SSMS to capture the consumed CPU time and elapsed time related to any queries you subsequently run in that query window. The output is sent to the Messages tab.

SET STATISTICS IO ON;

This command tell SSMS to capture the amount of data read from disk or memory by any subsequent queries in that window. Again the output is sent to the Messages tab.

Let’s look at some quick examples. For these I’ll use the StackOverflow database (here’s a post from Brent Ozar on how you can download this database for your own use https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/):

SET STATISTICS TIME ON;
SELECT COUNT(*) FROM dbo.Users;

Output:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 1 ms.
(1 row(s) affected)

SQL Server Execution Times:
CPU time = 422 ms, elapsed time = 2296 ms.

You can see we now have a fairly accurate measure of how long that query took to run (and how long to parse and compile). The CPU time is how much CPU is used, the elapsed time is how much time the query took overall. You’ll see the elapsed time is much longer the the CPU time, this generally means SQL is waiting for something the rest of the time, perhaps for data to be read from disk or memory, or perhaps something else (like locking and blocking).

Now let’s look at STATISTICS IO for the same query.
SET STATISTICS IO ON;
SELECT COUNT(*) FROM dbo.Users;

Output:
Table ‘Users’. Scan count 5, logical reads 24914, physical reads 1, read-ahead reads 24770, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

You’ll see that there’s a bit more info here, let’s run through each piece in order:

Table – First we have the name of the table the reads are being generated from – when we have a bunch of tables referenced this can help us to work out where to focus our tuning efforts.
Scan Count – In simple terms this is the amount of times the table was accessed. This isn’t always a massively useful metric. In can also be Zero in some specifics scenarios such as searching for one specific value in a clustered index.
Logical Reads – this is the number of data “pages” read from memory. A data page is 8KB in size, so this query has read about 200MB from the User table (which is basically the whole table).
Physical Reads – This is the number of data pages read from disk.
Read-Ahead Reads – These are also physical reads. In this case pages read from disk, but in theory read ahead of time e.g. while the processor is doing something else and doesn’t need them yet. In practice SQL has started to read these before it needs them, but is still reading them as the query is consuming them so we do end up waiting for the disk to supply most of the data. If we look back at the CPU time and elapsed time above, it is a pretty reasonable assumption that the time taken up waiting for these physical reads accounts for much of the difference between the CPU time and elapsed time.
LOB – The columns are repeated for LOB data, which is “Large Object” data such as varbinary(max) columns and other such “blobs”. These get stored outside of the main data page. We’re not going to worry about these too much for the moment.

Physical vs Logical Reads

It is worth knowing that SQL (in general) will try to retain data it has read in memory for as long as possible. This is because Logical Reads (from memory) are much much faster than Physical Reads (from disk). You may have heard people saying that SQL Server is very aggressive about using memory, or heard complaints like “SQL is consuming all the memory on the server”. This is why. SQL want to do as much querying as possible from memory and save on those expensive physical reads. We allocate a maximum amount of memory to SQL Server and over time it will use as much of that allowance as it can. This is also one of the reasons why querying slows down after you restart SQL Server, all your queries are going to go back to reading from disk until SQL has that data in memory once more.

It’s also good to understand that in the numbers above SQL is not reading one set of data from disk and another set from memory. Rather it is a two stage process, required data is read from disk into memory (the physical/read-ahead reads) and then the whole set is read from memory (the logical reads). So the number of logical reads includes the number of pages read first from disk.

We can see that effect if we run the query again. This time I’ll include both commands:

SET STATISTICS TIME ON; SET STATISTICS IO ON; SELECT COUNT(*) FROM dbo.Users;

Output:
Table ‘Users’. Scan count 5, logical reads 24982, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 546 ms, elapsed time = 164 ms.

You’ll see our physical reads ( including read-ahead reads) have now dropped to zero. This is because we’ve already read this data from disk the first time we ran the query and SQL has retained the data in memory. Logical reads have stayed about the same. You’ll hopefully also see that our CPU time is reasonably similar at around 500ms, but our elapsed time has dropped dramatically as we no longer had to wait for slow reads from disk.

You might also notice that the elapsed time is now actually shorter than the CPU time which can be a bit confusing. The query has consumed 546ms of CPU time, but only took 164ms from beginning to end. How than this be? The answer is that our query has gone parallel over several CPU cores. Counting all the records in a large table is actually a pretty resource hungry operation as you can see, so SQL has decided to save time by breaking the task into pieces and running those sub tasks concurrently.

I should probably add a caveat here in that I have actually seen STATISTICS TIME return results where the elapsed time is marginally shorter than the CPU time even when the query is single-threaded. I queried this with a performance tuning specialist from Microsoft who I was working with at one of our clients and he admitted that these figures weren’t always 100% accurate – they are however the most accurate you’ll get.

Flushing Data From Memory

You might sometimes want to see how your query would perform if it’s having to read all its data from disk. This might be more appropriate to how it is likely to be working in the wild. In that case you can run the following command which will flush all the data from memory (so don’t run it on a production environment!).
DBCC DROPCLEANBUFFERS;

If you execute this and then were to re-run the above query and capture the IO you’d see that the physical/read-ahead reads are back.

In SQL Server the area of memory where data is stored is referred to as the Buffer Cache. Within that Cache we have clean pages where the data in memory is the same as the data on disk, and dirty pages which are pages that contain modifications that have yet to be written to disk. The above command only flushes (or drops) the clean pages (or buffers). Just in case you were wondering why it was called DROPCLEANBUFFERS…

A Slightly More Complex Example

The above example with just the one table should make it clear what these commands do, but let’s look at a query that goes across a couple of tables and does a bit more work so you can see the sort of output you’ll get:

SET STATISTICS TIME ON; SET STATISTICS IO ON; SELECT TOP 10 u.Id, u.DisplayName, COUNT(p.Id) AS NumberOfPosts FROM dbo.Users u INNER JOIN dbo.Posts p ON u.Id = p.OwnerUserId GROUP BY u.Id, u.DisplayName ORDER BY NumberOfPosts DESC;

Output:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 54 ms.
(10 row(s) affected)

Table ‘Posts’. Scan count 5, logical reads 51352, physical reads 1, read-ahead reads 51192, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Users’. Scan count 0, logical reads 94, physical reads 45, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 336, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 8002 ms, elapsed time = 5597 ms.

You can see here that the IO for both tables referenced in the query has been captured. This is very useful in telling us where the bulk of the work is being done.

Notable here also are the references to “Worktable” and “Workfile”. These often appear in the output to STATISTICS IO for more complex queries. In general it is worth thinking of these as internal objects SQL uses to store temporary resultsets while processing your query. The difference between the two is the type of resultsets they store and for which particular operations. This can make it a bit more challenging to work out which part of your query is generating the most work when the highest number of reads appear against these. You end up needing to look at the execution plan and trying to work out what operations these could be related to. We may get to more of that in a later post.

Using this information to Tune your Queries

To close, I just want to mention how you should use this information to focus your efforts when tuning queries. Intuitively we all think in terms of time when we think in terms of query performance, but in practice it is much better to focus on the output of STATISTICS IO when you are trying to make improvements.

The IO gives you much more information, and while figures for both CPU and IO will vary depending on what is going on on the server, the IO figures are more static and therefore you can have more confidence that changes you see are the results of your tuning efforts. It also allows you to focus on tables one at a time as you tune multi-table queries.

Both IO and TIME are measures of work being done on a server. If you can reduce the number of reads generated by your query then that is generally the best measure of improvement and will likely have similar benefits to the overall time taken by the query.

So in your tuning efforts, focus on the logical and physical reads and try to get the totals for these down as low as possible. Generally you’ll start with operations on the table that generates the highest number of reads and go on from there.

Splitting Strings on SQL Server 2016

May 9, 2017 ~ Matthew McGiffen ~ Leave a comment

A small change, but a great one, in SQL 2016 is native support for splitting strings.

This has to be about the most common user defined function that people write in SQL Server. I’ve certainly seen it across many companies and clients that I’ve worked for over the years. From use in SSRS, or passing multiple values to stored procedures it’s been fairly ubiquitous.

There are countless blog posts about the best way to do this, it’s virtually a competition amongst the best and brightest SQL minds.

From Jeff Moden’s Tally versions:
http://www.sqlservercentral.com/articles/Tally+Table/72993/

To Adam Machanic’s CLR one:
http://sqlblog.com/blogs/adam_machanic/archive/2009/04/28/sqlclr-string-splitting-part-2-even-faster-even-more-scalable.aspx

Aaron Bertrand has a great post here that outlines many of these methods and does a performance comparison.
https://sqlperformance.com/2012/07/t-sql-queries/split-strings
(SPOILER : Adam Machanic wins)

With SQL 2016 we have a new way, and apparently it’s even substantially faster that Adam M’s CLR. Figures from Aaron again:
https://sqlperformance.com/2016/03/sql-server-2016/string-split

STRING_SPLIT

This couldn’t be simpler, but I’ll detail it here briefly. The syntax is:
STRING_SPLIT (string, separator)So in a basic example:
DECLARE @SplitMyString VARCHAR(255) = 'Value1,Value2,Value3,Value4'; SELECT * FROM STRING_SPLIT(@SplitMyString,',');

And the results:

Easy Peasy!

Matthew McGiffen Data

Uncategorized

Automatic Sample Sizes for Statistics Updates

Statistics and the Ascending Key Problem

The Problem

So what’s the fix?

So should you turn on the traceflag(s) globally?

Welcome to SQL Server 2014 (and later)

Think twice before using table variables

T-SQL Tuesday

Table Variables vs Temp Tables

What’s the big difference?

Example 1: Getting that estimation badly wrong

Example 2: Selecting a dodgy execution plan

When do Distribution Statistics Get Updated?

Row-count Estimates when there are no Statistics

Statistics and Cardinality Estimation

Cardinality

Statistics

Stats Header

Recordset Number 2: The Density Vector

Recordset Number 3: The Histogram

Conclusions

What’s an “Index” and how do they work?

What is an Index?

Statistics Parser

Measuring SQL Query Performance

Using STATISTICS commands

Physical vs Logical Reads

Flushing Data From Memory

A Slightly More Complex Example

Using this information to Tune your Queries

Splitting Strings on SQL Server 2016

STRING_SPLIT