Automatic Sample Sizes for Statistics Updates

February 24, 2021May 10, 2023 ~ Matthew McGiffen ~ 3 Comments

Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deep understanding of how Statistics work really helps when you are performance tuning

I mentioned in my previous post about manually updating statistics that you can specify whether they’re updated using a full scan, or you can specify an amount of data to sample, either a percentage of the table size, or a fixed number of rows. You can also choose not to specify this, and SQL Server will decide for you whether to do a full scan, or to sample a certain amount of data.

I thought it would be interesting to look at what the sample sizes are that SQL will choose to use, depending on the amount of data in your table. Note that this occurs if you update statistics without specifying how they should be sampled as below:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2;

This is also the behaviour you will get when SQL updates statistics through the auto-stats mechanism. The fact that auto-stats may sample at a lower rate than is optimal for a given table and the queries against it is another reason you may choose to perform manual statistics updates.

To test this, I created a table and progressively pumped data in. Then after I inserted each batch of rows:

I Ran a stats update capturing the CPU time taken
Checked the statistics to see what sample size was used in the update
Checked the size of the index

Here’s some of the code I used for the test:

--Drop table if exists
IF (OBJECT_ID('dbo.Test')) IS NOT NULL DROP TABLE dbo.Test;

--Create table for Testing
CREATE TABLE dbo.Test(
   Id INT IDENTITY(1,1) CONSTRAINT PK_Test PRIMARY KEY CLUSTERED, 
   TextValue VARCHAR(20) NULL
);

--Create index on TextValue
CREATE INDEX IX_Test_TextValue ON dbo.Test(TextValue);

--Insert a bunch of rows
INSERT INTO dbo.Test(TEXTValue) 
SELECT TOP 100000 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d;  

--Update statistics without specifying how many rows to sample
SET STATISTICS TIME ON;
UPDATE STATISTICS dbo.Test IX_Test_TextValue;
SET STATISTICS TIME OFF;

--View the statistics
DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER;

--Check the size of the index
SELECT
    i.name                  AS IndexName,
    SUM(s.used_page_count)   AS Pages
FROM sys.dm_db_partition_stats  AS s 
JOIN sys.indexes                AS i
ON s.[object_id] = i.[object_id] AND s.index_id = i.index_id
WHERE i.name = 'IX_Test_TextValue'
GROUP BY i.name

The results of my testing are shown in the below table:

StatisticsSamplingRates

You can see that we have a full sample being taken for the statistics updates up to 4000,000 records (896 pages) but that once the table size hits 500,000 sampling is happening. If you look at the number of pages you will see we now have over 1,000 pages, 1000 pages being about 8MB of data, which is the threshold that sampling kicks in.

I wasn’t able to find a nice neat formula to determine the sampling rate based on table size, but if we look at the above figures there are still some insights to be gained. The main one is that you’ll notice that even as we double the table size, the number of rows sampled doesn’t go up by much. For instance from 500,000 to a million rows, only 10,000 more rows are sampled. This also means that even for pretty large tables, the update isn’t taking long – another reason why it’s worth leaving auto stats updates enabled and running synchronously with queries – they’re generally not going to take that long.

Another insight is that the percentage of rows sampled drops off very quickly. As the sample size doesn’t really increase that much even when the table size doubles – the percentage sampled has almost halved each time.

Manually updating Statistics

February 16, 2021May 10, 2023 ~ Matthew McGiffen ~ 3 Comments

Having up to date statistics is vital for getting the best performance out of your queries. Even though SQL Server automatically updates statistics in the background for you (When do statistics get updated?), you may find there are times when you want to manage updating them yourself.

You may have large tables and find that the interval between the automatic updates is too big and is resulting in sub-optimal query plans.
You might need timely updates to a specific object – maybe as part of an ETL process to make sure that statistics are up to date after a particular part of the process, perhaps after a daily load into a table.
You may find that the automatic updates look at too small a sample size and you need to scan more of the table to enable accurate estimates for your queries.

My previous post on the Ascending Key problem demonstrated a common issue where the first of these scenarios could be affecting you. We’ll look at examples of the other scenarios in subsequent posts.

For now though, let’s just look at how you go about updating statistics.

UPDATE STATISTICS

At the heart of all the methods we’ll look at is the UPDATE STATISTICS command. There are a lot of options for using this command, but we’ll just focus on the ones you’re most likely to use. For full documentation here is the official reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/update-statistics-transact-sql

You can use this at the table level in which case all statistics objects belonging to that table will be updated:

UPDATE STATISTICS dbo.Test;

Or you can specify the specific statistics object you want to update:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2;

FULL SCAN or SAMPLE

The other thing you may be likely to want to specify is whether the statistics should be updated using a full scan of the table, or just be looking at a sample of the rows. In the above examples we didn’t specify this so SQL Server will decide for us. In general sampling (as opposed to full scans) kicks in when we have about 8MB of data in the table (or about 1000 pages).

If you want to specify a full scan the syntax is as follows:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH FULLSCAN;

If you want the statistics update to use sampling (more on how this works in subsequent posts) then you can choose to specify a percentage of the total table to be sampled:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH SAMPLE 10 PERCENT;

Or you can specify a fixed number of rows:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH SAMPLE 10000 ROWS;

You might want to use a sample as once your tables get large full scans can take a little time. Equally though if you’re updating statistics in a quieter time (e.g. overnight) you may feel you can afford the extra time for the full scans. Here’s some comparison figures I produced on my local machine showing how long full scans take. Obviously this will change depending on your hardware and other factors including how much of the table is already in memory:

FullScanTimes

You can see however that the time taken pretty much scales linearly as the table size increases.

One thing to be aware of is parallelism. A full scan can run as a parallel operation on your server which can speed it up considerably. When you update statistics using a sample however this can only run single-threaded unless you’re on SQL Server 2016 (or higher). Sampling data to build statistics in parallel was one of the many excellent little improvements in 2016.

sp_UpdateStats

This is another method you might use for manually updating statistics (perhaps as part of a scheduled maintenance job). This system stored procedure can be used for updating all of the statistics objects in a database:

USE Test;
EXEC sp_UpdateStats;

This stored procedure iterates through your database using a WHILE loop and executes the UPDATE STATISTICS command as it goes. One nifty thing about using this procedure is that it only updates statistics objects where rows have changed, so you don’t have any overhead for refreshing statistics where the underlying data hasn’t been modified. You can see this from this extract of the output of the stored procedure:

Updating [dbo].[TestMemory]
[PK__TestMemo__3214EC070D799003], update is not necessary…
0 index(es)/statistic(s) have been updated, 1 did not require update.

Updating [dbo].[TestMemory2]
[PK__TestMemo__3214EC07D3DC52DE], update is not necessary…
0 index(es)/statistic(s) have been updated, 1 did not require update.

Updating [dbo].[Test]
[PK_Test] has been updated…
[IX_Test_TextValue] has been updated…
2 index(es)/statistic(s) have been updated, 0 did not require update.

Ola Hallengren

Of course, if you’re looking to implement statistics update as part of regular maintenance, then you should definitely be considering using Ola Hallengren’s maintenance solution. Ola maintains a great solution for managing database backups and integrity checks as well index and statistics maintenance, and he shares it with the SQL community for free.

You can find full details of Ola’s solution and download it here:
https://ola.hallengren.com/

Book Review: Learn dbatools in a Month of Lunches

February 15, 2021May 10, 2023 ~ Matthew McGiffen ~ Leave a comment

If you work in the world of SQL Server you’ve almost certainly heard of dbatools. For those who haven’t, it’s an open source PowerShell module for automating literally hundreds of tasks on your database instances.

What the rest of you may or may not know, is that the creators of dbatools have been working on a book to make it easy to get started with the tool – Learn dbatools in a Month of Lunches

If you’re saying to yourself, it all sounds good, but I don’t really know PowerShell that well, I’m not confident how I work with modules, open source and otherwise, it all seems like a lot to learn for things I can do already another way… then I highly recommend you check out this book.

Written by Chrissy LeMaire and Rob Sewell this book takes you through the steps to get going before leading you through many of the tasks that dbatools can help you with.

Speaking for myself, I’m not a PowerShell afficionado, I’d struggle to write more than a few basic commands without googling the syntax. Following this book though made it really easy to get going, and I learnt a lot of PowerShell fundamentals along the way. It was certainly a lot easier than if I’d had to figure out things for myself, and it felt like I was aquiring a more complete knowledge that will stand me in good stead. Within a very short time of accessing the live book in the browser, I was set up and running commands with dbatools.

The book isn’t complete yet, but is being published through the Manning Early Access Programme, which means you can access chapters as they are published. Each chapter is a nice bite sized chunk you can consume fairly quickly and painlessly. Everything is explained really clearly – and there are comprehensive code samples you can work through.

If you’re a DBA who is aware of all this stuff, not sure you really need it, but have a niggling doubt that you’re getting left behind and probably should get around to learning it at some point, grab the book and start working through a few chapters. You’ll be operating with confidence in no time.

You can find out more about the book and access some preview content through the Manning website:

Learn dbatools in a Month of Lunches

Statistics and the Ascending Key Problem

February 9, 2021May 10, 2023 ~ Matthew McGiffen ~ 5 Comments

I’ve mentioned previously how not having up to date statistics can cause problems in query performance. This post looks at something called the Ascending Key Problem which can badly affect your cardinality estimation in some cases and therefore your execution plans.

The Ascending Key Problem relates to the most recently inserted data in your table which is therefore also the data that may not have been sampled and included in the statistics histograms. This sort of issue is one of the reasons it can be critical to update your statistics more regularly than the built-in automatic thresholds.

We’ll look at the problem itself, but also some of the mitigations that you can take to deal with it within SQL Server.

The Problem

Imagine you have a table that stores a set of events. As new records are inserted they get stamped with the current date and time. You regularly query that table based on that EventDate looking to find recent events, let’s say just for the current day.

Even if you haven’t indexed the EventDate column (though why haven’t you?!), as long as you have AUTO CREATE STATISTICS and AUTO UPDATE STATISTICS on for your database you’ll have “reasonably” up to date statistics for that column.

But “reasonably” may not be good enough. When you’re querying against the most recent data it may not yet have been sampled by a statistics update and the range you are looking for may fall beyond the top of the histogram captured in the statistics object for EventDate. Imagine that statistics were last updated yesterday. When the Optimizer checks the statistics to estimate a rowcount for today’s date it finds that is above the top bound. So what should it guess?

Historically it would guess that there were zero rows, but as always the cardinality estimation gets set to the minimum of 1. If the real answer is a lot larger you might end up with a bad plan.

Let’s look at that in practice.

Staying true to the example above, I create a table called Events and I index the EventDate column:

CREATE TABLE dbo.Events
(
Id INT IDENTITY(1,1) CONSTRAINT PK_Events PRIMARY KEY CLUSTERED,
EventName VARCHAR(255) NOT NULL,
EventDate DATETIME CONSTRAINT DF_Events_EventDate DEFAULT (GETDATE())
);

CREATE INDEX IX_Events_EventDate ON dbo.Events(EventDate) include (EventName);

Then I insert records to represent events at one minute intervals for 100 days:

--Insert data for 100 days at minute intervals from the start of this year
DECLARE @StartDate DATETIME = '20170101 00:00.00';

INSERT INTO dbo.Events(EventName, EventDate)
SELECT
    'Event' + CAST(num.n AS VARCHAR(10)),
    DATEADD(minute,num.n,@StartDate)
FROM 
(
    SELECT TOP 144000 row_number() OVER(ORDER BY (SELECT NULL)) AS n 
    FROM sys.objects a, sys.objects b, sys.objects c
) num;

I’m going to query to check what date range was inserted. That should have the additional advantage of triggering a statistics update:

SELECT MIN(EventDate), MAX(EventDate)
FROM dbo.Events;

As a slight digression, it’s interesting to look at the execution plan here:

AscendingKey1

You can see two index scans. That sounds horrendous, scan the index twice to find the MIN and MAX? If you look at the properties though you can see it only read one row in each case:

AscendingKey2

An index scan doesn’t have to read all the records in the index, it can bail out once it is satisfied. For a MIN or MAX type query it makes perfect sense just to jump to one end of the index and start scanning.

The side lesson is that Scans aren’t always bad for performance.

Anyway, back to the topic in hand. Now let’s look at the statistics:

DBCC SHOW_STATISTICS('dbo.Events',IX_Events_EventDate);

AscendingKey25

You can see they’re up to date. They show 144,000 rows in total which is correct. Interestingly the Histogram (bottom result-set) only has a couple of steps. SQL has determined that the data is uniformly distributed so has bunched it altogether. Clever stuff!

Let’s insert data for another day:

--Insert one more day's data
DECLARE @StartDate DATETIME;

SELECT @StartDate = MAX(EventDate) FROM dbo.Events;

INSERT INTO dbo.Events(EventName, EventDate)
SELECT
    'Event' + CAST(num.n AS VARCHAR(10)),
    DATEADD(minute,num.n,@StartDate)
FROM 
(
    SELECT TOP 1440 row_number() OVER(ORDER BY (SELECT NULL)) AS n 
    FROM sys.objects a, sys.objects b, sys.objects c
) num;

Now I query to see the new events. I captured the MAX(EventDate) earlier so let’s use that to find the new records:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170411'
OPTION (RECOMPILE);

(Notice I’ve added the option to recompile so I get a fresh plan each time I run this, that will be important for testing)

Let’s not bother with the results, we all know there will 1,440 records that are basically the same. Here’s the execution plan:

AscendingKey3

The interesting bit comes when I look at the properties for the Index Seek:

AscendingKey4

Estimated number of rows = 1, Actual = 1,440. That’s quite a long way out. Of course here we have a trivial query so the massive underestimate isn’t affecting our plan. If we started joining to other tables though it would likely result in a massively inefficient plan – perhaps choosing a Nested Loops join over a Hash or Merge.

Note I’m using SQL Server 2012 for this test and I’m not using the Traceflag (2371) which reduces the threshold for statistics updates ( When do Statistics Get Updated? ):

So I’ve got nearly another 30,000 rows to insert before statistics get automatically updated and my estimates come into line. If I’m always querying for the current day then it’s going to be very rare that statistics are going to be able to help me with a good estimate.

So what’s the fix?

Before we get on to the methods that have been introduced to try and ameliorate this problem, if you face this sort of scenario you might want to consider whether you need to update your statistics objects more often than the auto-stats threshold. If you have a regular job to rebuild fragmented indexes then those indexes that get rebuilt will have their statistics refreshed – however that won’t cover the auto created statistics, and it won’t cover statistics for tables that get don’t get rebuilt.

So, if you don’t have a specific scheduled job to regularly update statistics that is definitely worth considering.

In terms of how SQL has changed to help us, from SQL Server 2005 SP1, the nature of columns began to be tracked, monitoring updates of statistics to understand how the data changes. This additional information can be seen if you enable traceflag 2388, then view the statistics. Let’s have a look at what’s gathered. First I’m going to add a couple more days of data, updating the statistics between each insert, then I run the following:

DBCC TRACEON(2388);
DBCC SHOW_STATISTICS('dbo.Events',IX_Events_EventDate);
DBCC TRACEOFF(2388);

What you see here is historical information about the updates to the statistics. This is undocumented stuff, but some of what we see we can work out the meaning for. In particular we can see how many rows were inserted since the last statistics update, and how many of those values were above the top of the old histogram. We also see a column “Leading Column Type” which has a value of “Unknown”.

Now I’m going to insert another day’s date and update the statistics once more, then we’ll look at this again:

AscendingKey6

You can see that now we have a Leading Column Type of “Ascending”. After three updates to the statistics where the Leading Value was only increasing each time, SQL Server will identify that it is an ascending column. It must be at least three updates before SQL will recognise this, and if that stops being the case (i.e. some lower values are inserted) then the next statistics update will reset this until we again get three consecutive updates with only increasing values.

This happens in the background and you don’t need the traceflag 2388 to make it happen –that is just so you can see what is going on.

The obvious question is, now SQL knows my column is ascending, has that affected the estimation for my query? Before we look I’ll insert another day of data so there is some data beyond the histogram, and then I’ll query again:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE);

And the properties from the execution plan:

AscendingKey7

So nope. Nothing has changed.

Yet…

To tell the query optimizer to take advantage of this extra information for ascending keys we have traceflag 2389. Let’s enable that and run the query again:

DBCC TRACEON(2389);

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE);

The properties:

AscendingKey8

Voila! SQL Server has now estimated my rowcount perfectly.

Now, be warned. This is a rather contrived example with a perfectly and artificially smooth distribution of data. The estimate is made by checking the current maximum value in the table, and using that combined with the information existing in the statistics and the value of your predicate to extrapolate a guess. If you’re data is evenly distributed as it is here then the guess will be pretty good, if it is fairly skewed then it may be bad.

In any case though it will probably be better that the fixed value of 1 that would have been used historically.

One thing to note is that traceflag 2389 is only going to have any affect if the leading column of the relevant statistics object has been marked as ascending. There is also traceflag 2390, and this will adopt a similar behaviour even if your column hasn’t been identified as ascending, i.e. it will check the maximum value in the table and if it is higher than the max value in the histogram, it will extrapolate to work out the cardinality estimate.

So should you turn on the traceflag(s) globally?

The Microsoft recommendation is not to enable traceflags such as these unless you are suffering from the specific problem they are aiming to resolve, and even then to make sure you test carefully to ensure they are achieving what you desire.

One issue can be that in more complex queries there are a number of cardinality estimates being made. It can be that two bad estimates within the same plan might cancel each other out and the query overall performs fine. If you then implement something that fixes one of them, you risk such queries going bad – a scenario known as plan regression.

This sort of scenario is one of the reasons why Microsoft have made very few core changes to the cardinality estimator since it came out.

So, use 2389 is you are specifically encountering this sort of ascending key problem, but also, if you are in the position to change the code then you might want to consider adding it as a query hint so it only affects the specific query you are targeting. For our example query above, that would simply look like:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE, QUERYTRACEON 2389);

Welcome to SQL Server 2014 (and later)

In 2014 we received a substantial revamp of the Cardinality Estimator, the first since SQL Server 7.0. A bunch of assumptions and algorithms have been re-jigged based on the real-world data that Microsoft have seen in supporting their wide customer base.

Key to having a new version was the concept that, henceforth, optimizer fixes would be tied to the database compatibility version. This means that customers can upgrade their SQL Server version but if they find performance problems related to the upgrade they can downgrade their database’s compatibility level while the issues are resolved within their codebase.

One of the items specifically looked at in the new version was this Ascending Key problem. To see how things work in the latest versions I’m going to repeat many of the steps above using a database deployed on SQL Server 2016.

So I:

Create my table again
Populate with the 100 days data
Run a query to check the dates, which has the added benefit of updating statistics
Add one more day’s data

Then I’m ready to run my test query again:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170411'
OPTION (RECOMPILE);

I get the same execution plan as ever so again I jump to the properties of the Index Seek operator to look at the estimates:

AscendingKey9

Now, this is interesting. I might have expected I would get either 1 row estimated (the old model) or 1,440 (the model with traceflag 2389). Instead I get 432 rows. It seems the new CE (Cardinality Estimator) uses a different algorithm.

Sometimes numbers stick in your head. I happen to know that where no statistics are available and you are querying with an inequality predicate (<, > , <=, >=) that the CE will estimate the number of rows to be 30% of the total number of rows in the table. This assumption seems to have originated in a 1979 research paper from IBM suggesting 1/3 was a good guess.

With 30% in my head I noticed that 432 is 30% of 1440. So it seems that the optimizer is recognising that we are querying for values above the histogram (where no statistics exist) with an inequality, it knows from somewhere that there have been 1440 rows inserted since the last statistics update, so it takes 30% of 1440 to produce the estimate (432).

To try validate that theory I thought I’d query with a later datetime in the predicate. Sure enough, if I add 12 hours I still get 432 rows estimated. If I add 23 hours, 432 rows. In fact if I query for any date in the future, even outside of the maximum value in the table, guess what – I get an estimate of 432 rows.

I have a fascination for the algorithms involved in distribution statistics. It satisfies the maths geek in me. As such it’s difficult to end a post like this, there’s always more things to test, to try and work out. For instance what happens if you query across an interval that starts within the current histogram, but then extends above it? I’ll admit I’ve had a play, but will leave that for another post.

As a very final point in this post, I thought I’d just check whether the 2389 traceflag makes any difference to this estimation with the 2014 CE. I’ll change my query to look way into the future, enable the traceflag and look at the estimate:

SELECT *
FROM dbo.Events
WHERE EventDate > '99991231 23:59:59'
OPTION (RECOMPILE, QUERYTRACEON 2389);

Guess what? Still 432 rows… so no, the traceflag doesn’t appear to still give us any

Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.extra benefit.
Though when we get to that date it’ll be someone else’s problem to sort out!

What are Statistics in SQL Server?

February 1, 2021May 10, 2023 ~ Matthew McGiffen ~ Leave a comment

Statistics are vitally important in allowing SQL Server to find the most efficient way to execute your queries. In this post we learn more about them, what they are and how they are used.

Cardinality Estimation

Cardinality is a term originally from Mathematics, generally defined as “The number of objects in a given set or grouping”. In SQL we’re continually dealing with sets so this becomes a very relevant topic, which in our context is just the “number of rows”.

When you have a query across multiple tables there any many ways in which SQL Server could decide to physically go about getting you the results you want. It could query and join the tables in any order and it could use different methods for matching records from the various tables you have joined together. It also needs to know how much memory to allocate for the operation – and to do that it needs to have an idea of the amount of data generated at each stage of processing.

A lot of this requires cardinality estimation, and SQL Server uses something called Statistics objects to perform that calculation.

Let’s look at a simple example:SELECT * FROM Person.Person p INNER JOIN Person.[Address] a ON p.AddressId = a.AddressId WHERE p.LastName = 'Smith' AND a.City = 'Bristol'

When it comes to gathering the results for this query there are a number of ways the database engine could go about it. For instance:

a) It could find all the records in the Person table with a LastName of Smith, look each of their addresses up and return only the ones who live in Bristol.
b) It could find all the Addresses in Bristol, look up the people associated with each address and return only the ones called Smith.
c) It could grab the set of people called Smith from the People table, grab all the addresses in Bristol, and only finally match up the records between those two sets.

Which of those operations is going to be most efficient depends very much on the number of records returned from each table. Let’s say that we have a million people called Smith, but there’s only one address in our whole database that is in Bristol (and let’s say that address does actually belong to someone called Smith).

In the first method above I would grab the million Smiths and then look their address up one by one in the address table until I found the one that lived in Bristol.

If I used method b) though, I would find the one matching address first, and then I would simply look up the owner of that address. Clearly in this rather contrived example, that’s going to be a lot quicker. So if SQL knows ahead of time roughly how many records to expect from each part of the query, hopefully it can make a good decision about how to get the data.

But how can it work out how many rows will be returned without actually running the query?

Statistics

That’s where statistics objects come in. SQL Server maintains in the background data that equates to a histogram showing the distribution of the data in certain columns within a table. It does this any time you create an index – statistics will be generated on the columns the index is defined against, but it also does it any time it determines that it would be useful. So if SQL encounters a Where clause on Person.LastName – and that column isn’t involved in a useful index, SQL is likely to generate a statistics object to tell it about the distribution of data in that column.

I say “likely to” because it actually depends on the settings of your SQL instance. Server configuration is beyond the scope of this post but suffice to say you can let SQL automatically create Statistics objects – or not. You can let it automatically update them when the data has changed by more than a given threshold – or not. And you can specify whether updates to statistics should happen asynchronously or synchronously – i.e. in the latter case if your query determines that statistics needs updating then it will kick that off and wait until the update is complete before processing the query.

It’s generally recommended that auto creation and updating is on, and async updating is off.

Viewing Statistics Objects
Let’s have a look at some actual statistics and see what they hold. There are a couple of ways of doing this, the first is through SSMS. If you look under a table in the object browser you will see a Statistics folder which holds any statistics objects relating to that table:

Statistics1

In the above example you can see some that have friendly names, these are Statistics that are related to an actual index that has been defined on the table and they have the same name as the Index – e.g. IX_Address_StateProvinceId.

You’ll also see we have some prefixed _WA_Sys and then with some random numbers following. These are statistics objects that SQL has created automatically on columns that aren’t indexed – or at least they weren’t indexed at the time the Statistics objects were created.

You can open these up with a double-click and see what’s inside:

Statistics2

This is the General tab you open up to. You’ll see it tells you what table the Statistics are for and what column(s). There are options for adding columns and changing the order – but you never really need to do this – as well as information to tell you when the statistics were last updated, and a check box if you want to update them now.

In the details tab there’s a lot more info:

Statistics3

I don’t find this the easiest display format to work with though, so rather than delving into what everything means here let’s look at the other way you can view statistics, which is by running the following command:

DBCC SHOW_STATISTICS('Person.Address', '_WA_Sys_00000004_164452B1')

The format is straightforward, you just specify the table you are interested in the Statistics for, and the actual name of the Statistics object you want. You can see the same information as if you double-clicked on it, but the results are output in the results pane like any other query and are (I think) a lot easier to read. Allegedly there will soon be a third way in SQL Server to view Statistics as DBCC commands are considered a bit “clunky” – but we don’t know what that will look like yet.

The command outputs three resultsets:

Statistics4

This post is just an introduction to statistics – and generally you don’t need to know that much, it’s just handy to understand the basics. So let’s just run over the key bits of information you can see above:

First of all in the first recordset – otherwise know as the…

Stats Header

Rows – is the number of rows in your table

Rows Sampled – this is how many rows were sampled to generate the statistics. SQL can generate or update statsitics using sampling rather than reading all the rows. In this case you’ll see it did actually read the whole table.

Steps – If you imagine the statistics as a bar chart – this is the number of bars on the chart. Statistics objects have a maximum of 200 steps so if you have more distinct values in your column than that they will be grouped into steps.

Density – This is supposed to be the probability of a row having a particular value (calculated as 1 / Number of Distinct values in column). According to books online “This Density value is not used by the query optimizer and is displayed for backward compatibility with versions before SQL Server 2008.” I am using SQL 2012, and this number is just plain incorrect so don’t use it…

Recordset Number 2: The Density Vector

All Density – this is the accurate version of the Density statistic described above. So your probability of a given row having a specific value is about 0.0017. That’s a little less than one in 500. I happen to know there are 575 different Cities in the table so that makes sense. Sometimes SQL will use this value to form a plan – if it knows you’re going to search this table for a specific City and it doesn’t know that City when it makes the plan, then it could guess that about 1/500th of the rows will match your criteria.

Average Length – Is what it says on the can. The average length of data in this column.

Columns – The names of any column measured in this statistics objects. You can have statistics across multiple columns but I’m not going to cover that in this post. In this case it tells us these statistics are based on the “City” column.

Recordset Number 3: The Histogram

This last recordset shows the distribution of the data, and is what you could effectively use to to draw a graph of the relative frequencies of different groups of values. Each row represents a step – or bar on the bar chart – and as mentioned above there can be a maximum of 200 steps. So you can see, a statistics object is quite lightweight, even for a massive table.

RANGE_HI_KEY – This upper limit of each step, so each step contains all the values bigger than the RANGE_HI_KEY of the last step, right up to and including this value.

RANGE_ROWS – This is how many rows in the table fall in this range – not including the number that match the RANGE_HI_KEY itself.

EQ_ROWS – The number of rows equal to the HI_KEY

DISTINCT_RANGE_ROWS – The number of different values in the range that there is data for (excluding the HI_KEY).

AVERAGE_RANGE_ROWS – The average number of rows for a given value within the range.

That’s a whistle-stop tour of the Statistics SQL Server holds on your data.

The algorithms that SQL then uses to calculate the number of rows for a given part of your query are pretty transparent when there’s just one column involved. If we look at the above example and let’s say you wanted to look up the rows where the City is “Abingdon” – the statistics tell us there is 1 matching row and that’s the figure SQL will use for cardinality estimation. Where a value is within a range then it will use a calculation based on the AVERAGE_RANGE_ROWS.

When there’s multiple columns involved it’s more complicated there are various algorithms and assumptions that come into play. If you’re interested in digging deeper, one very good resource is the Whitepaper on the 2014 Cardinality Estimator written by Joe Sack: https://www.sqlskills.com/blogs/joe/optimizing-your-query-plans-with-the-sql-server-2014-cardinality-estimator/

Conclusions

The main takeaway from this should just be the understand the extent – and limitations – of the information about the distribution of your data that SQL holds in the background.

If, when you’re tuning queries, you notice that the estimated row counts don’t match the actual, then this could be encouraging SQL to form a bad plan for the query. In these cases you might want to investigate what’s going on with the statistics.

Maybe your query is written in a way that it can’t use statistics effectively, one example of this can be where you store constant values in variables, then query using that variable in a WHERE clause. SQL will then optimise based on the average, rather than on your actual value.

Maybe the plan is based on one data value that has a very different cardinality to the one currently being queried. For instance when you first run a stored procedure, the plan is formed based on the parameters passed. Those parameters could have a cardinality that is quite different to those used in later executions.

Maybe the statistics are out of date and need refreshing. Stats get updated when approximately 20% of the data in the table changes, for a large table this can be a big threshold, so current statistics may not always hold good information about the data you are querying for.

Or maybe SQL is doing as good a job as it can with the information it has at its disposal. You might need to work out how you can give it a little extra help.

Here are some other recent posts about Statistics you may find useful:

When do Statistics get updated?

Rowcount estimates when there are no Statistics

Matthew McGiffen Data

Month: February 2021