Automatic Sample Sizes for Statistics Updates

I mentioned in my previous post about manually updating statistics that you can specify whether they’re updated using a full scan, or you can specify an amount of data to sample, either a percentage of the table size, or a fixed number of rows. You can also choose not to specify this, and SQL Server will decide for you whether to do a full scan, or to sample a certain amount of data.

I thought it would be interesting to look at what the sample sizes are that SQL will choose to use, depending on the amount of data in your table. Note that this occurs if you update statistics without specifying how they should be sampled as below:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2;

This is also the behaviour you will get when SQL updates statistics through the auto-stats mechanism. The fact that auto-stats may sample at a lower rate than is optimal for a given table and the queries against it is another reason you may choose to perform manual statistics updates.

To test this, I created a table and progressively pumped data in. Then after I inserted each batch of rows:

  • I Ran a stats update capturing the CPU time taken
  • Checked the statistics to see what sample size was used in the update
  • Checked the size of the index

Here’s some of the code I used for the test:

--Drop table if exists
IF (OBJECT_ID('dbo.Test')) IS NOT NULL DROP TABLE dbo.Test;

--Create table for Testing
CREATE TABLE dbo.Test(
   Id INT IDENTITY(1,1) CONSTRAINT PK_Test PRIMARY KEY CLUSTERED, 
   TextValue VARCHAR(20) NULL
);

--Create index on TextValue
CREATE INDEX IX_Test_TextValue ON dbo.Test(TextValue);

--Insert a bunch of rows
INSERT INTO dbo.Test(TEXTValue) 
SELECT TOP 100000 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d;  

--Update statistics without specifying how many rows to sample
SET STATISTICS TIME ON;
UPDATE STATISTICS dbo.Test IX_Test_TextValue;
SET STATISTICS TIME OFF;

--View the statistics
DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER;

--Check the size of the index
SELECT
    i.name                  AS IndexName,
    SUM(s.used_page_count)   AS Pages
FROM sys.dm_db_partition_stats  AS s 
JOIN sys.indexes                AS i
ON s.[object_id] = i.[object_id] AND s.index_id = i.index_id
WHERE i.name = 'IX_Test_TextValue'
GROUP BY i.name

 

The results of my testing are shown in the below table:

StatisticsSamplingRates

You can see that we have a full sample being taken for the statistics updates up to 4000,000 records (896 pages) but that once the table size hits 500,000 sampling is happening. If you look at the number of pages you will see we now have over 1,000 pages, 1000 pages being about 8MB of data, which is the threshold that sampling kicks in.

I wasn’t able to find a nice neat formula to determine the sampling rate based on table size, but if we look at the above figures there are still some insights to be gained. The main one is that you’ll notice that even as we double the table size, the number of rows sampled doesn’t go up by much. For instance from 500,000 to a million rows, only 10,000 more rows are sampled. This also means that even for pretty large tables, the update isn’t taking long – another reason why it’s worth leaving auto stats updates enabled and running synchronously with queries – they’re generally not going to take that long.

Another insight is that the percentage of rows sampled drops off very quickly. As the sample size doesn’t really increase that much even when the table size doubles – the percentage sampled has almost halved each time.

8 thoughts on “Automatic Sample Sizes for Statistics Updates

  1. This is great! I have been pushing this idea for a long time now with no traction, but this is the kind of systematic approach we need to take. Thank you 🙂

  2. We just saw that SQL Server 2016 SP1 CU4 adds an enhancement to lock the sampling rate. This could be useful if you have a process to set the sample rate based on size during maintenance. Norecompute can help prevent low sampling during automatic updates.

    This update introduces a new keyword PERSIST_SAMPLE_PERCENT = { ON | OFF } that’s an extension to the existing CREATE/UPDATE STATISTICS Transact-SQL. The keyword will indicate SQL Server to lock the sampling rate (provided when it’s created or updated) for future use.

Leave a Reply