In this post we look at a method using Extended Events (XE) to identify what parent objects are calling a given SQL function and how often.
The background is that I was working with a team where we identified that a certain scalar function was being executed billions of times a day and – although lightweight for a single execution – overall it was consuming significant CPU on the server. We discussed a way of improving things, but it required changing the code that called the function. The problem was that the function was used in about 700 different places across the database code – both in stored procedures and views – and the views were themselves referenced by other stored procedures. Rather than update all the code at once, the team wanted to first target the objects that execute the function the most.
The method I show here identifies the root calling object in each case and calculates the number of executions of the function that each one instigates. It comes with a few caveats. I have to capture an event for every module call made on the database – as I don't know in advance which ones will result in a call to the function – so there's likely to be CPU overhead, and if you have a large number of calls occurring there could be a fair amount of disk consumption. In my test the Extended Events (XE) session consumed about 130MB of disk space for a million calls captured. In my case I'm talking billions of calls a day, so that would be hundreds of GB. As such I plan to only run it for a small interval – starting with a minute to see what I get – and I'd prefer to run it in a test environment with a representative load.
The method uses the module_start extended event and the causality tracking feature, which allows you to track how events are related to each other. Let's go ahead and create the session. First I find Extended Events > Sessions in SSMS, then right-click and select “New Session”.
In the first page that comes up I just give the session a name and select to turn Causality Tracking on:
Then I click on events and select the event I want. In this case it is just the module_start event and I’m not applying any filters:
Finally I configure my data storage. For this example I've added both a ring_buffer (memory) target and an event_file target allowing up to 10GB of storage. When I run this for real I may need to allow quite a bit more.
Now we’re done and I can just click OK.
Then I can go to the XE session in SSMS, right-click and select “Start Session”.
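If you'd rather script the session than click through the wizard, the T-SQL equivalent looks roughly like this – a sketch only, with an illustrative session name and file settings:

--Illustrative session name and file size; adjust for your environment
CREATE EVENT SESSION [TrackFunctionCallers] ON SERVER
ADD EVENT sqlserver.module_start
ADD TARGET package0.ring_buffer,
ADD TARGET package0.event_file (SET filename = N'TrackFunctionCallers', max_file_size = (10240)) --size in MB, i.e. ~10GB
WITH (TRACK_CAUSALITY = ON);

--Start the session (equivalent to right-click > "Start Session")
ALTER EVENT SESSION [TrackFunctionCallers] ON SERVER STATE = START;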
In my test database I’m now going to run a few stored procedures that execute my function a number of times. The idea is that I’ll then be able to use the XE data to work out which stored procedure called it the most – and therefore (in a real scenario) which one I might want to focus on fixing first.
Having run my tests I can then stop the XE session and export the results to a table. I show the method for that in a previous post:
I’ve exported the results into a new table called FunctionExecution in the same database. I show the structure of the table you get here:
The columns I’m interested in for this analysis are object_name (the name of the code object), attach_activity_id.guid (this comes from the causality tracking and will be the same for all objects in the same call stack) and attach_activity_id.seq (which shows the order in which objects were called).
In order to make the analysis quicker I also want to add a couple of indexes. The object_name column is an nvarchar(max), which can’t be used as an index key, so first I create a new column with a better data type and copy the values over:
ALTER TABLE dbo.FunctionExecution ADD ObjectName sysname;

UPDATE dbo.FunctionExecution
SET ObjectName = CAST([object_name] AS sysname);
Then I create my indexes:
CREATE INDEX IX_ObjectName ON dbo.FunctionExecution (ObjectName)
INCLUDE ([attach_activity_id.guid]);

CREATE INDEX IX_GUID ON dbo.FunctionExecution ([attach_activity_id.guid])
INCLUDE (ObjectName, [attach_activity_id.seq]);
Finally I am ready to run a query to get the results. I search for the name of the function I am interested in, use attach_activity_id.guid to find other calls in the same call stack, then use attach_activity_id.seq to find the top parent (with a value of 1):
SELECT parent.ObjectName, COUNT(*) AS NumberOfExecutionsOfFunction
FROM dbo.FunctionExecution func
INNER JOIN dbo.FunctionExecution parent
    ON func.[attach_activity_id.guid] = parent.[attach_activity_id.guid]
WHERE func.ObjectName = 'NonInlinable'
AND parent.[attach_activity_id.seq] = 1
GROUP BY parent.ObjectName;
Here are the results against the test procedures I executed while the XE session was running:
For this example we can see that I have a stored procedure called ProcedureCallingFunctionALotOfTimes that is responsible for most of the calls to my function – so in theory I’d focus my development efforts there first.
Use this method with caution though; as mentioned at the start there may be a lot of overhead in capturing all these events on your server – be it on your own head if that causes problems!
Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.
I had a request regarding how to identify unused indexes across all databases on a SQL Server instance (rather than just the current one). I’ve written this script before so thought I’d post it up here on my blog for the future use of myself and others.
The script uses the undocumented sp_MSforeachdb internal stored procedure to iterate through all the databases (excluding the system ones) and store the results of a standard unused indexes script into a temp table. The table can then be queried to analyse the results.
This script comes with a caveat: the sp_MSforeachdb procedure can be unreliable and on occasion skip databases, so you should check that the results include all the databases you are interested in. A simple count of distinct database names, compared against the number of user databases on your instance, should be sufficient.
If you are analysing a case where you have multiple instances of the same application database – e.g. one for each client – you can query the temp table to identify indexes that are unused across all clients (see the sketch after the script below).
Anyway here’s the script:
--Find unused indexes across all user databases
IF OBJECT_ID('tempdb..#UnusedIndexes') IS NOT NULL
    DROP TABLE #UnusedIndexes;

SELECT TOP 0
    DB_NAME() AS DatabaseName,
    s.name AS SchemaName,
    o.name AS TableName,
    i.name AS IndexName,
    iu.user_updates AS IndexUpdates,
    iu.user_lookups AS UserLookups,
    iu.user_seeks AS UserSeeks,
    iu.user_scans AS UserScans
INTO #UnusedIndexes
FROM sys.dm_db_index_usage_stats iu
INNER JOIN sys.objects o ON iu.object_id = o.object_id
INNER JOIN sys.schemas s ON o.schema_id = s.schema_id
INNER JOIN sys.indexes i ON o.object_id = i.object_id AND i.index_id = iu.index_id
WHERE i.is_primary_key = 0
AND i.is_unique = 0
AND iu.user_lookups = 0
AND iu.user_scans = 0
AND iu.user_seeks = 0;

EXEC sp_MSforeachdb '
USE [?]
IF DB_ID() > 4
BEGIN
    INSERT INTO #UnusedIndexes
    SELECT
        DB_NAME() AS DatabaseName,
        s.name AS SchemaName,
        o.name AS TableName,
        i.name AS IndexName,
        iu.user_updates AS IndexUpdates,
        iu.user_lookups AS UserLookups,
        iu.user_seeks AS UserSeeks,
        iu.user_scans AS UserScans
    FROM sys.dm_db_index_usage_stats iu
    INNER JOIN sys.objects o ON iu.object_id = o.object_id
    INNER JOIN sys.schemas s ON o.schema_id = s.schema_id
    INNER JOIN sys.indexes i ON o.object_id = i.object_id AND i.index_id = iu.index_id
    WHERE i.is_primary_key = 0
    AND i.is_unique = 0
    AND iu.user_lookups = 0
    AND iu.user_scans = 0
    AND iu.user_seeks = 0;
END'

SELECT * FROM #UnusedIndexes;

--DROP TABLE #UnusedIndexes;
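Picking up the caveats above, here are a couple of quick follow-up queries against the temp table – sketches only:

--Sanity check: compare databases captured against user databases on the instance
--(a database with no unused indexes at all will legitimately be missing)
SELECT COUNT(DISTINCT DatabaseName) AS DatabasesCaptured FROM #UnusedIndexes;
SELECT COUNT(*) AS UserDatabases FROM sys.databases WHERE database_id > 4;

--Multi-client scenario: indexes that show as unused in every captured database
SELECT SchemaName, TableName, IndexName
FROM #UnusedIndexes
GROUP BY SchemaName, TableName, IndexName
HAVING COUNT(DISTINCT DatabaseName) = (SELECT COUNT(DISTINCT DatabaseName) FROM #UnusedIndexes);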
Note: partially in response to Jeff’s comment below. Proceed with caution before dropping any indexes the script identifies. Usage stats are only kept since the last restart, so if that was recent you may have required indexes that get listed. Also think about whether you have any occasional processes – such as month, quarter or year-end processing – that will not have executed since that last restart and may require indexes.
Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.
Another neat little feature in SQL Server 2022 is Query Store Hints. This is the ability to apply a query hint through Query Store rather than having to modify existing code or fiddle around with plan guides.
Query hints are a way to influence optimizer behaviour towards generating desired execution plans for a given query. The word “hint” is a bit of a misnomer as usually they mandate what you wish to happen.
I’m not a big fan of leaping straight away to use query hints as they force the behaviour rather than letting the optimizer work out the best plan for your query. This means you may implement hints that you see as desirable for the current issue you are trying to fix, but they create problems for different executions of the query, or if the data profile changes. In general it’s better to see if you can rewrite the SQL code to achieve what you want, but sometimes that’s not possible and a hint ends up being the best way to go – or at least a good short term solution.
The great thing about applying hints through query store is that it can be done without altering application code. That means the DBA can apply a temporary hint while waiting for code to be fixed – or in extremis use them against third party applications where there is not the ability to get code changed.
Applying a hint is simple as long as the query has a plan stored in query store. First you need to obtain the query_id for the query. One method for doing that is to use the following query I got from this stackexchange post:
SELECT qsq.query_id, qsq.last_execution_time, qsqt.query_sql_text
FROM sys.query_store_query qsq
INNER JOIN sys.query_store_query_text qsqt
    ON qsq.query_text_id = qsqt.query_text_id
WHERE qsqt.query_sql_text LIKE '%your query text%';
Armed with the query_id, you can then apply the hint using the sys.sp_query_store_set_hints stored procedure. For example, if my query_id was 100 and I want to apply the RECOMPILE hint then I could do that with this SQL:
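That would look something like the following – a sketch based on my understanding of the procedure’s parameters, so check the parameter names against the documentation for your build:

--Apply an OPTION (RECOMPILE) hint to the query with query_id 100
EXEC sys.sp_query_store_set_hints
    @query_id = 100,
    @query_hints = N'OPTION (RECOMPILE)';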
And that’s it – it’s that simple. Thanks for reading!
Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.
I’ve been reading Bob Ward’s book “SQL Server 2022 Revealed” and one of the new features in Intelligent Query Processing (IQP) jumped out at me. Not because it’s going to rock my world but rather it reminded me of an idea I had when I first started playing with Query Store when it came out.
I thought it would be cool if the execution plans cached in Query Store could be used to populate the plan cache after an instance restart, removing the need for all queries to be recompiled. I even raised a suggestion on Microsoft Connect for it to be considered as a new feature.
I won’t claim that anyone at Microsoft saw my post and thought “What a great idea!”, but by coincidence they have implemented in SQL 2022 something similar that works in specific scenarios – Optimized Plan Forcing.
The underlying idea is to store – in Query Store – something called a compilation script that can be used to significantly shorten the compile time for a given query. Any query that runs when Query Store is enabled will get a compilation script stored if it has a significant compile time (the actual threshold is not disclosed). We can see which queries have compilation scripts by looking at the has_compile_replay_script column in the sys.query_store_plan view.
If the query plan is forced in Query Store, then the next time it needs to be compiled the compilation script will be used to build the plan instead of going through the usual query optimization process, greatly reducing the compile time.
Of course this only works for forced plans, but that makes sense, as for other queries we often want a recompile to come up with a new and hopefully better plan. Still, it’s an interesting feature. Bob talks about how it was designed in response to a pattern of performance problems they were calling a “compile storm” – where the plan cache was emptied and a lot of queries then had to be compiled concurrently, causing CPU spikes. By shortening the compile time for some queries – and of course we DBAs can use Query Store to force as many plans as we like (with caution) – it’s hoped to reduce the occurrence of such events.
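For example, a quick way to list plans with a stored compilation script – a sketch only – is:

--Plans in Query Store that have a compilation replay script stored
SELECT plan_id, query_id, is_forced_plan, has_compile_replay_script
FROM sys.query_store_plan
WHERE has_compile_replay_script = 1;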
I’m not going to go into trying to demonstrate this in action, but Bob has a full demo in his book which you can buy here:
The feature is turned on by default for databases on SQL Server 2022 as long as they have query store enabled – even if they’re not on the latest compatibility level. You can however turn it off for a given database using:
ALTER DATABASE SCOPED CONFIGURATION SET OPTIMIZED_PLAN_FORCING = OFF;
You can also turn it off at the level of an individual query, either using the query hint DISABLE_OPTIMIZED_PLAN_FORCING, or – if you force the plan using the sys.sp_query_store_force_plan system stored procedure – via the @disable_optimized_plan_forcing parameter.
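As a sketch of the latter option (the query_id and plan_id values here are purely illustrative):

--Force a plan but opt this query out of optimized plan forcing
EXEC sys.sp_query_store_force_plan
    @query_id = 100,
    @plan_id = 1234,
    @disable_optimized_plan_forcing = 1;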
Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.
Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deeper understanding of how Statistics work really helps when you are performance tuning.
I talked a few posts ago about Automatic Sample Sizes for Statistics Updates. From SQL 2016 CU4 we’ve been able to override that. You can manually update a statistics object specifying a given sample rate, and specify that that sample rate is what will be used for any subsequent automatic updates on the same object.
You can’t change the default sample rate globally, but this functionality is actually more useful. It means that if you find that you have a particular statistics object where the automatic sampling rate is too low, and that has a negative effect on query performance, then you can choose a sampling rate specific for that object.
Let’s take a quick look at doing this in practice.
Here’s some reused SQL from the previous post to create a table and throw in a million rows:
--Create table for Testing
CREATE TABLE dbo.Test(
Id INT IDENTITY(1,1) CONSTRAINT PK_Test PRIMARY KEY CLUSTERED,
TextValue VARCHAR(20) NULL
);
--Create index on TextValue
CREATE INDEX IX_Test_TextValue ON dbo.Test(TextValue);
--Insert a bunch of rows
INSERT INTO dbo.Test(TextValue)
SELECT TOP 1000000 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d;
I haven’t updated the stats yet. As we can see if I view them:
--View the statistics
DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER;
The stats don’t have any information yet, but there is still one thing to notice. See that final column, Persisted Sample Percent? That was added to support this functionality.
The stats will only get updated by the auto stats mechanism when I run a query against the table and the optimiser is interested in them but sees they’re stale. From zero to a million rows should be stale enough. Let’s run a quick query:
--Random Query
SELECT *
FROM dbo.Test
WHERE TextValue = 'not blah';
And view the stats again:
We can see the stats got updated and the table was big enough that SQL decided to sample rather than scan the whole index. About 42% of the rows got sampled. Persisted Sample Percent is set to zero, i.e. it is not set.
Let’s say that I want to make sure that this statistics object always updates with a full scan.
What I can now do is manually update the statistics and specify that the sample percentage should be persisted:
--Update the stats and persist the sample rate
UPDATE STATISTICS dbo.Test IX_Test_TextValue
WITH FULLSCAN, PERSIST_SAMPLE_PERCENT = ON;
Let’s check that did what we wanted by viewing the stats:
Great, we can see all the rows were sampled and the Persisted Sample Percent is now set to 100.
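If you’d rather check this with a query than with DBCC SHOW_STATISTICS, a sketch like the following should work – it assumes the persisted_sample_percent column of sys.dm_db_stats_properties, which was added alongside this feature:

--Check sampling details, including the persisted sample percent, for stats on dbo.Test
SELECT s.name AS StatName,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.persisted_sample_percent
FROM sys.stats s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE s.object_id = OBJECT_ID('dbo.Test');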
Now let’s see what happens when another auto update kicks in. I insert another million rows with the same SQL from above. Then I run the select query again. The optimiser notices the statistics are now stale and so it updates them and recompiles the plan for the query. We can see this when we check the stats yet again:
SQL has done exactly what we wanted, it performed the auto stats update in the background and it used the 100% persisted sample rate.
Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.
Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deep understanding of how the SQL Server Optimizer interacts with Statistics really helps when you are performance tuning
One thing that can be useful when looking at an execution plan is to understand what statistics objects the optimizer used to come up with the plan. In this post we look at how that can be achieved using the undocumented trace flag 8666, which saves internal debugging information into the plan XML – including details of the Statistics objects used.
There are actually a few other trace flags that do similar things but this seems the simplest and the one that works across the most versions of SQL Server. It looks like it was introduced in SQL 2008 and works on all versions up to and including SQL 2016 – but it doesn’t work on SQL 2005.
Note that this technique only works for plans generated once the trace flag is on, so you can’t view the additional information for existing plans in your cache:
Here’s a quick example using it against the AdventureWorks2012 database. You need to make sure you’ve selected the option to show the actual plan, then run the query as below:
--Turn the trace flag on
DBCC traceon (8666);
--Run the query
SELECT *
FROM Person.Person
WHERE LastName LIKE 'Smith%';
--Turn the trace flag off again
DBCC traceoff (8666);
This query produces the following plan:
If you right-click over the SELECT operator and select properties you see the following:
You’ll notice an extra field “Internal Debugging Information” which contains a bunch of unformatted XML. Rather than grabbing that string and formatting it to make it readable, you can right-click back on the plan itself and select “Show Execution Plan XML…” to view the same information in a more friendly format.
If you scroll down through the XML generated you will get to the following section:
Each ModTrackingInfo node displays the information about one statistics object that has been referenced. The wszStatName shows us the name of the Statistics object. We have two in this case, _WA_Sys_00000007_693CA210 and IX_Person_LastName_FirstName_MiddleName. The first is an auto-generated one on the LastName column (I happen to know this was generated before I created the index referenced). Interestingly SQL has looked at both – even though they contain pretty much the same info and one isn’t really required.
Why is this useful?
Well, apart from the interesting aspect of seeing some of what goes on in the background when SQL decides what plan to use for your query, it is often the case with poorly performing queries that bad cardinality estimation is involved. Sometimes this might be because the statistics aren’t accurate and would benefit from being updated. You can use this technique to see which statistics objects are being used for the estimation (where it is not obvious), then look at the objects themselves and see whether they might want refreshing, or whether there may be other issues with your data – such as a skewed distribution – that make estimates difficult.
I see growing opinion in the SQL world that the refreshing of statistics is often more important than rebuilding indexes to reduce fragmentation – the latter operation will also refresh the statistics but is a lot more resource intensive. You also will commonly have statistics on columns that are not indexed, and the threshold for them to be refreshed can be pretty large when you have a lot of data. You can see more information about manually refreshing statistics in my previous posts on the subject:
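If you want a quick look at how stale the statistics on a table are before deciding whether to refresh them, something like this works (shown against the table from the example above):

--When were the statistics on Person.Person last updated?
SELECT s.name AS StatName,
       STATS_DATE(s.object_id, s.stats_id) AS LastUpdated
FROM sys.stats s
WHERE s.object_id = OBJECT_ID('Person.Person');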
Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.
Statistics are vital in allowing SQL Server to execute your queries in the most performant manner. Having a deep understanding of how Statistics work really helps when you are performance tuning
I mentioned in my previous post about manually updating statistics that you can specify whether they’re updated using a full scan, or you can specify an amount of data to sample, either a percentage of the table size, or a fixed number of rows. You can also choose not to specify this, and SQL Server will decide for you whether to do a full scan, or to sample a certain amount of data.
I thought it would be interesting to look at what the sample sizes are that SQL will choose to use, depending on the amount of data in your table. Note that this occurs if you update statistics without specifying how they should be sampled as below:
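In other words, an update of this form – using the test table and index created in the script further down – with no sample clause specified:

--Update statistics, letting SQL Server decide the sample size
UPDATE STATISTICS dbo.Test IX_Test_TextValue;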
This is also the behaviour you will get when SQL updates statistics through the auto-stats mechanism. The fact that auto-stats may sample at a lower rate than is optimal for a given table and the queries against it is another reason you may choose to perform manual statistics updates.
To test this, I created a table and progressively pumped data in. Then after I inserted each batch of rows:
Ran a stats update, capturing the CPU time taken
Checked the statistics to see what sample size was used in the update
Checked the size of the index
Here’s some of the code I used for the test:
--Drop table if exists
IF (OBJECT_ID('dbo.Test')) IS NOT NULL DROP TABLE dbo.Test;
--Create table for Testing
CREATE TABLE dbo.Test(
Id INT IDENTITY(1,1) CONSTRAINT PK_Test PRIMARY KEY CLUSTERED,
TextValue VARCHAR(20) NULL
);
--Create index on TextValue
CREATE INDEX IX_Test_TextValue ON dbo.Test(TextValue);
--Insert a bunch of rows
INSERT INTO dbo.Test(TextValue)
SELECT TOP 100000 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d;
--Update statistics without specifying how many rows to sample
SET STATISTICS TIME ON;
UPDATE STATISTICS dbo.Test IX_Test_TextValue;
SET STATISTICS TIME OFF;
--View the statistics
DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER;
--Check the size of the index
SELECT
i.name AS IndexName,
SUM(s.used_page_count) AS Pages
FROM sys.dm_db_partition_stats AS s
JOIN sys.indexes AS i
ON s.[object_id] = i.[object_id] AND s.index_id = i.index_id
WHERE i.name = 'IX_Test_TextValue'
GROUP BY i.name
The results of my testing are shown in the below table:
You can see that we have a full sample being taken for the statistics updates up to 400,000 records (896 pages), but that once the table size hits 500,000 rows sampling is happening. If you look at the number of pages you will see we now have over 1,000 pages – about 8MB of data – which is the threshold at which sampling kicks in.
I wasn’t able to find a nice neat formula to determine the sampling rate based on table size, but if we look at the above figures there are still some insights to be gained. The main one is that even as we double the table size, the number of rows sampled doesn’t go up by much. For instance, from 500,000 to a million rows, only 10,000 more rows are sampled. This also means that even for pretty large tables the update doesn’t take long – another reason why it’s worth leaving auto stats updates enabled and running synchronously with queries – they’re generally not going to take that long.
Another insight is that the percentage of rows sampled drops off very quickly. As the sample size doesn’t really increase much even when the table size doubles, the percentage sampled almost halves each time.
Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.
Having up to date statistics is vital for getting the best performance out of your queries. Even though SQL Server automatically updates statistics in the background for you (When do statistics get updated?), you may find there are times when you want to manage updating them yourself.
You may have large tables and find that the interval between the automatic updates is too big and is resulting in sub-optimal query plans.
You might need timely updates to a specific object – maybe as part of an ETL process to make sure that statistics are up to date after a particular part of the process, perhaps after a daily load into a table.
You may find that the automatic updates look at too small a sample size and you need to scan more of the table to enable accurate estimates for your queries.
My previous post on the Ascending Key problem demonstrated a common issue where the first of these scenarios could be affecting you. We’ll look at examples of the other scenarios in subsequent posts.
For now though, let’s just look at how you go about updating statistics.
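The basic form is as follows – you can update all the statistics on a table, or name an individual statistics object (the auto-created stats name here is the same one used in the examples below):

--Update all statistics objects on a table
UPDATE STATISTICS dbo.Test;

--Update a single named statistics object
UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2;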
The other thing you are likely to want to specify is whether the statistics should be updated using a full scan of the table, or by looking at just a sample of the rows. In the above examples we didn’t specify this, so SQL Server decides for us. In general, sampling (as opposed to a full scan) kicks in when we have about 8MB of data in the table (or about 1,000 pages).
If you want to specify a full scan the syntax is as follows:
UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH FULLSCAN;
If you want the statistics update to use sampling (more on how this works in subsequent posts) then you can choose to specify a percentage of the total table to be sampled:
UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH SAMPLE 10 PERCENT;
Or you can specify a fixed number of rows:
UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH SAMPLE 10000 ROWS;
You might want to use a sample, as once your tables get large full scans can take a little time. Equally, if you’re updating statistics at a quieter time (e.g. overnight) you may feel you can afford the extra time for the full scans. Here are some comparison figures I produced on my local machine showing how long full scans take. Obviously this will change depending on your hardware and other factors, including how much of the table is already in memory:
You can see however that the time taken pretty much scales linearly as the table size increases.
One thing to be aware of is parallelism. A full scan can run as a parallel operation on your server which can speed it up considerably. When you update statistics using a sample however this can only run single-threaded unless you’re on SQL Server 2016 (or higher). Sampling data to build statistics in parallel was one of the many excellent little improvements in 2016.
sp_UpdateStats
This is another method you might use for manually updating statistics (perhaps as part of a scheduled maintenance job). This system stored procedure can be used for updating all of the statistics objects in a database:
USE Test;
EXEC sp_UpdateStats;
This stored procedure iterates through your database using a WHILE loop and executes the UPDATE STATISTICS command as it goes. One nifty thing about using this procedure is that it only updates statistics objects where rows have changed, so you don’t have any overhead for refreshing statistics where the underlying data hasn’t been modified. You can see this from this extract of the output of the stored procedure:
Updating [dbo].[TestMemory]
[PK__TestMemo__3214EC070D799003], update is not necessary…
0 index(es)/statistic(s) have been updated, 1 did not require update.
Updating [dbo].[TestMemory2]
[PK__TestMemo__3214EC07D3DC52DE], update is not necessary…
0 index(es)/statistic(s) have been updated, 1 did not require update.
Updating [dbo].[Test]
[PK_Test] has been updated…
[IX_Test_TextValue] has been updated…
2 index(es)/statistic(s) have been updated, 0 did not require update.
Ola Hallengren
Of course, if you’re looking to implement statistics update as part of regular maintenance, then you should definitely be considering using Ola Hallengren’s maintenance solution. Ola maintains a great solution for managing database backups and integrity checks as well index and statistics maintenance, and he shares it with the SQL community for free.
Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.
Statistics are vitally important in allowing SQL Server to find the most efficient way to execute your queries. In this post we learn more about them, what they are and how they are used.
Cardinality Estimation
Cardinality is a term originally from Mathematics, generally defined as “The number of objects in a given set or grouping”. In SQL we’re continually dealing with sets so this becomes a very relevant topic, which in our context is just the “number of rows”.
When you have a query across multiple tables there are many ways in which SQL Server could decide to physically go about getting you the results you want. It could query and join the tables in any order, and it could use different methods for matching records from the various tables you have joined together. It also needs to know how much memory to allocate for the operation – and to do that it needs to have an idea of the amount of data generated at each stage of processing.
A lot of this requires cardinality estimation, and SQL Server uses something called Statistics objects to perform that calculation.
Let’s look at a simple example:

SELECT *
FROM Person.Person p
INNER JOIN Person.[Address] a
    ON p.AddressId = a.AddressId
WHERE p.LastName = 'Smith'
AND a.City = 'Bristol';
When it comes to gathering the results for this query there are a number of ways the database engine could go about it. For instance:
a) It could find all the records in the Person table with a LastName of Smith, look each of their addresses up and return only the ones who live in Bristol.
b) It could find all the Addresses in Bristol, look up the people associated with each address and return only the ones called Smith.
c) It could grab the set of people called Smith from the People table, grab all the addresses in Bristol, and only finally match up the records between those two sets.
Which of those operations is going to be most efficient depends very much on the number of records returned from each table. Let’s say that we have a million people called Smith, but there’s only one address in our whole database that is in Bristol (and let’s say that address does actually belong to someone called Smith).
In the first method above I would grab the million Smiths and then look their address up one by one in the address table until I found the one that lived in Bristol.
If I used method b) though, I would find the one matching address first, and then I would simply look up the owner of that address. Clearly in this rather contrived example, that’s going to be a lot quicker. So if SQL knows ahead of time roughly how many records to expect from each part of the query, hopefully it can make a good decision about how to get the data.
But how can it work out how many rows will be returned without actually running the query?
Statistics
That’s where statistics objects come in. SQL Server maintains in the background data that equates to a histogram showing the distribution of the data in certain columns within a table. It does this any time you create an index – statistics will be generated on the columns the index is defined against, but it also does it any time it determines that it would be useful. So if SQL encounters a Where clause on Person.LastName – and that column isn’t involved in a useful index, SQL is likely to generate a statistics object to tell it about the distribution of data in that column.
I say “likely to” because it actually depends on the settings of your SQL instance. Server configuration is beyond the scope of this post but suffice to say you can let SQL automatically create Statistics objects – or not. You can let it automatically update them when the data has changed by more than a given threshold – or not. And you can specify whether updates to statistics should happen asynchronously or synchronously – i.e. in the latter case if your query determines that statistics needs updating then it will kick that off and wait until the update is complete before processing the query.
It’s generally recommended that auto creation and updating is on, and async updating is off.
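If you want to check or change these settings for a database, they look like this (the database name is just for illustration):

--Recommended configuration: auto create and update on, asynchronous update off
ALTER DATABASE AdventureWorks2012 SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE AdventureWorks2012 SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE AdventureWorks2012 SET AUTO_UPDATE_STATISTICS_ASYNC OFF;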
Viewing Statistics Objects
Let’s have a look at some actual statistics and see what they hold. There are a couple of ways of doing this, the first is through SSMS. If you look under a table in the object browser you will see a Statistics folder which holds any statistics objects relating to that table:
In the above example you can see some that have friendly names, these are Statistics that are related to an actual index that has been defined on the table and they have the same name as the Index – e.g. IX_Address_StateProvinceId.
You’ll also see we have some prefixed _WA_Sys and then with some random numbers following. These are statistics objects that SQL has created automatically on columns that aren’t indexed – or at least they weren’t indexed at the time the Statistics objects were created.
You can open these up with a double-click and see what’s inside:
This is the General tab you open up to. You’ll see it tells you what table the Statistics are for and what column(s). There are options for adding columns and changing the order – but you never really need to do this – as well as information to tell you when the statistics were last updated, and a check box if you want to update them now.
In the details tab there’s a lot more info:
I don’t find this the easiest display format to work with though, so rather than delving into what everything means here let’s look at the other way you can view statistics, which is by running the following command:
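That command is DBCC SHOW_STATISTICS – shown here against the Address table from this example (the auto-created statistics name is illustrative; take yours from the Statistics folder in SSMS):

DBCC SHOW_STATISTICS('Person.Address', _WA_Sys_00000004_164452B1);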
The format is straightforward, you just specify the table you are interested in the Statistics for, and the actual name of the Statistics object you want. You can see the same information as if you double-clicked on it, but the results are output in the results pane like any other query and are (I think) a lot easier to read. Allegedly there will soon be a third way in SQL Server to view Statistics as DBCC commands are considered a bit “clunky” – but we don’t know what that will look like yet.
The command outputs three resultsets:
This post is just an introduction to statistics – and generally you don’t need to know that much, it’s just handy to understand the basics. So let’s just run over the key bits of information you can see above:
First of all, the first recordset – otherwise known as the…
Stats Header
Rows – is the number of rows in your table
Rows Sampled – this is how many rows were sampled to generate the statistics. SQL can generate or update statistics using sampling rather than reading all the rows. In this case you’ll see it did actually read the whole table.
Steps – If you imagine the statistics as a bar chart – this is the number of bars on the chart. Statistics objects have a maximum of 200 steps so if you have more distinct values in your column than that they will be grouped into steps.
Density – This is supposed to be the probability of a row having a particular value (calculated as 1 / Number of Distinct values in column). According to books online “This Density value is not used by the query optimizer and is displayed for backward compatibility with versions before SQL Server 2008.” I am using SQL 2012, and this number is just plain incorrect so don’t use it…
Recordset Number 2: The Density Vector
All Density – this is the accurate version of the Density statistic described above. So your probability of a given row having a specific value is about 0.0017. That’s a little less than one in 500. I happen to know there are 575 different Cities in the table so that makes sense. Sometimes SQL will use this value to form a plan – if it knows you’re going to search this table for a specific City and it doesn’t know that City when it makes the plan, then it could guess that about 1/500th of the rows will match your criteria.
Average Length – Is what it says on the can. The average length of data in this column.
Columns – The names of any columns measured in this statistics object. You can have statistics across multiple columns but I’m not going to cover that in this post. In this case it tells us these statistics are based on the “City” column.
Recordset Number 3: The Histogram
This last recordset shows the distribution of the data, and is what you could effectively use to draw a graph of the relative frequencies of different groups of values. Each row represents a step – or bar on the bar chart – and as mentioned above there can be a maximum of 200 steps. So you can see, a statistics object is quite lightweight, even for a massive table.
RANGE_HI_KEY – The upper limit of each step, so each step contains all the values bigger than the RANGE_HI_KEY of the last step, right up to and including this value.
RANGE_ROWS – This is how many rows in the table fall in this range – not including the number that match the RANGE_HI_KEY itself.
EQ_ROWS – The number of rows equal to the HI_KEY
DISTINCT_RANGE_ROWS – The number of different values in the range that there is data for (excluding the HI_KEY).
AVERAGE_RANGE_ROWS – The average number of rows for a given value within the range.
That’s a whistle-stop tour of the Statistics SQL Server holds on your data.
The algorithms that SQL then uses to calculate the number of rows for a given part of your query are pretty transparent when there’s just one column involved. If we look at the above example and let’s say you wanted to look up the rows where the City is “Abingdon” – the statistics tell us there is 1 matching row and that’s the figure SQL will use for cardinality estimation. Where a value is within a range then it will use a calculation based on the AVERAGE_RANGE_ROWS.
The main takeaway from this should just be to understand the extent – and limitations – of the information about the distribution of your data that SQL holds in the background.
If, when you’re tuning queries, you notice that the estimated row counts don’t match the actual, then this could be encouraging SQL to form a bad plan for the query. In these cases you might want to investigate what’s going on with the statistics.
Maybe your query is written in a way that it can’t use statistics effectively, one example of this can be where you store constant values in variables, then query using that variable in a WHERE clause. SQL will then optimise based on the average, rather than on your actual value.
Maybe the plan is based on one data value that has a very different cardinality to the one currently being queried. For instance when you first run a stored procedure, the plan is formed based on the parameters passed. Those parameters could have a cardinality that is quite different to those used in later executions.
Maybe the statistics are out of date and need refreshing. Stats get updated when approximately 20% of the data in the table changes, for a large table this can be a big threshold, so current statistics may not always hold good information about the data you are querying for.
Or maybe SQL is doing as good a job as it can with the information it has at its disposal. You might need to work out how you can give it a little extra help.
Here are some other recent posts about Statistics you may find useful:
Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.
I find this is a question that comes up again and again. What estimate for the number of rows returned does SQL Server use if you’re selecting from a column where there are no statistics available?
There are a few different algorithms used depending on how you’re querying the table. In this post we’ll look at where we have a predicate looking for a fixed value.
(If you want the short answer, it’s the fourth root of n cubed before SQL 2014 and the square root of n after that – where n is the number of rows)
This scenario can occur if you have AUTO CREATE STATISTICS turned off for your database, which we don’t recommend you do, but which you might choose to do anyway, and if you query a table with a predicate against a column with no index defined against it.
Let’s look at an example querying against the AdventureWorks2012 database. I’ll start by looking at SQL Server 2012, then we’ll see how it behaves in later versions.
I’ve taken the following preparatory steps:
Set AUTO CREATE STATISTICS OFF for the database
Remove the Index on the LastName column for the Person.Person table
Removed any ad-hoc statistics that existed against the table
Then I run a simple query, with the Actual Execution Plan turned on:
SELECT *
FROM Person.Person p
WHERE p.LastName = 'Fox';
I only get one result out as there is only one Fox (Ms. Dorothy J.). Let’s look at the execution plan:
A clustered index scan as we might expect as I’ve removed any useful indexes from the table. You’ll notice there is a warning. If we view the tooltip you’ll see SQL warns us about the lack of statistics:
If we look at the estimated and actual row-counts we’ll see how that has affected us:
In the absence of any useful information – it knows the number of rows in the table but that is about it – SQL has estimated that there will be 1680 Foxes in the table. A bit of playing shows that we get the same estimate whatever value we search for.
If I turn AUTO CREATE STATISTICS on and run the query again, then SQL generates a Statistics object against the LastName column and comes up with an estimate of 2.3 rows – which is a lot closer.
This matters a lot once you start running more complicated queries. The incorrect estimate is likely to affect the choice of plan that the optimizer makes, and may also affect the amount of memory it requests in order to run the query. Let’s look at a quick example of how the plan changes if we join the above query to another table.
First, without statistics (so I have to turn AUTO CREATE off again, and remove the statistics that got created):
SET STATISTICS IO ON;
SELECT e.EmailAddress
FROM Person.Person p
INNER JOIN Person.EmailAddress e
ON p.BusinessEntityID = e.BusinessEntityID
WHERE LastName = 'Fox';
Here’s the execution plan:
You can see I’ve got a Merge Join as SQL thinks I’m expecting 1680 rows from the top table. A Nested Loops join would generally be better when I only expect one or two rows from that table.
I’ve also captured the IO so I can see how expensive the query was:
There’s not much difference in the overall IO, but you can see the Reads for the EmailAddress table have dropped from 10 to 2 due to the change from the Scan to the Seek. If the table was a lot bigger then we could see a large difference here.
So where does that estimate come from?
I thought I’d have a play and see if I could work out how SQL decided on that estimate of 1680 rows. I did some googling and found a suggestion that it might be a straight 9% of the total number of rows in the table, but that doesn’t quite add up and when I compared the same query pattern against a few tables I found I got a different ratio depending on the amount of rows in the table.
So I pumped rows incrementally into a fresh table and looked at the estimate and what the ratio was as the number of rows increased. Here’s my SQL for that:
--Create a Horrible Heap for my testing
CREATE TABLE dbo.TestStats(TestVal INT NOT NULL, TestText VARCHAR(255) NULL);
--Insert a bunch of rows using dodgy old-style cross joins
INSERT INTO TestStats
SELECT TOP 1 --Amount of rows I'm going to insert
1 , 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d, sys.objects e;
--Clear the plan cache so SQL generates a new estimate
DBCC freeproccache;
--Query the table, then I go check the execution plan generated
--for the estimated number of rows
SELECT * FROM dbo.TestStats
WHERE TestVal = 1;
(One thing to note was that I got the same answers whether I was querying the text column or the integer column – SQL seems to use the same algorithm for both.)
I started to notice a pattern quite quickly, that the ratio halved when the number of rows went up by a factor of 16. I then restarted my test, targeting my row-counts to be where the estimated number of rows would be a nice round number. You can see that in the table below:
I then attempted to work out a formula for that. Rather than take you through the shoddy process of mathematics that led me to an answer, I’ll just tell you that the formula came out as:

e = n^(3/4)

(i.e. the fourth root of n cubed, as per the short answer at the top of the post)
Where e is the estimated number of rows for a given predicate value, and n is the total number of rows in the table. I checked that against the full set of results I’d gathered and it held true across all values of n I’d tested.
To check it finally against my original query – the Person.Person table had 19,972 rows. I put that through the calculator with the formula and get 1680.027. If we look back at the original estimate you’ll see that SQL stated 1680.03 – so that is all good.
As I mentioned earlier I was using SQL Server 2012 for this test, and a new Cardinality Estimator came into effect in SQL 2014. So I thought I’d run the test again with SQL 2016 and see if the results changed:
We can see the estimated rows drop off a lot quicker here. Clearly Microsoft decided to lower the estimate – it is now simply the square root of the total number of rows (e = n^(1/2)).
Hopefully you’re not in the scenario where you regularly have queries running without the appropriate statistics to support them. The above comparison shows us, though, that if you have such a query its behaviour could change dramatically if you are on an older version of SQL Server and you upgrade. It could become better or it could become a lot worse.
There are a lot of changes like this that came in with the new version of the Cardinality Estimator in 2014. Places where underlying assumptions have been adjusted to make better guesses about the number of rows that will be returned by an operator. But they are still guesses based on the same information – there is no new data being captured in the Statistics to better inform the process. Of course Microsoft has made these changes to try and better model data out in the wild – but they are still fixed assumptions, which means sometimes they will be better and sometimes they will be worse.
One thing I should re-iterate is that these formulae we’ve discovered above are for a fairly specific querying pattern. There’s no guarantee that the calculation will be the same for a similar – but different query. It might be interesting to explore that further in a later post.
Also there may be other information in your database – such as constraints – that SQL can use to educate its guesses.
The main takeaway from all of this though, should of course be:
MAKE SURE AUTO CREATE STATISTICS IS TURNED ON FOR YOUR DATABASES!
Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.