Manually updating Statistics

Having up to date statistics is vital for getting the best performance out of your queries. Even though SQL Server automatically updates statistics in the background for you (When do statistics get updated?), you may find there are times when you want to manage updating them yourself.

  • You may have large tables and find that the interval between the automatic updates is too big and is resulting in sub-optimal query plans.
  • You might need timely updates to a specific object – maybe as part of an ETL process to make sure that statistics are up to date after a particular part of the process, perhaps after a daily load into a table.
  • You may find that the automatic updates look at too small a sample size and you need to scan more of the table to enable accurate estimates for your queries.

My previous post on the Ascending Key problem demonstrated a common issue where the first of these scenarios could be affecting you. We’ll look at examples of the other scenarios in subsequent posts.

For now though, let’s just look at how you go about updating statistics.

UPDATE STATISTICS

At the heart of all the methods we’ll look at is the UPDATE STATISTICS command. There are a lot of options for using this command, but we’ll just focus on the ones you’re most likely to use. For full documentation here is the official reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/update-statistics-transact-sql

You can use this at the table level in which case all statistics objects belonging to that table will be updated:

UPDATE STATISTICS dbo.Test;

Or you can specify the specific statistics object you want to update:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2;

FULL SCAN or SAMPLE

The other thing you may be likely to want to specify is whether the statistics should be updated using a full scan of the table, or just be looking at a sample of the rows. In the above examples we didn’t specify this so SQL Server will decide for us. In general sampling (as opposed to full scans) kicks in when we have about 8MB of data in the table (or about 1000 pages).

If you want to specify a full scan the syntax is as follows:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH FULLSCAN;

If you want the statistics update to use sampling (more on how this works in subsequent posts) then you can choose to specify a percentage of the total table to be sampled:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH SAMPLE 10 PERCENT;

Or you can specify a fixed number of rows:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH SAMPLE 10000 ROWS;

You might want to use a sample as once your tables get large full scans can take a little time. Equally though if you’re updating statistics in a quieter time (e.g. overnight) you may feel you can afford the extra time for the full scans. Here’s some comparison figures I produced on my local machine showing how long full scans take. Obviously this will change depending on your hardware and other factors including how much of the table is already in memory:

FullScanTimes

You can see however that the time taken pretty much scales linearly as the table size increases.

One thing to be aware of is parallelism. A full scan can run as a parallel operation on your server which can speed it up considerably. When you update statistics using a sample however this can only run single-threaded unless you’re on SQL Server 2016 (or higher). Sampling data to build statistics in parallel was one of the many excellent little improvements in 2016.

sp_UpdateStats

This is another method you might use for manually updating statistics (perhaps as part of a scheduled maintenance job). This system stored procedure can be used for updating all of the statistics objects in a database:

USE Test;
EXEC sp_UpdateStats;

This stored procedure iterates through your database using a WHILE loop and executes the UPDATE STATISTICS command as it goes. One nifty thing about using this procedure is that it only updates statistics objects where rows have changed, so you don’t have any overhead for refreshing statistics where the underlying data hasn’t been modified. You can see this from this extract of the output of the stored procedure:

Updating [dbo].[TestMemory]
[PK__TestMemo__3214EC070D799003], update is not necessary…
0 index(es)/statistic(s) have been updated, 1 did not require update.

Updating [dbo].[TestMemory2]
[PK__TestMemo__3214EC07D3DC52DE], update is not necessary…
0 index(es)/statistic(s) have been updated, 1 did not require update.

Updating [dbo].[Test]
[PK_Test] has been updated…
[IX_Test_TextValue] has been updated…
2 index(es)/statistic(s) have been updated, 0 did not require update.

Ola Hallengren

Of course, if you’re looking to implement statistics update as part of regular maintenance, then you should definitely be considering using Ola Hallengren’s maintenance solution. Ola maintains a great solution for managing database backups and integrity checks as well index and statistics maintenance, and he shares it with the SQL community for free.

You can find full details of Ola’s solution and download it here:
https://ola.hallengren.com/

Book Review: Learn dbatools in a Month of Lunches

If you work in the world of SQL Server you’ve almost certainly heard of dbatools. For those who haven’t, it’s an open source PowerShell module for automating literally hundreds of tasks on your database instances.

What the rest of you may or may not know, is that the creators of dbatools have been working on a book to make it easy to get started with the tool – Learn dbatools in a Month of Lunches

If you’re saying to yourself, it all sounds good, but I don’t really know PowerShell that well, I’m not confident how I work with modules, open source and otherwise, it all seems like a lot to learn for things I can do already another way… then I highly recommend you check out this book.

Written by Chrissy LeMaire and Rob Sewell this book takes you through the steps to get going before leading you through many of the tasks that dbatools can help you with.

Speaking for myself, I’m not a PowerShell afficionado, I’d struggle to write more than a few basic commands without googling the syntax. Following this book though made it really easy to get going, and I learnt a lot of PowerShell fundamentals along the way. It was certainly a lot easier than if I’d had to figure out things for myself, and it felt like I was aquiring a more complete knowledge that will stand me in good stead. Within a very short time of accessing the live book in the browser, I was set up and running commands with dbatools.

The book isn’t complete yet, but is being published through the Manning Early Access Programme, which means you can access chapters as they are published. Each chapter is a nice bite sized chunk you can consume fairly quickly and painlessly. Everything is explained really clearly – and there are comprehensive code samples you can work through.

If you’re a DBA who is aware of all this stuff, not sure you really need it, but have a niggling doubt that you’re getting left behind and probably should get around to learning it at some point, grab the book and start working through a few chapters. You’ll be operating with confidence in no time.

You can find out more about the book and access some preview content through the Manning website:

Learn dbatools in a Month of Lunches

Reminder: SQL Server Agent and the clocks going forward

For most of us in the Northern Hemisphere the clocks go foward next month and we get to look forward to longer, sunnier evenings. This post is a reminder that if you have jobs scheduled in SQL Server Agent for the night that happens you may want to check they will not be impacted by the change.

In the UK our clocks change at 1 AM and jump forward by an hour. That means the hour usually between 1 AM and 2 AM doesn’t exist. Any jobs I have scheduled in that time will not happen.

In other countries you may have the change at a slightly different tme – in the US I believe it happens at 2 AM. If you’re in other countries you may need to check what time that happens if you’re not already aware of it.

If you have jobs that need to happen that would normally be scheduled in the “missing” hour then you need to re-schedule them. You may also want to consider permanently moving them to a different time window so you don’t have to worry about it again.

Or even better, you might want to consider having your servers on UTC so you don’t have this problem.

For what will happen in the Autumn when they go back again you can check out this post:

SQL Server Agent and Daylight Saving time

And that concludes this Public Service Announcement!

Rowcount estimates when there are no Statistics

I find this is a question that comes up again and agan. What estimate for the number of rows returned does SQL Server use if you’re selecting from a column where there are no statistics available?

There are a few different algorithms used depending on how you’re querying the table. In this post we’ll look at where we have a predicate looking for a fixed value.

(If you want the short answer, it’s the fourth root of n cubed before SQL 2014 and the square root of n after that – where n is the number of rows)

This scenario can occur if you have AUTO CREATE STATISTICS turned off for your database, which we don’t recommend you do, but which you might choose to do anyway, and if you query a table with a predicate against a column with no index defined against it.

Let’s look at example querying against the AdventureWorks2012 database. I’ll start by looking at SQL Server 2012 then we’ll see how it behaves in later versions.

I’ve taken the following preparatory steps:

  • Set AUTO CREATE STATISTICS OFF for the database
  • Remove the Index on the LastName column for the Person.Person table
  • Removed any ad-hoc statistics that existed against the table

Then I run a simple query, with the Actual Execution Plan turned on:

SELECT *
FROM Person.Person p
WHERE p.LastName = 'Fox';

I only get one result out as there is only one Fox (Ms. Dorothy J.). Let’s look at the execution plan:

RowCountNoStatistics1

A clustered index scan as we might expect as I’ve removed any useful indexes from the table. You’ll notice there is a warning. If we view the tooltip you’ll see SQL warns us about the lack of statistics:

RowCountNoStatistics2

If we look at the estimated and actual row-counts we’ll see how that has affected us:

RowCountNoStatistics3

In the absence of any useful information – it knows the number of rows in the table but that is about it – SQL has estimated that there will be 1680 Foxes in the table. A bit of playing shows that we get the same estimate whatever value we search for.

If I turn AUTO CREATE STATISTICS on and run the query again then SQL generate a Statistics object against the LastName column and comes up with an estimate of 2.3 rows – which is a lot closer.

This matters a lot once you start running more complicated queries. The incorrect estimate is likely to affect the choice of plan that the optimizer makes, and may also affect the amount of memory it requests in order to run the query. Let’s look at a quick example of how the plan changes if we join the above query to another table.

First, without statistics (so I have to turn AUTO CREATE off again, and remove the statistics that got created):

SET STATISTICS IO ON;

SELECT e.EmailAddress
FROM Person.Person p
INNER JOIN Person.EmailAddress e
    ON p.BusinessEntityID = e.BusinessEntityID
WHERE LastName = 'Fox';

Here’s the execution plan:

RowCountNoStatistics4

You can see I’ve got a Merge Join as SQL thinks I’m expecting 1680 rows from the top table. A Nested Loops join would generally be better when I only expect one or two rows from that table.

I’ve also captured the IO so I can see how expensive the query was:

Table ‘EmailAddress’. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 3818, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Let’s look at the behaviour of the same query with statistics creation enabled:

RowCountNoStatistics5

You can see we now have the desired Nested Loops join and the Clustered Index Scan on the EmailAddress table has been changed to a Seek.

The IO output is below:

Table ‘EmailAddress’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 3818, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

There’s not much difference in the overall IO, but you can see the Reads for the EmailAddress table have dropped from 10 to 2 due to the change from the Scan to the Seek. If the table was a lot bigger then we could see a large difference here.

So where does that estimate come from?

I thought I’d have a play and see if I could work out how SQL decided on that estimate of 1680 rows. I did some googling and found a suggestion that it might be a straight 9% of the total number of rows in the table, but that doesn’t quite add up and when I compared the same query pattern against a few tables I found I got a different ratio depending on the amount of rows in the table.

So I pumped rows incrementally into a fresh table and looked at the estimate and what the ratio was as the number of rows increased. Here’s my SQL for that:

--Create a Horrible Heap for my testing  
CREATE TABLE dbo.TestStats(TestVal INT NOT NULL, TestText VARCHAR(255) NULL);

--Insert a bunch of rows using dodgy old-style cross joins
INSERT INTO TestStats
SELECT TOP 1 --Amount of rows I'm going to insert
    1 , 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d, sys.objects e;

--Clear the plan cache so SQL generated a new estimate  
DBCC freeproccache;

--Query the table, then I go check the execution plan generated 
--for the estimated number of rows
SELECT * FROM dbo.TestStats
WHERE TestVal = 1;   

(One thing to note was that I got the same answers whether I was querying the text column or the integer column – SQL seems to use the same algorithm for both.)

I started to notice a pattern quite quickly, that the ratio halved when the number of rows went up by a factor of 16. I then restarted my test, targeting my row-counts to be where the estimated number of rows would be a nice round number. You can see that in the table below:

RowCountNoStatistics6

I then attempted to work out a formula for that. Rather than take you through the shoddy process of mathematics that led me to an answer, I’ll just tell you that the formula came out as:

RowCountNoStatistics6b

Where e is the estimated number of rows for a given predicate value, and n is the total number of rows in the table. I checked that against the full set of results I’d gathered and it held true across all values of n I’d tested.

To check it finally against my original query – the Person.Person table had 19,972 rows. I put that through the calculator with the formula and get 1680.027. If we look back at the original estimate you’ll see that SQL stated 1680.03  – so that is all good.

As I mentioned earlier I was using SQL Server 2012 for this test, and a new Cardinality Estimator came into effect in SQL 2014. So I thought I’d run the test again with SQL 2016 and see if the results changed:

RowCountNoStatistics7

We can see the estimated rows drop off a lot quicker here. Clearly Microsoft have decided to lower the estimate. Actually it is now just the square root of the total number of rows.

RowCountNoStatistics8

Hopefully you’re not in the scenario where you regularly have queries running without the appropriate statistics to support them. The above comparison though shows us that if you have such a query its behaviour could dramatically change if you are on on older version of SQL Server version and you upgrade. It could become better or it could become a lot worse.

There are a lot of changes like this that came in with the new version of the Cardinality Estimator in 2014. Places where underlying assumptions have been adjusted to make better guesses about the number of rows that will be returned by an operator. But they are still guesses based on the same information – there is no new data being captured in the Statistics to better inform the process. Of course Microsoft has made these changes to try and better model data out in the wild – but they are still fixed assumptions, which means sometimes they will be better and sometimes they will be worse.

One thing I should re-iterate is that these formulae we’ve discovered above are for a fairly specific querying pattern. There’s no guarantee that the calculation will be the same for a similar – but different query. It might be interesting to explore that further in a later post.

Also there may be other information in your database – such as constraints – that SQL can use to educate its guesses.

The main takeaway from all of this though, should of course be:

MAKE SURE AUTO CREATE STATISTICS IS TURNED ON FOR YOUR DATABASES!

Introduction to SQL Server Query Store

Introduced with SQL 2016, Query Store was, probably without doubt, the most anticipated and talked out new feature. In this post we’ll just take a brief look at it, what it is, how you set it running, and what you can use it for. This will be a fairly brief overview – you’d need a book to cover it in detail – but hopefully this will give you a flavour of how useful this will be and how to get started.

What it does, at a base level, is actually quite simple. It just stores information relating to query execution over time.

That information consists of two things:
• Execution Plans – the execution plans generated for each query are stored in the query store, and if the plan changes the new plan is also stored.
• Performance metrics – information such as CPU consumption, reads and writes, are captured and stored for each query.

This information is aggregated over intervals (default is one hour) so you can see how query performance changes over time.

This isn’t earth shatteringly new, you can already query to find out the execution plan for a query and you can also query to find aggregated performance metrics related to a given query.

The difference is that now a history can be maintained without implementing additional monitoring. Previously the performance metrics would be aggregated as a single total since the last restart of the SQL instance – and would be cleared at the next restart. Now they are persisted and time-sliced so you can actually measure change over time.

The simple activity of storing old execution plans is also profound for performance troubleshooting. Anyone who’s worked with large scale production data will have experienced the issue when a function that was working fine, fairly suddenly starts to develop performance problems.

A common cause of this is what’s known as “plan regression”. Basically this is where the execution plan has changed – and the new one’s just not as good as the old one for most executions. Previously you might suspect this was the cause of an issue you were seeing, but there was no way to easily prove it, now you can use query store to view and compare the old and new plans to verify this. You can even with a click or two force the query to go back to using the old (better) plan – though we hope people won’t overuse this and will try to at least delve into the cause and think about resolving it. There is usually a reason SQL thought the new plan would be better – and a particular plan may work for now but may not always be the best plan in the future as your data changes.

Let’s have a look at these features in little more detail.

Enabling Query Store
Query store is a database level configuration. It’s important to understand that, and that the information stored is actually stored within system tables in the database. That means that if you backup and restore the database, the information is retained. Also very importantly, the information is stored asynchronously – so there shouldn’t be any performance impact on the executed queries themselves. There will of course be some overall server overhead at the point the data does get saved, but that shouldn’t be too significant in most cases.

You can enable Query Store for a database through T-SQL (or in your source code) or through the GUI in SSMS. I just want to demonstrate doing this through the GUI so you can see some of the options. Right-click on the database, select properties, and then select the Query Store page all the way at the bottom:

QueryStore1

Above you can see Query Store enabled for the WideWorldImporters database, with all default settings.

The first setting is “Operation Mode”. By default this is set to “Off”. To enable Query Store and get it running for a particular database you change it to “Read Write”. Job Done.

The Data Flush interval is how often the query store data gets written to disk – remembered I said this was asynchronous.

The Statistics Collection interval determines the size of the time slices that your query performance metrics get aggregated into.

Then we have some stuff about data retention. It’s worth noting that if your query store fills up and nothing is happening to clear it out then it flips to Read-Only mode and won’t store any more data until space is freed up. The default amount of space set for it is 100MB – that’s not a lot of space so I really can’t see any justification from that point of view for not enabling this feature.

Leaving the “Size Based Cleanup Mode” set to Auto should make sure that old data gets purged if the query store starts to fill up. Above that is the “Query Store Capture Mode” – if you leave that to AUTO it will ignore infrequent queries or those with negligible overhead.

The last setting “Stale Query Threshold” is how long it keeps data for in days. So 30 days for default. I can see it being useful to up this significantly it we want to use Query Store to be able to monitor performance over a long period, but it may depend on how much space Query Store wants to consume for your database – remember the default is 100MB but you can up that to whatever you like.

At the bottom of the properties page you can also see some nice pie charts that show how much of a proportion of your database Query Store has allocated, and how much of that space it is using.

So that’s Query store set up and configured, let’s have a look at a few of the things it gives us.

Query Store in Action and Forcing a Plan
I’ve set up Query Store as above in a copy of the WideWorldImporters databases on a SQL 2016 instance. I’ve created a stored procedure that I’m running every two seconds and I’ve set the Statistics Collection Interval in Query Store to 1 minute (rather than an hour) so that I can get some figures and graphs out fairly quickly.

Under the database in SSMS, there is now a Query Store folder where some built in reports reside:

QueryStore2

For the sake of this blog post I’m just going to look at a couple of these. Let’s open the “Top Resource Consuming Queries” Report:

QueryStore3

You can see a few things here. On the top left is a bar chart of the most expensive queries (you’ll notice one large one and the rest are negligible in comparison – the large one’s my query). You can configure whether you want to look by CPU or Logical Reads amongst other options and whether you want to look at averages, or maximums or minimums. Basically there a whole bunch of ways you can configure your view.

I’ll be honest that I struggled with some of these built-in Query Store reports to get them to show me what I wanted, so expect a bit of playing around to figure things out if you are using this feature.

In the bar chart, the bar highlighted in green is the currently selected query, on the right we can then see a scatter graph of the execution figures for this query across our statistics intervals (remember I’ve set it to have intervals of 1 minute). You can see I’m looking at average logical reads. You will also see that this query was ticking along nicely until about 14:05 when something happened (that was me!) and the logical reads for each execution of the query suddenly shot up. The blobs on the scatter graph have also changed colour at this point and that represents that the query is now using a new execution plan.

Next to this graph is a key telling us which plan each colour of blob represents and if you click on the plan in the key that you want, the plan itself is displayed in the bottom pane. At the moment I’m looking at the original plan (Plan 1). You will notice that the title specifies that it is “not forced”, you’ll also notice a button to the right of the title that gives us the option to “Force Plan”. Let’s just hold off a minute before we do that.

Before we change anything to try and fix the issue with this query, let’s look at the “Regressed Queries” report. This is pretty similar, but you may use it from a different direction. I.e. it may not be one your most expensive queries that has started going bad, so if you look in the Regressed Queries report it will focus on ones for which the execution plan has changed in the interval you are looking at. Again I found it a little challenging to get this report to show me the query I was interested in, some playing around can be required:

QueryStore4

You can see here that I have just one big fat bar on my bar chart – as I only have one regressed query in the interval (last 30 minutes) I chose to look at. This can make it easier to identify queries suffering this issue.

I’m going to go back the the previous Resource Consumers report and try and fix the problem. Now, in reality I know what I did and why the query went bad. It was the result of something caused parameter sniffing, which is where, if a stored procedure recompiles, the execution plan that is formed may be different depending on the parameters it is executed with. Basically it forms the best plan for the parameters supplied – but that might not be the best plan for all sets of parameters. In this case I forced the stored procedure to form a plan that was going to be expensive in most cases. More on that in my next set of performance tuning workshops.

That issue would be best fixed in the code of the stored procedure, but in production, turning around a fix may take days and we have the problem right now. So let’s use the Force Plan functionality to fix the symptom – just for the moment.

I select the plan I want, and I click the “Force Plan” button. The effect is immediate and I notice it within minutes because my statistics collection interval is so small. I’ll let it run for a bit and then show you the new graph:

QueryStore5

You can see the query has now returned back to healthy (quick) execution. Note the Orange blobs all now have a tick over them to denote that this plan is now forced.

Comparing Plans

A related feature in SQL 2016 is the ability to compare two execution plans to see what’s changed. I haven’t found this that amazing myself when I’ve looked at it, but that’s mainly due to the natural limitations – if two plans are significantly different then something that highlights the differences is just going to highlight the whole thing. However it can be useful at least to have both plans on screen at the same time so you can go back and forth easily.

You don’t need to do this through Query Store – if you right-click on any Execution Plan in SSMS there is now the “Compare ShowPlan” option, and as long as you have the plan you want to compare against saved as a file then you can go ahead. Note that one good thing is that this is an SSMS feature, so as long as you have SSMS 2016 or higher you can use it to compare plans from on earlier versions of SQL Server.

With Query Store you can compare plans directly from the Store. If we go back to one of the reports above, the plans are listed in the key for the scatter graph. You can select more than one by using Shift+Click. Then you can click the button in the toolbar above the scatter graph which has the ToolTip “Compare the Plans for the selected query in separate window.”

Let’s do that for the two plans formed for our query above. The resulting view shows us two views side by side. It may be useful to look at these separately so they fit better on this page. On the left we have:

QueryStore6

The area highlighted in Red is where the tool had identified that the two plans are the same. The rest it is not so sure about. All the same it’s a nice visual view just to be able to see what both plans are doing. On the right hand side of the screen you then get this view:

QueryStore7

This shows us a comparison of the properties of whichever operator is selected in each plan – note this need not be equivalent operator in each plan. You can select the Nested Loop operator in the top and the Index Scan operator in the bottom and it will show you those – though the comparison may not be that meaningful!

So, this is kind of a useful tool, but don’t expect it to magically solve the process of comparing plans for you. In general too much changes from one plan to the next for it to be that simple – but this may help – a bit…

Query Store Catalog Views

Like everything else in SQL Server – all the information you can access through the GUI in SSMS, is available directly through the use of system catalogs and views. So if you want to examine the information in a way that isn’t supported by the built in reports then you can just go ahead and write your own queries.

The new views available are:

sys.database_query_store_options
sys.query_context_settings
sys.query_store_plan
sys.query_store_query
sys.query_store_query_text
sys.query_store_runtime_stats
sys.query_store_runtime_stats_interval

Rather than me going into detail here, I’ll just refer you to the MSDN reference:

https://msdn.microsoft.com/en-gb/library/dn818149.aspx

Conclusions

Query store is a great feature. It’s not rocket-science but it is very useful. In particular it massively aids the investigation of production issues, saving time for those troubleshooting them at exactly the point they need time saving – right when everything’s hitting the fan.

It is also very useful for monitoring performance over time and being able to keep ahead of scalability issues with specific queries.

The methods for forcing a plan are also excellent for quick fixes – but try not to overuse them. It is possible to force plans in earlier versions of SQL – but tricky, so people usually just fixed the code. Forcing plans can end up being a case of treating the symptoms rather than the cause – and can lead to other problems later on.

Other Posts about Query Store

Identify the (Top 20) most expensive queries across your SQL Server using Query Store

How does Query Store capture cross database queries?

Why Does Everyone Have More Friends Than You?

No, the answer isn’t because you’re a DBA.

This isn’t a technical post about databases, but rather a discussion of a statistical paradox that I read about recently. Statistics and data often go hand in hand, and many of us who work with data often use statistics in our work – particularly if we cross over into BI, Machine Learning or Data Science.

So, let’s state the problem.

I think I have an average number of friends, but it seems like most people have more friends than me.

I can see this on facebook – even though facebook friendship isn’t the same thing as the real kind. A quick google tells me that the average number of facebook friends is 338 and the median is 200. My number is 238. That’s less than average but greater than the median – so according to those figures I do have more friends than most people (on facebook at least).

I could probably do with culling some of those though…

I’m going to pick ten of those facebook friends at random. I use a random number generator to pick a number from 1 to 20 and I’m going to pick that friend and the 20th friend after that, repeating until I get 10 people.

I get the number 15 to start so let’s start gathering some data. How many friends does my 15th friend have, my 35th, my 55th, 75th, 95th 115th etc.

Here’s the numbers sorted lowest to highest:

93
226
239
319
359
422
518
667
1547
4150

If I look at that list only two people have less friends than me – or 20%. 80% have more friends than me. Why do I suddenly feel lonely \ How can this be?

If I have more friends than the median, then I should be in the second half.

Friendship is based on random accidents, and I’ve picked people out of my friends list at random. Surely I’ve made a mistake.

I could repeat the sampling but I’m likely to get similar findings.

The answer is that it’s not all random, or at least not evenly so. The person with over 4,000 friends is over 40 times as likely to know me as the person with less than 100. Maybe they get out a lot more (actually they’re a musician).

I’m more likely to know people if they have a lot of friends than if they have fewer.

This is difficult to get your head round, but it’s important if you’re ever in the business of making inferences from sampled data. It’s called “The Inspection Paradox”.

It’s also one of many reasons why people may get feelings of inferiority from looking at social media.

You can find a lot more examples and explanations in this post:

https://towardsdatascience.com/the-inspection-paradox-is-everywhere-2ef1c2e9d709

SQL Server Wishlist

This month for T-SQL Tuesday Kevin Chant asks us what our fantasy SQL feature would be.

https://www.kevinrchant.com/2019/09/03/t-sql-tuesday-118-your-fantasy-sql-feature/

I think it’s appropriate to give a shout-out to Microsoft at this point, because over the last few releases they’ve given us some of the items that are top of my list.

Recommending and setting MAXDOP during setup (coming with SQL 2019) will hopefully mean I no longer have to have arguments about why the out-of the-box setting isn’t appropriate.

The same with setting max memory in the setup (also with SQL 2019).

A more verbose error where string or binary data might be truncated – we got that in SQL 2017.

It’s the little things like these that make me happy – and make my job easier.

A few other little things I’d like – though I accept that something small to describe isn’t always small in its execution…

A change to the default cost threshold for parallelism – be it 20, be it 30, be it 50. I’d prefer any of those to 5.

I think it would also be great to have a cardinality optimizer hint, e.g. OPTIMIZE FOR 5 ROWS. Oracle has this, and it’s not good having to feel jealous of those working on the dark side 😉 You can do the equivalent but it’s convoluted and not clear what’s going on when the uninitiated see the code:

https://dba.stackexchange.com/questions/168726/sql-server-cardinality-hint

There is one big thing I’d like – but it’s never going to happen. Get rid of Enterprise Edition – or rather, make Enterprise Edition the standard. Enterprise is comparatively so expensive, it’s rare I’m going to recommend it. It’s interesting to see that in Azure SQLDB we only have one edition to work with – I’d love to see that in the box product. I understand that change would be a massive revenue loss so can’t see it happening.

If not that though, if we could just have at-rest encryption (i.e. TDE) in the Standard Edition that would make me very happy. In these days of security and privacy consciousness it seems that should be a core  functionality.

UPDATE: TDE is going to be available on Standard Edition from SQL Server 2019. I get my wish!

Finally, I’d just like to upvote Brent’s idea that it would be great to be able to restore just a single table.

That all said, I’d like to go back to my first point. I love what MS is doing with SQL Server, and the continual improvements that are being made. I particularly love that we sometimes get them for free in service packs and cumulative updates – without having to upgrade to a new version.

Keep it coming!

The 99 Flavours of SQL Server are all Vanilla

There are so many options for running SQL Server now. It can feel a little intimidating. 

You’ve spent years gaining experience with SQL running on Windows Server. On bare metal or in a VM. Locally or in a data centre.

Now people are talking about SQL on Linux. SQL on VMs in the cloud. SQL Server Platform as a Service (PaaS) offering such as Azure SQL DB, Azure Managed Instance, or Amazon RDS. 

Running SQL in a container on Docker or Kubernetes. 

It’s easy to feel like you’re at risk of getting left behind. Feel like you’ve got to start learning from the beginning all over again.

The good news is that whatever the flavour – it’s still SQL Server. For everything that’s different there are a hundred things that are still the same.

Database design, writing and executing queries and stored procedures – it’s still the same.

Query tuning, indexing, looking at execution plans. All the same.

Even administration, many of the core concepts are the same, understanding how security works, backups, high-availability. The main difference is often that some of these might be taken care of for you and you don’t need to worry about them any more.

Caveat – you still need to worry about them a bit!

The point is, most of what you already know, the experience you have gained over the years, is still totally valid. Learning about SQL Server on a new platform may feel like a big learning curve, but in reality, the new stuff you need to get to grips with is small compared to all the stuff you already know.

And in some cases, the skills you already have become even more valuable. People might not care if your query tuning on physical kit takes your CPU down from 50% to 10%. But tell them you’ve just reduced their cloud bill by 80% and they really care!

So don’t be intimidated, and don’t feel you need to learn every flavour. Have a play with SQL Server in the cloud, have a play with containers, set up SQL on Linux. You’ll quickly find it’s not that hard, and once it’s running – it’s pretty much the same as ever.

And remember, if someone comes to you with a question about why SQL is running slow, or why a query isn’t doing what they want – on RDS, Docker, Linux, or whatever. You don’t need to know that platform inside out to be able to help, you already know SQL Server and that’s the important bit. 

To paraphrase a popular lyric :

If you’ve got SQL problems I can help you son. I’ve got 99 problems but SQL aint one.

Query Store stores dates as UTC

One of the powerful aspects of Query Store is the ability to directly query the DMVs for details of historical executions and performance.

A key view for this is sys.query_store_runtime_stats (but also sys.query_store_runtime_stats_interval). 

If you’re querying these views to look at what was happening during a particular time period then it’s important to understand that the dates and times are stored according to UTC time (which is equivalent to GMT with no adjustment for daylight savings times).

You can see this if you look at a few rows from the view:

SELECT runtime_stats_id, first_execution_time, last_execution_time
FROM sys.query_store_runtime_stats;

Though it will be more obvious to you if you’re in a time zone other than the UK.

The datetimes are stored as DATETIMEOFFSET which you can see from the +00:00 at the end of each entry. DATETIMEOFFSET allows you to store both the datetime as well as the timezone being used. 

This means that if you’re querying against these columns you need to convert the values you are looking for to UTC first. You can do that my making sure you use GETUTCDATE() instead of GETDATE(), or if you are using specific local times you can use the AT TIME ZONE function e.g.

SELECT CAST('2019-08-21 11:50:40.400 +9:00' AS datetimeoffset) AT TIME ZONE('UTC');

I’m a big fan of storing all datetimes as the UTC version. It avoids things going out of sequence due to daylight saving changes – and can be helpful in mitigating problems when you have application users in multiple countries accessing the same database.

I’ll admit I might be biased though as – being based in the UK – UTC is the same as my local time half the year.

Adding a Column with a Default Constraint

Last week a question came up about adding a column to a table, and giving that column a default constraint. Would that default value be assigned to all existing rows, and how much processing would be involved.

Unsurprisingly, the answer is that – “it depends”.

I’ve got a table with about a million rows that just has an identity column and a text column I’ve populated from sys.objects:

CREATE TABLE dbo.TestAddColumn (Id int IDENTITY(1,1), TextValue sysname);
INSERT INTO dbo.TestAddColumn(TextValue)
SELECT a.name FROM sys.objects a, sys.objects b, sys.objects c;

Let’s add a nullable bit column and give it a default value of Zero:

ALTER TABLE dbo.TestAddColumn
ADD NewFlag bit NULL CONSTRAINT DF_TestAddColumn_NewFlag DEFAULT 0;

If I look at the table I can see that the new column contains Null values:

i.e. the default value that I’ve defined hasn’t been assigned for existing rows.

I remove the column and the default constraint:

ALTER TABLE dbo.TestAddColumn
DROP CONSTRAINT DF_TestAddColumn_NewFlag;
ALTER TABLE dbo.TestAddColumn DROP COLUMN NewFlag;

Now let’s add the same column but we’ll disallow Null values:

ALTER TABLE dbo.TestAddColumn
ADD NewFlag bit NOT NULL CONSTRAINT DF_TestAddColumn_NewFlag DEFAULT 0;

If we check the table again:

This time we can see that the default value has been assigned.

So whether our default value gets assigned to existing rows depends on whether your column is nullable or not, a nullable column will retain Null as the value. A non-nullable column will get assigned the new default value.

If you want to override that behaviour, and have your default assigned even where the column is nullable, you can use the WITH VALUES statement. First I’ll remove the constraint and column then add it again with values:

ALTER TABLE dbo.TestAddColumn
ADD NewFlag bit NULL CONSTRAINT DF_TestAddColumn_NewFlag DEFAULT 0 WITH VALUES;

We look at the data again:

You can see that the value has been assigned even though the column is Nullable.

One neat thing to note, is the performance impact when carrying out these actions.

Each time I added the column I captured the execution overhead using:

SET STATISTICS IO, TIME ON;

In all cases the resource usage measured was Zero. The Add Column operation was a meta-data only operation – no data in the table needed to be updated – even where the new column was assigned a value.

This was some clever jiggery-pokery added in SQL Server 2012 .

Thanks for reading!