SQL Server Agent and Daylight Savings time

There was a bit of chat today on the SQL Community slack about UTC and Daylight Savings Time. It reminded me I have a post in draft somewhere about the complications of dealing with timezones in SQL that I should get around to finishing.

It also reminded me of a fact I found in an old Paul Randall post from 2007 that not a lot of people seem to know so I thought it was worth sharing again.

I discovered it when I was working on a kind of hand-cranked version of log shipping with a twist for SQL Server. Before you ask me why… don’t! I promise it was for a good reason. At least it seemed like a good idea at the time…

Anyway, I was aware that I needed to work out what I was going to do when the clocks go back in October of each year.

I decided to try and find out what the out of the box log shipping does to cater for this. I started with Google and came across this 2007 article by Paul Randall:

How does daylight savings time affect disaster recovery? – Paul S. Randal

Apparently SQL Server Agent pauses for an hour after the clocks go back. In the UK the clocks go back at 2 AM on the last Sunday in October. So all the times from 1 AM to 2 AM effectively happen twice. SQL Server Agent will run jobs from 1 AM to 2 AM as usual, then when at 2 AM the clocks go back to 1 AM, SQL Agent will pause for an hour (i.e. no new jobs will be started) and only resume the second time it gets to 2 AM.

Sounds a bit crazy in some ways, but I tend to believe most things Paul Randall says and it does seem to have been proved since in practice.

It means that you don’t get a confusing sequence of times in your job history and other events that don’t really make sense, and it also means that jobs won’t happen twice e.g. if you had an index rebuild in that time.

It could have issues for DR though, imagine that this period is actually an active one for your application and that you have an RPO (recovery point objective) of a maximum of (let’s say) 5 minutes data loss. During that hour, no LOG backups are being taken. Suddenly you’re vulnerable to not being able to meet your DR service agreement. Could be something to watch out for.

For me though (at the time I learnt this) it was great as I didn’t need to worry about coding a workaround in my log restore code!

 

SQL Puzzle 2: Eight Queens

This puzzle was first proposed in 1848 by a composer of chess puzzles called Max Bezzel and has since spawned much analysis and many variants. Simply phrased, it goes as follows:

“Can you place 8 queen’s on a standard (8×8) chessboard so that no two queen’s threaten each other?”

(Just in case, I’ll remind you a queen can move and attack any number of squares in a straight line –  horizontally, vertically, or diagonally)

Here’s one example solution:

8queens

By all means have some fun trying to find extra solutions to this with pencil and paper or chessboard, however we do have computers (and SQL!) these days, so this month’s challenge is to try find ALL the possible solutions to this puzzle using T-SQL.

I’m a bit of a closet chess enthusiast, so expect more in this theme!

Enjoy 🙂

Think twice before using table variables

T-SQL Tuesday

tsql2sday150x150

For T-SQL Tuesday this month Raul Gonzalez has asked us all to blog about lessons learnt the hard way:

http://www.sqldoubleg.com/2017/07/03/tsql2sday-92-lessons-learned-the-hard-way/

My biggest sins have been executing code against production environments when I thought I was pointing at my local machine:

DELETE FROM dbo.Blah;
DROP DATABASE Blah;

I’ve learned from those experiences – mostly that I prefer it when I don’t have access to production!

As I’ve been doing a few posts about Statistics and Cardinality estimation recently I thought I’d cover instead the subject of Table Variables. This has maybe already been blogged to death, but it’s still one of the most common anti-patterns I see when performance tuning and a lesson I have seen so many people forced learn the hard way.

Table Variables vs Temp Tables

Those experienced with programming SQL will be familiar with both Temp Tables, and Table Variables as ways of storing intermediate resultsets within (for instance) a stored procedure.

Table variables came along with SQL Server 2000 and there seems to have been a wave of belief arriving with them that they were the latest and greatest thing. That Microsoft wanted us to use them from now on and stop using Temp tables. The day of the Temp table was over, long live the Table Variable!

Many developers reworked their code to replace all their temp tables with variables.

Uh-oh…

Initial guidance from Microsoft recommended table variables as the default except “where  there is a significant volume of data and there is repeated use of the table”

https://support.microsoft.com/en-us/help/305977/inf-frequently-asked-questions—sql-server-2000—table-variables

They did however stress that you should test for any given scenario.

More recently they’ve updated their documentation to suggest “you should be cautious about using a table variable if you expect a larger number of rows (greater than 100).”

https://docs.microsoft.com/en-us/sql/t-sql/data-types/table-transact-sql

I often hear even more cautious advice:

  • Only use a variable if there will be less than 30 rows
  • Only use a variable where there will only be 1 row
  • Never use table variables! (unless you have to)

I tend to go with one of the last two, there any many more examples where a table variable can screw up your performance than there are with temp tables.

 There are times though where you still might want to use a table variable. They can be a little easier to work with, and sometime they might offer functionality you need (such as returning results from a table value function).

Just to clear up one misconception, both types of object are stored in TempDB and both of them will be mostly operative in memory rather than writing their data to disk (unless there are memory pressures, or bad estimates cause them to spill to disk). Even this is a bit more complicated than that, but in general they are managed the same way in the background.

What’s the big difference?

The big difference, and the reason you often don’t want to use a table variable is about statistics and cardinality estimation (which we’ve been looking at in the last few posts). In brief recap – statistics are what SQL uses to estimate how many rows (the cardinality) will be generated during part of a query operation. If it gets this estimation about right it can use that to form a good plan of attack for executing the query. If it gets it wrong it can make a dogs dinner of it.

And here’s the thing. Temp tables have statistics generated on them, but Table Variables don’t. As far as SQL’s concerned your table variable has exactly one row in it – regardless of reality.

Caveat: The way Temp tables manage statistics is a little bit weird, so you can get unpredictable result sometimes – but at least they try! See this if you want a really deep dive on the subject:

Page Free Space : Temporary Tables in Stored Procedures

Anyway, let’s look at a quick couple of examples of this and how it can make things go wrong.

Example 1: Getting that estimation badly wrong

First of all just to prove the point. Let’s create a table variable and a temp table and dump a million rows into each so we can look at how SQL manages estimates from them.

Table variable first:

DECLARE @a TABLE(i INT);
 
--Create a million rows
WITH Nums(i) AS
(
       SELECT 1
       FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n(i)
)
INSERT INTO @a(i)
SELECT n1.i
FROM Nums n1 
CROSS JOIN Nums n2 
CROSS JOIN Nums n3 
CROSS JOIN Nums n4 
CROSS JOIN Nums n5 
CROSS JOIN Nums n6;
 
SELECT i FROM @a;

I clicked the button in SSMS to capture the actual execution plan for this so we can see what is going on under the covers. I’m only interested in the final SELECT statement, which produces the pretty basic plan:

Variables1

What I’m really interested in though is the properties of the Table Scan, which you can get by right-clicking and selecting Properties – or just hover over to get the Tool Tip you see below:

Variables2

Look at where it says “Number of Rows Read” and “Actual Number of Rows” and you’ll see 1,000,000. Then look down to the Estimated Number of Rows and you’ll see 1. That’s quite a big difference.

Just to prove this is because we’re using a table variable, let’s quickly look at the same example with a temp table:

CREATE TABLE #a(i INT);
 
-- Create a million rows
WITH Nums(i) AS
(
       SELECT 1
       FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n(i)
)
INSERT INTO #a(i)
SELECT n1.i
FROM Nums n1 
CROSS JOIN Nums n2 
CROSS JOIN Nums n3 
CROSS JOIN Nums n4 
CROSS JOIN Nums n5 
CROSS JOIN Nums n6;
 
SELECT i FROM #a;
 
DROP TABLE #a;

Here again’s the plan for the select, you’ll see it’s exactly the same:

Variables3

But look at the values in the Tool Tip this time:

Variables4

Now you’ll see all the values for number of rows are 1,000,000 – including the estimated number or rows which is very important as that’s the one SQL uses to work out what execution plan to use.

So that example was just to show how badly wrong the estimation is, but it probably didn’t hurt us that much as SQL selected the same plan in both cases. Let’s look at an example where it does bring some pain.

 

Example 2: Selecting a dodgy execution plan

The following query uses the AdventureWorks2012 database:

DECLARE @BusinessEntityId TABLE (BusinessEntityID INT NOT NULL PRIMARY KEY CLUSTERED);
 
INSERT INTO @BusinessEntityId
SELECT BusinessEntityID FROM Person.Person;
 
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
 
SELECT COUNT(*)
FROM @BusinessEntityId b
INNER JOIN Person.Person p
       ON b.BusinessEntityID = p.BusinessEntityID;
 
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

Now let’s look at the execution plan for that final select:

Variables5

Nothing that’s too obviously controversial. It’s doing a Clustered Index scan on our Table Variable which is fine as we want all the rows that match a value in there. Then it’s doing an index seek on the Person table so it’s going straight to the results we want. Great!

Or Not…

Let’s look at the Tool Tips again, first for the Clustered Index Scan:

Variables6

We focus on the same entries of before. Actual number of rows 19,772. Estimated number of Rows 1. Bit of a discrepancy. Is that affecting our plan though? All we can be sure of at this stage is that SQL thinks that’s a pretty good plan if we only had one row in the table variable. But we have a lot more than that.

Let’s look at the Tool Tip for the Clustered Index Seek:

Variables7

We can look at the same entries as before and we see the same discrepancy. But also look at the Number of Executions – SQL estimates it would have to execute this operator once as it thinks there’s exactly one row in the table variable. In actuality it executed it 19,772 times. That’s 19,772 seeks on the same table. I happen to know that’s the same as the number of records in the table. Surely there’s a better way of doing this.

In general the Nested Loops operator you see in the execution plan is good when the first( top) table is small compared to the second (bottom) table. Maybe it’s not the best choice in this case.

Now let’s look at the same example with a temp table:

CREATE TABLE  #BusinessEntityId (BusinessEntityID INT NOT NULL PRIMARY KEY CLUSTERED);
 
INSERT INTO #BusinessEntityId
SELECT BusinessEntityID FROM Person.Person;
 
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
 
SELECT COUNT(*)
FROM #BusinessEntityId b
INNER JOIN Person.Person p
       ON b.BusinessEntityID = p.BusinessEntityID;
 
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
 
DROP TABLE #BusinessEntityId;

Here’s the plan:

Variables8

You’ll hopefully see that this plan is different. It’s using something called a Hash Match operator – this creates a hashed value based on each row from the top input (the index scan on our temp table), and then feed into that an index scan from our bottom input – the Person table.

Rather than going into too much the implications of that, let’s jump straight to looking at the Tool Tips.

From the top operator first, the Clustered Index Scan:

Variables9

We can see here that the actual and estimated number of rows are both the same. Good so far.

Now to look at the bottom operator, the index scan on the Person table:

Variables10

Now we can see that the estimated and actual number of rows are the same again, but also the estimated and actual number of executions are the same – 1 in both cases.

So we have the query with the table variable generating an execution plan that results in nearly 20,000 seeks against an index vs. the query with a temp table generating 1 scan against the same index. In general seeks are quicker than scans, but probably not 20,000 times quicker, even with the effort of the Hashing part of the operation. To quantify the difference we can look at the output from the STATISTICS IO command for each query:

Table variable:

Table ‘Person’. Scan count 0, logical reads 59916, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘#B182BEEB’. Scan count 1, logical reads 35, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 

 SQL Server Execution Times:

   CPU time = 47 ms,  elapsed time = 39 ms.

 

Temp table:

Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 67, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘#BusinessEntityId___________________________________________________________________________________________________00000000013D’. Scan count 1, logical reads 35, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 

 SQL Server Execution Times:

   CPU time = 15 ms,  elapsed time = 13 ms.

 

The output from the temp table query looks more complicated as it mentions a Workfile and Worktable that don’t get used, but it’s easy enough to see that Temp Table runs 3-4 times quicker, and more significantly generates only about 100 logical reads as opposed to about 60,000 for the table variable query. That’s a VERY big difference.

So like I said at the top, if you’re thinking about using a table variable, think twice, it might not be what you want to do.

 

 

When do Distribution Statistics Get Updated?

Statistics objects are important to us for allowing SQL to make good estimates of the row-counts involved in different parts of a given query and to allow the SQL Optimiser to form efficient execution plans to delivery those query results.

Statistics get updated automatically when you rebuild (or re-organise) an index they are based on – but we only tend to rebuild indexes that are fragmented, and we don’t need fragmentation for statistics to be stale. We also may have many auto-created statistics objects that are not related to an index at all.

It’s generally recommended to have the database level setting AUTO_UPDATE_STATISTICS turned on, so that SQL can manage the process of keeping statistics up to date for us. The only excuse to turn it off is that you are managing the updates to stats yourself in a different manner. And you can always turn the auto update off at an individual table or statistics level if you need to, rather than for the whole database.

SQL Server has had the ability to automatically update statistics since version 7.0. Nonetheless for a long part of my career working with SQL Server, whenever a performance issue raised its head everyone’s knee-jerk response would be “Update Statistics!” In most cases though the people shouting that didn’t really understand what the “Statistics” were, or what mechanisms might already be in place for keeping them up to date.

Of course SQL Server isn’t perfect and sometimes it is helpful for human intelligence to intervene. But to provide intelligent intervention one has to understand how things work.

So how does the automatic updating of statistics work?

In the background SQL maintains a count of changes to tables that might affect statistics. This can be updates, inserts or deletes. So if I inserted 100 records, updated 100 records and then deleted 100 records, I would have made 300 changes.

When SQL forms an execution plan for a query it references various distribution statistics objects to estimate row-counts and to use that to try find the best plan. The statistics objects it looks at are referred to as being “interesting” in the context of the query.

Before using values from the statistics, the Optimizer will check to see if the statistics are “stale”, i.e. the modification counter exceeds a given threshold. If it does, SQL will trigger a resampling of the statistics before going on to form an execution plan. This means that the plan will be formed against up to date statistics for the table.

For subsequent executions of the query, the existing plan will be loaded from the plan cache. Within the plan, the Optimiser can see a list of the statistics objects that were deemed “interesting” in the first place. Once again it will check each of them to see if they are “stale”. If they are, an auto-update of the statistics object(s) will be triggered and once that is complete the plan will be recompiled, in case the updated statistics might suggest a better way of executing the query. Equally, if any of the statistics objects have been updated since the last execution then the plan will also be recompiled.

One important caveat to this is the database level setting AUTO_UPDATE_STATS_ASYNC (Asynchronously). Generally it is best to have this turned off, in which case the above behaviour is observed. If you turn it on however, in the case of stale stats the query execution will not wait for the stats to be updated, but will start them updating in the background while the query executes. The plan will only recompile to be based on the new stats at the next execution.

From SQL Server2008 R2 SP2 and SQL Server 2012 SP1 we have a new DMF (Dynamic Management Function) sys.dm_db_stats_properties that allows us to see how many row modifications have been captured against a given statistics object as well as when it was last refreshed, how many rows were sampled etc. Modifications are captured on a per column basis (though when statistics were originally introduced in SQL Server it was per table) so the counter will only be affected if the leading column for the statistics object has been affected by a given operation.

SELECT
s.name AS StatsName, sp.*
FROM sys.stats s
CROSS apply sys.dm_db_stats_properties(s.OBJECT_ID, s.stats_id) sp
WHERE s.name = 'IX_Test_TextValue'

Results:

Statistics_Properties

So what are the thresholds?

For a long time the thresholds were as follows. Statistics were considered stale if one of the following was true:

  • The table size has gone from 0 rows to more than 0 rows
  • The table had 500 rows or less when the statistics were last sampled and has since had more than 500 modifications
  • The table had more than 500 rows when the statistics were last sampled and the number of modifications is more than 500 + 20% of the row-count when the statistics were last sampled (when talking about tables with larger row-counts a lot of the documentation just describes this as 20% as the additional 500 becomes less and less relevant the larger the number you are dealing with).

Those thresholds did mean that when a table had a large number of rows, Statistics might not get updated that often. A table with a million rows would only have stats updated if about 200,000 rows changed. Depending on the distribution of the data and how it is being queried this could be a problem.

So, in SQL 2008 R2 SP2 Microsoft introduced Traceflag 2371 which when set would reduce the stale statistics threshold for larger tables. From SQL 2016 this is the default functionality.

That adds the following test for statistics being stale:

  • If the number of rows (R) when the statistics were last sampled is 25,000 or more and the number of modifications is more than the square root of R x 1000:

Statistics_1000R

Now, I’m just going to correct myself here, the documentation I’ve found SAYS the threshold is 25,000 but when I started to have a play that didn’t seem to be the case at all.

What actually seems to happen is that whichever of the two estimates is smaller gets used i.e

Either:

Statistics_20pcnt

Or:

Statistics_1000R

Whichever is smaller.

I don’t know if this means that both get evaluated and the smaller is used, or if the threshold between the two rules is simply defined at the point where the second formula gives the smaller result – which is after 19,682 rows. I discovered that threshold by solving where the two equations above would give the same result – then by experimenting to prove it in practice.

I think this incorrect stating of 25,000 as the threshold probably comes from confusion, taking an approximation (20%) as the actual figure. Remember I mentioned that people often generalise to say that statistics are stale after 20% of the rows change, and forget about the extra 500 rows. If that was true and it was exactly 20%, then the threshold would be 25,000 as that would be the point that both equations are equal.

Anyway it’s not even vaguely important to know that. I just found it interesting! Note that the tests above were carried out on SQL Server 2012 SP3 so could well be different on later versions.

To more visually understand the above rules, here’s a table showing the thresholds for some example table sizes under both the Old algorithm (without the traceflag) and the New algorithm (with the traceflag or on SQL 2016 or later).

R is the number of rows when the statistics were last sampled and T is the number of modifications for statistics to be considered stale:

Statistics_Thresholds

You can see for the larger table sizes there is a massive difference. If you’ve got large tables you’re querying against and are having to update the statistics manually to keep them fresh then you may find implementing the traceflag is a help.

Row-count Estimates when there are no Statistics

I was discussing Cardinality Estimation with a colleague recently and the question came up, what cardinality does SQL Server use if you’re selecting from a column where there are no statistics available? I’ve discovered there are a few algorithms in play depending on how you’re querying the table. In this post we’ll look at where we have a predicate looking for a fixed value.

(If you want the short answer, it’s the fourth root of n cubed before SQL 2014 and the square root of n after that.)

This scenario can occur if you have AUTO CREATE STATISTICS turned off for your database, which we don’t recommend you do, but which you might choose to do anyway, and if you query a table with a predicate against a column with no index defined against it.
Let’s look at example. I’m using SQL Server 2012 for this investigation, and to start with am querying the AdventureWorks2012 database.
I’ve taken the following preparatory steps:

  • Set AUTO CREATE STATISTICS OFF for the database
  • Remove the Index on the LastName column for the Person.Person table
  • Removed any ad-hoc statistics that existed against the table

Then I run a simple query, with the Actual Execution Plan turned on:

SELECT *
FROM Person.Person p
WHERE p.LastName = 'Fox';

I only get one result out as there is only one Fox (Ms. Dorothy J.). Let’s look at the execution plan:

RowCountNoStatistics1

A clustered index scan as we might expect as I’ve removed any useful indexes from the table. You’ll notice there is a warning. If we view the tooltip you’ll see SQL warns us about the lack of statistics:

RowCountNoStatistics2

If we look at the estimated and actual row-counts we’ll see how that has affected us:

RowCountNoStatistics3

In the absence of any useful information – it knows the number of rows in the table but that is about it – SQL has estimated that there will be 1680 Foxes in the table. A bit of playing shows that we get the same estimate whatever value we search for.

If I turn AUTO CREATE STATISTICS on and run the query again then SQL generate a Statistics object against the LastName column and comes up with an estimate of 2.3 rows – which is a lot closer.

This matters a lot once you start running more complicated queries. The incorrect estimate is likely to affect the choice of plan that the optimizer makes, and may also affect the amount of memory it requests in order to run the query. Let’s look at a quick example of how the plan changes if we join the above query to another table.

First, without statistics (so I have to turn AUTO CREATE off again, and remove the statistics that got created):

SET STATISTICS IO ON;

SELECT e.EmailAddress
FROM Person.Person p
INNER JOIN Person.EmailAddress e
    ON p.BusinessEntityID = e.BusinessEntityID
WHERE LastName = 'Fox';

Here’s the execution plan:

RowCountNoStatistics4

You can see I’ve got a Merge Join as SQL thinks I’m expecting 1680 rows from the top table. A Nested Loops join would generally be better when I only expect one or two rows from that table.

I’ve also captured the IO so I can see how expensive the query was:

Table ‘EmailAddress’. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 3818, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Let’s look at the behaviour of the same query with statistics creation enabled:

RowCountNoStatistics5

You can see we now have the desired Nested Loops join and the Clustered Index Scan on the EmailAddress table has been changed to a Seek.

The IO output is below:

Table ‘EmailAddress’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 3818, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

There’s not much difference in the overall IO, but you can see the Reads for the EmailAddress table have dropped from 10 to 2 due to the change from the Scan to the Seek. If the table was a lot bigger then we could see a large difference here.

So where does that estimate come from?

I thought I’d have a play and see if I could work out how SQL decided on that estimate of 1680 rows. I did some googling and found a suggestion that it might be a straight 9% of the total number of rows in the table, but that doesn’t quite add up and when I compared the same query pattern against a few tables I found I got a different ratio depending on the amount of rows in the table.

So I pumped rows incrementally into a fresh table and looked at the estimate and what the ratio was as the number of rows increased. Here’s my SQL for that:

--Create a Horrible Heap for my testing  
CREATE TABLE dbo.TestStats(TestVal INT NOT NULL, TestText VARCHAR(255) NULL);

--Insert a bunch of rows using dodgy old-style cross joins
INSERT INTO TestStats
SELECT TOP 1 --Amount of rows I'm going to insert
    1 , 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d, sys.objects e;

--Clear the plan cache so SQL generated a new estimate  
DBCC freeproccache;

--Query the table, then I go check the execution plan generated 
--for the estimated number of rows
SELECT * FROM dbo.TestStats
WHERE TestVal = 1;   

(One thing to note was that I got the same answers whether I was querying the text column or the integer column – SQL seems to use the same algorithm for both.)

I started to notice a pattern quite quickly, that the ratio halved when the number of rows went up by a factor of 16. I then restarted my test, targeting my row-counts to be where the estimated number of rows would be a nice round number. You can see that in the table below:

RowCountNoStatistics6

I then attempted to work out a formula for that. Rather than take you through the shoddy process of mathematics that led me to an answer, I’ll just tell you that the formula came out as:

RowCountNoStatistics6b

Where e is the estimated number of rows for a given predicate value, and n is the total number of rows in the table. I checked that against the full set of results I’d gathered and it held true across all values of n I’d tested.

To check it finally against my original query – the Person.Person table had 19,972 rows. I put that through the calculator with the formula and get 1680.027. If we look back at the original estimate you’ll see that SQL stated 1680.03  – so that is all good.

As I mentioned earlier I was using SQL Server 2012 for this test, and a new Cardinality Estimator came into effect in SQL 2014. So I thought I’d run the test again with SQL 2016 and see if the results changed:

RowCountNoStatistics7

We can see the estimated rows drop off a lot quicker here. Clearly Microsoft have decided to lower the estimate. Actually it is now just the square root of the total number of rows.

RowCountNoStatistics8

Hopefully you’re not in the scenario where you regularly have queries running without the appropriate statistics to support them. The above comparison though shows us that if you have such a query its behaviour could dramatically change if you upgrade your SQL Server version to 2104 or higher. It could become better or it could become a lot worse.

There are a lot of changes like this that came in with the new version of the Cardinality Estimator in 2014. Places where underlying assumptions have been adjusted to make better guesses about the number of rows that will be returned by an operator. But they are still guesses based on the same information – there is no new data being captured in the Statistics to better inform the process. Of course Microsoft has made these changes to try and better model data out in the wild – but they are still fixed assumptions, which means sometimes they will be better and sometimes they will be worse.

One thing I should re-iterate is that these formulae we’ve discovered above are for a fairly specific querying pattern. There’s no guarantee that the calculation will be the same for a similar – but different query. It might be interesting to explore that further in a later post.

Also there may be other information in your database – such as constraints – that SQL can use to educate its guesses.

The main takeaway from all of this though, should of course be:

MAKE SURE AUTO CREATE STATISTICS IS TURNED ON FOR YOUR DATABASES!

Statistics and Cardinality Estimation

Cardinality

This is a term originally from Mathematics, generally defined as “The number of objects in a given set or grouping”. In SQL we’re continually dealing with sets so this becomes a very relevant topic, which in our context is just the “number of rows”.

When you have a query across multiple tables there any many ways in which SQL Server could decide to physically go about getting you the results you want. It could query and join the tables in any order and it could use different methods for matching records from the various tables you have joined together. It also needs to know how much memory to allocate for the operation – and to do that it needs to have an idea of the amount of data generated at each stage of processing.

A lot of this requires cardinality estimation, and SQL Server uses something called Statistics objects to perform that calculation.

Let’s look at a simple example:
SELECT *
FROM Person.Person p
INNER JOIN Person.[Address] a
ON p.AddressId = a.AddressId
WHERE p.LastName = 'Smith'
AND a.City = 'Bristol'

When it comes to gathering the results for this query there are a number of ways the database engine could go about it. For instance:

a) It could find all the records in the Person table with a LastName of Smith, look each of their addresses up and return only the ones who live in Bristol.
b) It could find all the Addresses in Bristol, look up the people associated with each address and return only the ones called Smith.
c) It could grab the set of people called Smith from the People table, grab all the addresses in Bristol, and only finally match up the records between those two sets.

Which of those operations is going to be most efficient depends very much on the number of records returned from each table. Let’s say that we have a million people called Smith, but there’s only one address in our whole database that is in Bristol (and let’s say that address does actually belong to someone called Smith).

In the first method above I would grab the million Smiths and then look their address up one by one in the address table until I found the one that lived in Bristol.

If I used method b) though, I would find the one matching address first, and then I would simply look up the owner of that address. Clearly in this rather contrived example, that’s going to be a lot quicker. So if SQL knows ahead of time roughly how many records to expect from each part of the query, hopefully it can make a good decision about how to get the data.

But how can it work out how many rows will be returned without actually running the query?

Statistics

That’s where statistics objects come in. SQL Server maintains in the background data that equates to a histogram showing the distribution of the data in certain columns within a table. It does this any time you create an index – statistics will be generated on the columns the index is defined against, but it also does it any time it determines that it would be useful. So if SQL encounters a Where clause on Person.LastName – and that column isn’t involved in a useful index, SQL is likely to generate a statistics object to tell it about the distribution of data in that column.

I say “likely to” because it actually depends on the settings of your SQL instance. Server configuration is beyond the scope of this post but suffice to say you can let SQL automatically create Statistics objects – or not. You can let it automatically update them when the data has changed by more than a given threshold – or not. And you can specify whether updates to statistics should happen asynchronously or synchronously – i.e. in the latter case if your query determines that statistics needs updating then it will kick that off and wait until the update is complete before processing the query.

It’s generally recommended that auto creation and updating is on, and async updating is off.

Viewing Statistics Objects
Let’s have a look at some actual statistics and see what they hold. There are a couple of ways of doing this, the first is through SSMS. If you look under a table in the object browser you will see a Statistics folder which holds any statistics objects relating to that table:

Statistics1

In the above example you can see some that have friendly names, these are Statistics that  are related to an actual index that has been defined on the table and they have the same name as the Index – e.g. IX_Address_StateProvinceId.

You’ll also see we have some prefixed _WA_Sys and then with some random numbers following. These are statistics objects that SQL has created automatically on columns that aren’t indexed – or at least they weren’t indexed at the time the Statistics objects were created.

You can open these up with a double-click and see what’s inside:

Statistics2

This is the General tab you open up to. You’ll see it tells you what table the Statistics are for and what column(s). There are options for adding columns and changing the order – but you never really need to do this – as well as information to tell you when the statistics were last updated, and a check box if you want to update them now.

In the details tab there’s a lot more info:

Statistics3

I don’t find this the easiest display format to work with though, so rather than delving into what everything means here let’s look at the other way you can view statistics, which is by running the following command:

DBCC SHOW_STATISTICS('Person.Address', '_WA_Sys_00000004_164452B1')

The format is straightforward, you just specify the table you are interested in the Statistics for, and the actual name of the Statistics object you want. You can see the same information as if you double-clicked on it, but the results are output in the results pane like any other query and are (I think) a lot easier to read. Allegedly there will soon be a third way in SQL Server to view Statistics as DBCC commands are considered a bit “clunky” – but we don’t know what that will look like yet.

The command outputs three resultsets:

Statistics4

This post is just an introduction to statistics – and generally you don’t need to know that much, it’s just handy to understand the basics. So let’s just run over the key bits of information you can see above:

First of all in the first recordset – otherwise know as the…

Stats Header

Rows – is the number of rows in your table

Rows Sampled – this is how many rows were sampled to generate the statistics. SQL can generate or update statsitics using sampling rather than reading all the rows. In this case you’ll see it did actually read the whole table.

Steps – If you imagine the statistics as a bar chart – this is the number of bars on the chart. Statistics objects have a maximum of 200 steps so if you have more distinct values in your column than that they will be grouped into steps.

Density – This is supposed to be the probability of a row having a particular value (calculated as 1 / Number of Distinct values in column). According to books online “This Density value is not used by the query optimizer and is displayed for backward compatibility with versions before SQL Server 2008.”  I am using SQL 2012, and this number is just plain incorrect so don’t use it…

Recordset Number 2: The Density Vector

All Density – this is the accurate version of the Density statistic described above. So your probability of a given row having a specific value is about 0.0017. That’s a little less than one in 500. I happen to know there are 575 different Cities in the table so that makes sense. Sometimes SQL will use this value to form a plan – if it knows you’re going to search this table for a specific City and it doesn’t know that City when it makes the plan, then it could guess that about 1/500th of the rows will match your criteria.

Average Length – Is what it says on the can. The average length of data in this column.

Columns – The names of any column measured in this statistics objects. You can have statistics across multiple columns but I’m not going to cover that in this post. In this case it tells us these statistics are based on the “City” column.

Recordset Number 3: The Histogram

This last recordset shows the distribution of the data, and is what you could effectively use to to draw a graph of the relative frequencies of different groups of values. Each row represents a step – or bar on the bar chart – and as mentioned above there can be a maximum of 200 steps. So you can see, a statistics object is quite lightweight, even for a massive table.

RANGE_HI_KEY – This upper limit of each step, so each step contains all the values bigger than the RANGE_HI_KEY of the last step, right up to and including this value.

RANGE_ROWS – This is how many rows in the table fall in this range – not including the number that match the RANGE_HI_KEY itself.

EQ_ROWS – The number of rows equal to the HI_KEY

DISTINCT_RANGE_ROWS – The number of different values in the range that there is data for (excluding the HI_KEY).

AVERAGE_RANGE_ROWS – The average number of rows for a given value within the range.

That’s a whistle-stop tour of the Statistics SQL Server holds on your data.

 

The algorithms that SQL then uses to calculate the number of rows for a given part of your query are pretty transparent when there’s just one column involved. If we look at the above example and let’s say you wanted to look up the rows where the City is “Abingdon” – the statistics tell us there is 1 matching row and that’s the figure SQL will use for cardinality estimation. Where a value is within a range then it will use a calculation based on the AVERAGE_RANGE_ROWS.

When there’s multiple columns involved it’s more complicated there are various algorithms and assumptions that come into play. If you’re interested in digging deeper, one very good resource is the Whitepaper on the 2014 Cardinality Estimator written by Joe Sack: https://www.sqlskills.com/blogs/joe/optimizing-your-query-plans-with-the-sql-server-2014-cardinality-estimator/

Conclusions

The main takeaway from this should just be the understand the extent – and limitations – of the information about the distribution of your data that SQL holds in the background.

If, when you’re tuning queries, you notice that the estimated row counts don’t match the actual, then this could be encouraging SQL to form a bad plan for the query. In these cases you might want to investigate what’s going on with the statistics.

Maybe your query is written in a way that it can’t use statistics effectively, one example of this can be where you store constant values in variables, then query using that variable in a WHERE clause. SQL will then optimise based on the average, rather than on your actual value.

Maybe the plan is based on one data value that has a very different cardinality to the one currently being queried. For instance when you first run a stored procedure, the plan is formed based on the parameters passed. Those parameters could have a cardinality that is quite different to those used in later executions.

Maybe the statistics are out of date and need refreshing. Stats get updated when approximately 20% of the data in the table changes, for a large table this can be a big threshold, so current statistics may not always hold good information about the data you are querying for.

Or maybe SQL is doing as good a job as it can with the information it has at its disposal. You might need to work out how you can give it a little extra help.

This is in only intended to be an introduction to Statistics objects. If you want to find out more, your favourite search engine is your friend. For the moment, I hope this makes it clear what Statistics objects are, what they are used for, and why they are important.

What’s an “Index” and how do they work?

What is an Index?

We often hear indexes explained using the analogy of an index in the back of a book. You want to find the information about “rabbits” for instance – and so you look that up in the back and find the list of pages that talk about rabbits.

That’s kind of a handy description but I find it’s not the best for understanding what’s really going on. I often describe an index as a subset or different representation of the data in your table, ordered to be of particular benefit to querying the data in a certain way. Maybe you want to query the data by Created Date – let’s have an index then in the order of that date to make it easy to find stuff.

But important to understand is…

Your table is itself an index
At least I hope so. When you create a table, hopefully you define a primary key. When you do so SQL Server will automatically define what’s called a Clustered Index based on whatever you’ve specified as the primary key. The Clustered Index (and you can only have one on a table for obvious reasons) defines the order that the data in the table will be stored on disk. The data itself is stored in a structure that is the same as any other index you define on the table (we call these non-clustered indexes) just with a few differences in exactly what is stored.

So there is no separate “table” stored on the disk in some manner, there are just indexes – the Clustered one and any non-Clustered indexes you create.

I say I hope though – because you might have decided – or have neglected – to define a primary key. Or you might have deliberately stopped SQL from creating a Clustered index. Then you are saying you don’t want to specify in what order the data is stored – “Just chuck it down there, I don’t care!” and then you have what is known as a Heap. A Heap is just a pile of data in no specified order, and whenever you want to read from it you’re going to have to scan the whole thing to find whatever you want.

We don’t like Heaps!

You might think “well, that sounds like a quick way of writing data to disk, and I’m hardly ever going to read the data from this table but I write to it a lot, so a heap sounds great”.

True, there are some edge cases where a heap is appropriate, but they really are edge cases. If you ever think a heap is the way to go, then set up a scalability test to prove it achieves what you want.

In general when you’re creating a new table, you want to think carefully about what the primary key and clustered index should be (maybe they should, but maybe they shouldn’t be the same), based on how the data is going to be written and read.

What does an index look like?
So you’ve got a table, and its physical representation on disk is as a clustered index, the picture below shows how that actually looks in logical terms:

Clustered_Index

This something called a B-tree, the B is generally agreed to stand for “balanced”, the idea being that you have to traverse the same amount of steps to find any piece of data. I don’t want to go into process this in ultimate depth, but let’s cover what’s useful.

We have a Root Node and an Intermediate Level – these (like all data in SQL Server) are made up of 8KB pages. At the root and intermediate level these just contain the Clustered Key ranges (as noted above this defaults to the Primary Key you defined) and pointers where to find those Key values on the next level down. Then we have the leaf level which contains the data itself, 8KB pages containing the complete rows of data. So if we’re looking for a particular record e.g. Id = 12, we can route quickly through the top levels to find exactly the page we want. This is much quicker than having to scan all the Leaf pages.

A non-clustered index is basically the same structure:

NonClustered_Index

You’ll see that the Root and Intermediate levels contain both the Index Key and the Clustered key. The Index key is whatever you decided to define an index on – in the above example this might be Created Date, and assuming we’re searching based on that SQL will use this index to quickly find the leaf page containing the data we want.

The main difference is at the leaf level. Here we don’t hold the complete row of data. We hold the Index key and the clustered key. If your query just wants to grab the clustered key value (often the Id) then that is fine. If it wants additional data then it can use that key value to look up the data you require in the Clustered Index. So effectively your query uses this index to gather the Clustered Index value (or values) for the rows it wants and then uses the Clustered index to get any extra data it wants.

You’ll see that can mean that we’re actually using two indexes to satisfy a query. That’s why we also have at the leaf level what you’ll see referred to as “Included Columns”. When you define an index you can choose extra columns to be stored in the leaf level pages (using the INCLUDE statement). Let’s say I’m looking at a table called Users and I want to be able to grab the email address for a user called “mcgiffen”. I can create an non-clustered index on Username and tell SQL to include the column EmailAddress. Then the query will find all the data it needs in the non-clustered index and doesn’t need to refer back to the clustered one.

When we create a non-clustered index that will serve all the needs of a particular query in this way (or set of queries) we call that a covering index for that query. A query is always going to be faster if it has a covering index.

But doesn’t that mean I’m writing the same data in multiple places?
You’ll see from the above why I refer to an index as a subset/different representation of the table itself (which hopefully is also an index). And yes – that does mean that when you’re writing to a table with multiple indexes you are actually writing some of the same data to multiple separate structures. I know… that sounds horribly inefficient.

Let’s think about a table with 100,000 rows containing information about people. Let’s say I sometimes access the data by Id value and that is the primary and clustered key. But I also access it based on the person’s name and that’s equally often so I’ve created an index on their name.

I also read from the table a lot more than I write to it, I want to access the data about a person far more often than I want to change it. Sure, when I write to it, I have to write some of the data twice, but if I didn’t have that extra index on their name, then every time I wanted to query by a person’s name I’d have to read all 100,000 rows. So I have to do one more write, but I save all those reads. That’s a pretty good trade off.

It does mean though that you want to be effective with your indexing, you don’t just want to throw every index you can think of at a table and you want to try and make indexes that can satisfy the needs of multiple queries.

From SQL Server 2008 you can have up to a thousand indexes on a table. Generally though, up to 4 seems reasonable – and if you’ve got more than 15 then that’s raising a red flag.

However there’s no hard and fast answer about how many indexes you should have, it depends on the balance of reads and writes, and it depends on the different types of access you have to the table. In general, it is good to have an index on any column that acts as a Foreign Key, and columns that will be used as the main predicates when querying a table are also good candidates to be indexed.

But be careful when adding a new index and monitor to make sure it hasn’t actually slowed things down – not just the query you’re working on but any other existing queries that use the same table as your index may change the way they work too.

SQL Puzzle 1: Magic Squares

This is the first in what I hope will be a semi-regular series of recreational puzzles where SQL can be used to find the answer. I set these puzzles on occasional Fridays in my workplace, and as I now have quite an archive I thought I should start sharing them with a wider audience.

I came up with the idea for this puzzle involving Magic Squares by accident while trying to do something different – and getting it wrong! For a moment I thought I’d had an original idea but then it was “Hang on – I’m sure I must have seen this before.”

A bit of googling and that thought was proved right – I’m just not that original. In fact the first documented instance of these appears to be the Chinese legend of Lo Shu – possibly dating back as far as 650BC. I was only beaten by the best part of three millennia…

You may have seen these before at some point – but hopefully long enough ago that you don’t know all the answers! Possibly like me they ring a bell for you but you can’t remember exactly where from.

A Magic Square is a square grid e.g. 1×1, 2×2, 3×4, 4×4…. or n x n, where numbers are placed in each square in the grid such that if you add up the numbers in each column, row, or diagonally they all add up to the same total.

Usually you use consecutive numbers starting from 1 (placing each number just once), so in the 2 x 2 grid you would place the numbers 1 to 4, in the 3 x 3 the numbers 1 to 9 – and so on.

Here’s the 1 x1 grid:

MagicSquare1

I thought I’d be generous and give you that one for free 😉

So my first question is can you find a solution (with the numbers 1 to 4) to the 2 x 2 grid:

MagicSquare2

Having explored that possibility (not for too long I hope!) the meat of this puzzle is to find solutions to the 3 x 3 grid using the numbers 1 to 9:

MagicSquare3

In fact, using SQL, can you not just find one solution, but all the possible solutions? How many are there?

Share your attempts and answers in the comments.

Have fun 🙂

Working with Data in Always Encrypted

In this post we’re going to create some encrypted columns in a table in a test database and look at some of the practicalities and limitations of working with Always Encrypted (AE).

There are actually a fair few limitations, but not because there anything wrong with the technology, rather they naturally fall out of logic of working with encrypted data.

Let’s just bash on with it and you’ll see what I mean.

Setting up a table with encrypted columns

I’ve got a database called AlwaysEncryptedTest, and it’s sitting on a remote instance – i.e. not my local machine. I’ve also got the Keys I created in the last post (Understanding Keys and Certificates with Always Encrypted). I’m now going to create a table with a couple of encrypted columns:

CREATE TABLE [dbo].[EncryptedTable](

Id INT IDENTITY(1,1) CONSTRAINT PK_EncryptedTable PRIMARY KEY CLUSTERED,

LastName [nvarchar](32) COLLATE Latin1_General_BIN2 ENCRYPTED WITH (

COLUMN_ENCRYPTION_KEY = MyColumnKey,

ENCRYPTION_TYPE = Deterministic,

ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'

) NULL,

FirstName [nvarchar](32) COLLATE Latin1_General_BIN2 ENCRYPTED WITH (

COLUMN_ENCRYPTION_KEY = MyColumnKey,

ENCRYPTION_TYPE = Randomized,

ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256') NULL

);

There’s plenty to discuss there already. You might first of all ask why am I specifying a particular collation for those columns (Latin1_General_BIN2) and overriding my database collation settings. I hate people who do that! The answer is I don’t have a choice, text columns encrypted with AE must have a BIN2 collation. BIN2 collations are (apart from anything else) case-sensitive.

If you think about it, it makes sense that you can’t have a case-insensitive comparisons on an encrypted column – and therefore those collations cannot be supported. When you perform a comparison based on encrypted data, what the engine is doing is comparing one encrypted value with another. To enable a case-insensitive comparison there would have to be some deterministic pattern so that you can tell that two different encrypted values differ only by case, that would be more complicated to implement, would weaken the encryption, and isn’t supported by the algorithm used by AE. The requirement to use a BIN2 collation is driven by what will happen in practice when you compare two encrypted values looking for an exact match.

So there’s something to think about. Any searches you do against encrypted text data are going to be case sensitive. So say your application allows you to search for people by name, and you’re looking for “McGiffen” (my name) then you’re not going to find it if you type “Mcgiffen” or “mcgiffen”. For someone like myself who often see mis-captitalization of their name, as little as I care , it may be a struggle to find me if my name (as it’s Personal Identifiable Information) has been encrypted.

So how do you deal with this? I don’t know – you’ll have to think about that – maybe maintain an upper case copy for searching against and transform search terms into upper case BEFORE sending them to the database.

There are also a whole bunch of other data types that you can’t encrypt which I would imagine are to do with similar practicalities of the encryption implementation. Rather than list those here you can find that set and other limitations in this MSDN article:

https://msdn.microsoft.com/en-GB/library/mt163865.aspx?f=255&MSPPError=-2147217396

You can see in the SQL above I’ve specified that the COLUMN ENCRYPTION KEY “MyColumnKey” which I created in the previous blog post. But then for one column I’ve specified ENCRYPTION TYPE of “Deterministic” and for one it is “Random”. What’s the difference?

Well, Deterministic means that every time you encrypt a given value with a given key, the encrypted result will be the same. Randomized means it will be different. With deterministic it might be possible for someone to perform statistical analysis based on the distribution of your data to infer what some common values are – with Randomized that’s not going to happen.

However, if your encryption type is Randomized then you can’t search on that column and you certainly can’t join on it (or group or index on it either).

This makes perfect sense if you realise that SQL only knows the encrypted value of your data, so if it is to return results by comparing one encrypted value with another then the values must match for anything to be returned, for that to happen they must both have been encrypted using deterministic encryption.

Finally, in the column definition I’ve specified the encryption algorithm – the one you see is the only algorithm currently supported by AE, so just use the same one.

Inserting Encrypted Data

Now we’ve got the table let’s try inserting some data.

It should be no big surprise that if I want to run the following query from SSMS it doesn’t work:

INSERT INTO dbo.EncryptedTable (LastName, FirstName)

VALUES ('McGiffen','Matthew ');

You get a longish error message that basically tells you you’re trying to stick unencrypted data in an encrypted column:

Msg 206, Level 16, State 2, Line 24

Operand type clash: varchar is incompatible with varchar(8000) encrypted with (encryption_type = ‘DETERMINISTIC’, encryption_algorithm_name = ‘AEAD_AES_256_CBC_HMAC_SHA_256’, column_encryption_key_name = ‘MyColumnKey’, column_encryption_key_database_name = ‘AlwaysEncryptedTest’) collation_name = ‘Latin1_General_CI_AS’

To be able to insert data, there’s a few things you need to do.

First of all, this isn’t actually supported in SSMS 2016, so you need to download and install the vNext version which has the new feature “Parameterization for Always Encrypted”. Prior to this version if you want to test/play (beyond a few basics) with always encrypted data you had to write code in C# or another language.

Once you’ve got SSMS vNext, you can connect to your database. You have to enable Column Encryption for your connection – you’ll need to do this in your connection string whatever client you are querying your database from if your query might touch encrypted data – be that your application, SQLCMD or SSMS.

In SSMS you do that when you’re connecting to the instance in the “Connect to Server” dialog. Select “Options”, and go to “Additional Connection Parameters” and enter into the box “Column Encryption Setting = Enabled” – as per the image below:

AE_ConnectToServer

(One thing to note, is that SSMS will then try to use this setting for all subsequent connections you make until you go in and remove it. As the setting is only supported on SQL 2016 onward you get an error connecting to older instances, which is little bit of a pain…)

Now we’ll open a new query window against our database, and then we have to actually enable “Parameterization for Always Encrypted”. Right-click over the query and select “Query Options”, then “Advanced” and enable the option as shown below:

AE_Parametization.png

There’s an MSDN article about this feature here:

https://blogs.msdn.microsoft.com/sqlsecurity/2016/12/13/parameterization-for-always-encrypted-using-ssms-to-insert-into-update-and-filter-by-encrypted-columns/

In simple terms, it allows SSMS to perform additional parameterization on your query and encrypt any literal values that need encrypting before sending them to the database. In the previous versions of SSMS this would have just been sent plain text – and so the engine barks at you as in my error above.

The final step to get this to work is that you have to rewrite your original query to use variables rather than literal values in the insert. This helps SSMS to be able to manage the rest of the parameterization process. I’m going to show you a screen-shot of the updated code as it gets displayed in SSMS as there’s something else to notice:

AE_InsertSQL

See that my variable declarations and assignments are underlined in squiggly blue. If I hover over one of them, then I get this message:

AE_ToolTip

This shows me that the Parameterization is in action and working correctly.

If I run the insert now it works fine. Woo-hoo.

There are some implications of this if we think it through. You are only going to be able to do inserts based on values held in a variable or parameter. So you can only do one row at a time – and no, you can’t use a table variable for this. That means you can’t use value lists in an insert or merge statement, but logically, you probably don’t need to. Remember this is likely to be only sensitive personal information we’re encrypting – not reference data for which we may have scripts with lots of values. As such we would generally be inserting this one row at a time, and ideally through a stored procedure using parameters. So everything should be peachy.

It’s also logical to realise you can’t insert from one table to another unless the values in both tables are encrypted (hopefully using the same keys!). This goes back to the fact that the engine can’t encrypt or decrypt values – that has to happen in the client. The above insert only works because SSMS (our client) is doing all the work.

Let’s just run a select and check that we can see the data:

SELECT * FROM dbo.EncryptedTable

AE_Select1

Cool, this works. Note this will also work in SSMS 2016 as the query is simple enough to not require parameterization.

Let’s just have another look at the same data, but this time I’m going to run it direct from the server where the certificate for the Column Master Key doesn’t exist:

AE_Select2

So here you can only see the encrypted version of the data.

Out of interest let’s look at the full value for the LastName field:

0x0180B908F995B86C64511991CDE834DD4888A058D1B482E5E14297FFDA9D3C8E0828E3B80F3C58F98197AC73F5867E2837DD2E6C266FFAA95E38A7B08C111AF7EBEEC476EE6BBDB9704F1AC007CE1F613B

That’s just a little bit longer than “McGiffen” – so you can see there is going to be some storage overhead on your encrypted columns.

I’m just going to chuck a few extra rows into the table (including a duplicate record for me) so I can demonstrate a couple of points mentioned earlier. Here’s what we have when I’m done:

AE_Select3

Now let’s look again at the encrypted version of that:

AE_Select4

Remember in my table definition, I set LastName to be Deterministic Encryption, but FirstName to be randomized. If you look at the encrypted values above we can see some repeated values in the LastName column – even if we didn’t have the Ids to compare with the first recordset it’s pretty straightforward to work out which is “McGiffen” and which is “Rubble”.

Then look at FirstName. We know both records 1 and 2 though have a first name of “Matthew”. If we check the encrypted values in the FirstName column we can see they are both different – this is Randomized Encryption in practice.

Querying Encrypted Data With a Predicate

I’m going to go back to my encrypted (and parameterized) connection and try running some queries against those tables:

DECLARE @LastName NVARCHAR(32) = 'McGiffen';

SELECT * FROM dbo.EncryptedTable

WHERE LastName = @LastName

Results:

AE_Select5

So that worked fine, let’s just try and isolate those first two duplicates at the beginning though:

DECLARE @LastName NVARCHAR(32) = 'McGiffen';

DECLARE @FirstName NVARCHAR(32) = 'Matthew';

SELECT * FROM dbo.EncryptedTable

WHERE LastName = @LastName

AND FirstName = @FirstName;

This time I have no luck. I get the following error message that is telling me in a long-winded manner that I can’t compare against a column with Randomized encryption:

Msg 33299, Level 16, State 2, Line 35

Encryption scheme mismatch for columns/variables ‘@FirstName’, ‘FirstName’. The encryption scheme for the columns/variables is (encryption_type = ‘RANDOMIZED’, encryption_algorithm_name = ‘AEAD_AES_256_CBC_HMAC_SHA_256’, column_encryption_key_name = ‘MyColumnKey’, column_encryption_key_database_name = ‘AlwaysEncryptedTest’) and the expression near line ‘8’ expects it to be (encryption_type = ‘DETERMINISTIC’) (or weaker).

Msg 8180, Level 16, State 1, Procedure sp_describe_parameter_encryption, Line 1 [Batch Start Line 27]

Statement(s) could not be prepared.

An error occurred while executing batch. Error message is: Internal error. The format of the resultset returned by sp_describe_parameter_encryption is invalid. One of the resultsets is missing.

Of course this is exactly what we expected to happen.

What I want to get across is that this all makes perfect sense. If you understand what encryption is doing, then the limitations you face naturally arise from that. Key is understanding that the encryption occurs at the client side, so if you want to do anything in a query that requires the SQL engine to interact with an encrypted column then you are going to be limited. Mostly all it can do is compare two encrypted values and verify they are the same – and only if they are using the Deterministic setting. So, if you’re implementing AE, just think carefully about what columns you need to encrypt and particularly about how you need to interact with those columns.

What’s occurring in the background?

In the last post about certificates I mentioned the extra steps that AE has to go through when querying your data. Now we’ve got a table to play with, let’s look at a Profiler trace so we can see exactly what that involves in terms of interactions with SQL. The actions that occur within the client side libraries are a black box to us, but there is still some interesting stuff to be observed at the SQL end.

I’ll just set Profiler going using a default trace and insert another record into my table.

What we see when we do this is that there is an extra call to the database before the query is executed:

EXEC sp_describe_parameter_encryption

N'

DECLARE @LastName AS NVARCHAR (32) = @pce03080ba9844c4182a38c82216e4ad1;

DECLARE @FirstName AS NVARCHAR (32) = @pba60ad612001418aac8496e355e03892;

INSERT INTO dbo.EncryptedTable (LastName, FirstName)

VALUES (@LastName, @FirstName);

,N’@pce03080ba9844c4182a38c82216e4ad1 nvarchar(32),@pba60ad612001418aac8496e355e03892 nvarchar(32)’

This call is the client (SSMS) asking the server for details about any encryption that might be involved in the query. You’ll notice that SSMS has parameterized the query, replacing the literal values with dynamically named parameters.

The stored procedure returns two result-sets. The first contains a row for each Encryption Key involved in the query. In this case that’s just one, thus one record, so I’ve transposed it to make it easier to read:

AE_Desc_Param1

You can see this is sending back to client the encrypted value of the Column Encryption Key – and also telling it where to find the certificate of the local machine to decrypt that key.

AE_Desc_Param2

The second set describes each parameter, which of the identified Keys (from the first set) it is using, and also tells us whether it is using Deterministic or Randomized encryption (this is the “column_encryption_type” value).

Armed with that information, the client can then perform any encryption required and execute the actual query:

EXEC sp_executesql N'DECLARE @LastName AS NVARCHAR (32) = @pce03080ba9844c4182a38c82216e4ad1;

DECLARE @FirstName AS NVARCHAR (32) = @pba60ad612001418aac8496e355e03892;

INSERT INTO dbo.EncryptedTable (LastName, FirstName)

VALUES (@LastName, @FirstName);

',N'@pce03080ba9844c4182a38c82216e4ad1 nvarchar(32),@pba60ad612001418aac8496e355e03892 nvarchar(32)'

,@pce03080ba9844c4182a38c82216e4ad1=0x01F2CD73FDD15216E2D4DC89EB6DC046EBE3FFC70A967BB6A3F822B57EA840F9D60410825455391AAFE7DF161CE78F4C3D4DE65ED8FC2435115C4E0F81E49AD820

,@pba60ad612001418aac8496e355e03892=0x01C4FBE035F600CBF61B2589028863DF732973166D9752B29CBBF1C7814DF6E8BDAD0D95897D79E28884E103F350506B5465A0914AD22EE8C4BB6630FF02F39DB4

You can see that the parameterized query has now been executed via a call to sp_executesql. You can also see that those parameters have been set to the long binary values which represent the encrypted versions of the data I specified – which in this case was “Smith” and “John”.

I covered this briefly in the last post, but it’s good to understand that the extra step before executing the actual query has to happen even when there is no encryption involved as there is no way for the client to know which columns are encrypted and which are not. As such, once you are working with AE and have the Column Encryption Setting enabled for your connections, there will be some overhead on pretty much all your queries. Important to note though – if you are doing this via .NET code then all this extra process just happens automatically for you in background. From your point of view, you just code as normal.

Encrypting Existing Data

This post wouldn’t be complete unless I touched on this subject at least briefly. So that’s what I’ll do. Touch on it briefly.

So, how do you change an existing column containing data, so that it is encrypted?

If you’ve got this far then you can probably figure out for yourself that it’s not going to be as simple as an ALTER TABLE statement. Encryption is a client side function, so if you want to encrypt some data then that data’s got to go via a client.

There are various ways you can do that. You could hand crank something and do it that way. Generally your best bet is going to be a process that involves creating a new table with the encryption set up as you want it, copying over data from the old table into the new, than dropping the old table and renaming the new one. So no, this is not going to be a particularly online activity.

It can be a fairly straightforward one though as Microsoft has provided the tool to do that for you via SSIS using the SQL Server Import Export Wizard.

Here’s a nice MSDN post that takes you through that process

https://blogs.msdn.microsoft.com/sqlsecurity/2015/07/28/encrypting-existing-data-with-always-encrypted/

Statistics Parser

In the last post we looked at using the STATISTICS IO and STATISTICS TIME commands to measure query performance.

If you’ve started using these, you may notice that once you start to troubleshoot longer scripts or stored procedures that reference lots of tables or have multiple queries, you start getting lots of output. Sometimes it can sprawl across several screens worth. That can make it a a bit of a pain to total up the numbers, or to quickly see if a change you have made has had a positive effect.

Here’s the STATISTICS output from a troublesome query I was looking at recently:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 140 ms, elapsed time = 140 ms.

SQL Server Execution Times:
CPU time = 0 ms,  elapsed time = 0 ms.
(188 row(s) affected)

Table ‘ParameterValue’. Scan count 14, logical reads 2350, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘ParameterDefinition’. Scan count 8, logical reads 456, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 8, logical reads 105600, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘ParameterDataType’. Scan count 0, logical reads 376, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 2371 ms,  elapsed time = 655 ms.

SQL Server Execution Times:
CPU time = 2511 ms,  elapsed time = 797 ms.

Now that’s not too bad, and it’s fairly simple for me to add up the numbers of reads. But you also can find yourself copying and pasting the output from several iterations of changes into notepad documents or query windows so you can compare them later, and it can become a bit of a pain to keep track of.

That’s what I used to do, until I discovered the STATISTICS parser tool.

This a free web-based tool to help you keep track of your STATISTICS output, which will also format the results nicely for you and produce totals. It’s very simple and easy to use:

http://statisticsparser.com/
StatisticsParser1
You simply paste the STATISTICS output into the big box and hit the “Parse” button. You can also enter a Page Title that will become the name of the Tab in your browser to make it easier to keep track of different sets of output.

The parser will turn the set of output for each query into an easy to read table and then will also provide a Totals table that shows you (not surprisingly) the totals across all your queries. Here’s the output above formatted via the tool (I am just showing the final totals):

StatisticsParser2

Personally I find that significantly easier to read!

Watch out though, the total CPU time can be double-counted. This is because the STATISTICS TIME output can sometimes include a total itself. If I look at just the TIME sections from the output above I have:

SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 140 ms, elapsed time = 140 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 2371 ms, elapsed time = 655 ms.

SQL Server Execution Times:
CPU time = 2511 ms, elapsed time = 797 ms.

If you look carefully at that you’ll see that the last set of entries (CPU time = 2511ms, elapsed time = 797ms) are totals of the other numbers. If you then look at the Statistics Parser output above you’ll see that it has added those to the previous outputs to give a total CPU time of 5.022 seconds (5,022ms).

Still we’re mainly interested in getting the IO down so I ignore that for now. The main thing I’m interested in is the total number of reads:
108,782

I make my changes to the stored procedure, execute it and get the STATISTICS output again:

SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:

CPU time = 16 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 1 ms.

(189 row(s) affected)

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

(188 row(s) affected)
Table ‘Workfile’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘ParameterDefinition’. Scan count 2, logical reads 396, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘ParameterValue’. Scan count 3, logical reads 708, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘#ParameterNames_____________________________________________________________________________________________________0000000000CB’. Scan count 2, logical reads 718, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table ‘ParameterDataType’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 78 ms, elapsed time = 80 ms.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 0 ms.

SQL Server Execution Times:
CPU time = 94 ms, elapsed time = 83 ms.

You can see my output is getting longer and starting to become more challenging to read – I broke the original query down into a few simpler steps, but each produces its own output. No worries, I open up a new tab in my browser and go to a second instance of the Statistics Parser. I give this one the Page Title “After”. I paste in my output and I hit parse:

StatisticsParser3

You can now see I have a “Before” and an “After” tab so it’s easy to flick back and forth if I’ve forgotten my totals.

I can now easily see that my total number of reads – previously 108,782 – is down to 1,824. That’s about 50 times better. The CPU time is down by a factor of about 25, not as good as the reads, but I call that good enough and proclaim that tuning task complete!