Row-count Estimates when there are no Statistics

I was discussing Cardinality Estimation with a colleague recently and the question came up: what cardinality estimate does SQL Server use if you’re selecting from a column where there are no statistics available? I’ve discovered there are a few algorithms in play depending on how you’re querying the table. In this post we’ll look at the case where we have a predicate looking for a fixed value.

(If you want the short answer, it’s the fourth root of n cubed before SQL 2014 and the square root of n after that.)

This scenario can occur if you have AUTO CREATE STATISTICS turned off for your database – which we don’t recommend, but which you might choose to do anyway – and you query a table with a predicate against a column that has no index defined against it.
Let’s look at an example. I’m using SQL Server 2012 for this investigation, and to start with I’m querying the AdventureWorks2012 database.
I’ve taken the following preparatory steps (there’s a rough T-SQL sketch of these below the list):

  • Set AUTO CREATE STATISTICS OFF for the database
  • Removed the index on the LastName column of the Person.Person table
  • Removed any ad-hoc statistics that existed against the table
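
If you want to reproduce that setup, here’s roughly how those steps look in T-SQL. The index name is what I believe the LastName index is called in AdventureWorks2012 – check sys.indexes and sys.stats against the table for the actual names in your copy:

--Turn off automatic statistics creation for the database
ALTER DATABASE AdventureWorks2012 SET AUTO_CREATE_STATISTICS OFF;

--Drop the index that covers LastName (check sys.indexes for the exact name)
DROP INDEX IX_Person_LastName_FirstName_MiddleName ON Person.Person;

--Drop any auto-created statistics against the column; they'll have names starting _WA_Sys
--e.g. DROP STATISTICS Person.Person.[statistics name here];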

Then I run a simple query, with the Actual Execution Plan turned on:

SELECT *
FROM Person.Person p
WHERE p.LastName = 'Fox';

I only get one result out as there is only one Fox (Ms. Dorothy J.). Let’s look at the execution plan:

RowCountNoStatistics1

A clustered index scan, as we might expect since I’ve removed any useful indexes from the table. You’ll notice there is a warning. If we view the tooltip you’ll see SQL warns us about the lack of statistics:

RowCountNoStatistics2

If we look at the estimated and actual row-counts we’ll see how that has affected us:

RowCountNoStatistics3

In the absence of any useful information – it knows the number of rows in the table but that is about it – SQL has estimated that there will be 1680 Foxes in the table. A bit of playing shows that we get the same estimate whatever value we search for.

If I turn AUTO CREATE STATISTICS on and run the query again then SQL generates a Statistics object against the LastName column and comes up with an estimate of 2.3 rows – which is a lot closer.

This matters a lot once you start running more complicated queries. The incorrect estimate is likely to affect the choice of plan that the optimizer makes, and may also affect the amount of memory it requests in order to run the query. Let’s look at a quick example of how the plan changes if we join the above query to another table.

First, without statistics (so I have to turn AUTO CREATE off again, and remove the statistics that got created):

SET STATISTICS IO ON;

SELECT e.EmailAddress
FROM Person.Person p
INNER JOIN Person.EmailAddress e
    ON p.BusinessEntityID = e.BusinessEntityID
WHERE LastName = 'Fox';

Here’s the execution plan:

RowCountNoStatistics4

You can see I’ve got a Merge Join as SQL thinks I’m expecting 1680 rows from the top table. A Nested Loops join would generally be better when I only expect one or two rows from that table.

I’ve also captured the IO so I can see how expensive the query was:

Table ‘EmailAddress’. Scan count 1, logical reads 10, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 3818, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Let’s look at the behaviour of the same query with statistics creation enabled:

RowCountNoStatistics5

You can see we now have the desired Nested Loops join and the Clustered Index Scan on the EmailAddress table has been changed to a Seek.

The IO output is below:

Table ‘EmailAddress’. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table ‘Person’. Scan count 1, logical reads 3818, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

There’s not much difference in the overall IO, but you can see the Reads for the EmailAddress table have dropped from 10 to 2 due to the change from the Scan to the Seek. If the table was a lot bigger then we could see a large difference here.

So where does that estimate come from?

I thought I’d have a play and see if I could work out how SQL decided on that estimate of 1680 rows. I did some googling and found a suggestion that it might be a straight 9% of the total number of rows in the table, but that doesn’t quite add up, and when I compared the same query pattern against a few tables I found I got a different ratio depending on the number of rows in the table.

So I pumped rows incrementally into a fresh table and looked at the estimate and what the ratio was as the number of rows increased. Here’s my SQL for that:

--Create a Horrible Heap for my testing
CREATE TABLE dbo.TestStats(TestVal INT NOT NULL, TestText VARCHAR(255) NULL);

--Insert a bunch of rows using dodgy old-style cross joins
INSERT INTO dbo.TestStats
SELECT TOP 1 --Amount of rows I'm going to insert (I changed this value each run)
    1, 'blah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d, sys.objects e;

--Clear the plan cache so SQL generates a new estimate
DBCC FREEPROCCACHE;

--Query the table, then I go check the execution plan generated
--for the estimated number of rows
SELECT * FROM dbo.TestStats
WHERE TestVal = 1;

(One thing to note was that I got the same answers whether I was querying the text column or the integer column – SQL seems to use the same algorithm for both.)

I started to notice a pattern quite quickly: the ratio halved each time the number of rows went up by a factor of 16. I then restarted my test, targeting my row-counts so that the estimated number of rows would be a nice round number. You can see that in the table below:

RowCountNoStatistics6

I then attempted to work out a formula for that. Rather than take you through the shoddy process of mathematics that led me to an answer, I’ll just tell you that the formula came out as:

RowCountNoStatistics6b

Where e is the estimated number of rows for a given predicate value, and n is the total number of rows in the table. I checked that against the full set of results I’d gathered and it held true across all values of n I’d tested.

Finally, to check it against my original query – the Person.Person table had 19,972 rows. I put that through the calculator with the formula and got 1680.027. If we look back at the original estimate you’ll see that SQL stated 1680.03 – so that is all good.
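
If you’d rather let SQL do the arithmetic for you, this is just the same formula expressed in T-SQL (purely a sanity check – it’s nothing the optimizer actually exposes):

--Legacy CE estimate for an equality predicate with no statistics: n to the power 3/4
SELECT POWER(CAST(19972 AS FLOAT), 0.75) AS EstimatedRows; --returns roughly 1680.03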

As I mentioned earlier I was using SQL Server 2012 for this test, and a new Cardinality Estimator came into effect in SQL 2014. So I thought I’d run the test again with SQL 2016 and see if the results changed:

RowCountNoStatistics7

We can see the estimated rows drop off a lot quicker here. Clearly Microsoft has decided to lower the estimate – in fact it is now just the square root of the total number of rows.

RowCountNoStatistics8
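
Running the same Person.Person query under the new Cardinality Estimator should therefore give an estimate of around 141 rows – again easy to sanity check:

--New CE (SQL 2014 onward) estimate with no statistics: the square root of n
SELECT SQRT(CAST(19972 AS FLOAT)) AS EstimatedRows; --returns roughly 141.3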

Hopefully you’re not in the scenario where you regularly have queries running without the appropriate statistics to support them. The above comparison though shows us that if you have such a query its behaviour could dramatically change if you upgrade your SQL Server version to 2014 or higher. It could become better or it could become a lot worse.

There are a lot of changes like this that came in with the new version of the Cardinality Estimator in 2014. Places where underlying assumptions have been adjusted to make better guesses about the number of rows that will be returned by an operator. But they are still guesses based on the same information – there is no new data being captured in the Statistics to better inform the process. Of course Microsoft has made these changes to try and better model data out in the wild – but they are still fixed assumptions, which means sometimes they will be better and sometimes they will be worse.

One thing I should re-iterate is that these formulae we’ve discovered above are for a fairly specific querying pattern. There’s no guarantee that the calculation will be the same for a similar – but different – query. It might be interesting to explore that further in a later post.

Also there may be other information in your database – such as constraints – that SQL can use to educate its guesses.

The main takeaway from all of this though, should of course be:

MAKE SURE AUTO CREATE STATISTICS IS TURNED ON FOR YOUR DATABASES!

Statistics and Cardinality Estimation

There is an updated version of this post here.

Cardinality

This is a term originally from Mathematics, generally defined as “The number of objects in a given set or grouping”. In SQL we’re continually dealing with sets, so this becomes a very relevant topic – and in our context it simply means the “number of rows”.

When you have a query across multiple tables there are many ways in which SQL Server could decide to physically go about getting you the results you want. It could query and join the tables in any order and it could use different methods for matching records from the various tables you have joined together. It also needs to know how much memory to allocate for the operation – and to do that it needs to have an idea of the amount of data generated at each stage of processing.

A lot of this requires cardinality estimation, and SQL Server uses something called Statistics objects to perform that calculation.

Let’s look at a simple example:
SELECT *
FROM Person.Person p
INNER JOIN Person.[Address] a
ON p.AddressId = a.AddressId
WHERE p.LastName = 'Smith'
AND a.City = 'Bristol'

When it comes to gathering the results for this query there are a number of ways the database engine could go about it. For instance:

a) It could find all the records in the Person table with a LastName of Smith, look each of their addresses up and return only the ones who live in Bristol.
b) It could find all the Addresses in Bristol, look up the people associated with each address and return only the ones called Smith.
c) It could grab the set of people called Smith from the People table, grab all the addresses in Bristol, and only finally match up the records between those two sets.

Which of those operations is going to be most efficient depends very much on the number of records returned from each table. Let’s say that we have a million people called Smith, but there’s only one address in our whole database that is in Bristol (and let’s say that address does actually belong to someone called Smith).

In the first method above I would grab the million Smiths and then look their address up one by one in the address table until I found the one that lived in Bristol.

If I used method b) though, I would find the one matching address first, and then I would simply look up the owner of that address. Clearly in this rather contrived example, that’s going to be a lot quicker. So if SQL knows ahead of time roughly how many records to expect from each part of the query, hopefully it can make a good decision about how to get the data.

But how can it work out how many rows will be returned without actually running the query?

Statistics

That’s where statistics objects come in. SQL Server maintains in the background data that equates to a histogram showing the distribution of the data in certain columns within a table. It does this any time you create an index – statistics will be generated on the columns the index is defined against – but it also does it any time it determines that it would be useful. So if SQL encounters a WHERE clause on Person.LastName – and that column isn’t involved in a useful index – SQL is likely to generate a statistics object to tell it about the distribution of data in that column.

I say “likely to” because it actually depends on the settings of your SQL instance. Server configuration is beyond the scope of this post but suffice to say you can let SQL automatically create Statistics objects – or not. You can let it automatically update them when the data has changed by more than a given threshold – or not. And you can specify whether updates to statistics should happen asynchronously or synchronously – i.e. in the latter case, if your query determines that statistics need updating then it will kick that off and wait until the update is complete before processing the query.

It’s generally recommended that auto creation and updating is on, and async updating is off.
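
For reference, those are all database-level options. A sketch of that commonly recommended configuration – using a made-up database name – would be:

ALTER DATABASE MyDatabase SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE MyDatabase SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE MyDatabase SET AUTO_UPDATE_STATISTICS_ASYNC OFF;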

Viewing Statistics Objects
Let’s have a look at some actual statistics and see what they hold. There are a couple of ways of doing this, the first is through SSMS. If you look under a table in the object browser you will see a Statistics folder which holds any statistics objects relating to that table:

Statistics1

In the above example you can see some that have friendly names. These are Statistics that are related to an actual index that has been defined on the table, and they have the same name as the Index – e.g. IX_Address_StateProvinceId.

You’ll also see we have some prefixed with _WA_Sys and followed by what look like random numbers. These are statistics objects that SQL has created automatically on columns that aren’t indexed – or at least weren’t indexed at the time the Statistics objects were created.

You can open these up with a double-click and see what’s inside:

Statistics2

This is the General tab that you open up to. You’ll see it tells you what table the Statistics are for and what column(s) they cover. There are options for adding columns and changing the order – but you never really need to do this – as well as information to tell you when the statistics were last updated, and a check box if you want to update them now.

In the details tab there’s a lot more info:

Statistics3

I don’t find this the easiest display format to work with though, so rather than delving into what everything means here let’s look at the other way you can view statistics, which is by running the following command:

DBCC SHOW_STATISTICS('Person.Address', '_WA_Sys_00000004_164452B1')

The format is straightforward, you just specify the table you are interested in the Statistics for, and the actual name of the Statistics object you want. You can see the same information as if you double-clicked on it, but the results are output in the results pane like any other query and are (I think) a lot easier to read. Allegedly there will soon be a third way in SQL Server to view Statistics as DBCC commands are considered a bit “clunky” – but we don’t know what that will look like yet.
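
As an aside, if you’re not sure what statistics objects exist against a table – or what they’re called – you can list them from the system views (sys.stats and sys.stats_columns) and then feed the name into DBCC SHOW_STATISTICS:

SELECT s.name AS StatisticsName, c.name AS ColumnName, s.auto_created
FROM sys.stats s
INNER JOIN sys.stats_columns sc
    ON sc.object_id = s.object_id AND sc.stats_id = s.stats_id
INNER JOIN sys.columns c
    ON c.object_id = sc.object_id AND c.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID('Person.Address');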

DBCC SHOW_STATISTICS outputs three resultsets:

Statistics4

This post is just an introduction to statistics – and generally you don’t need to know that much, it’s just handy to understand the basics. So let’s just run over the key bits of information you can see above:

First of all, in the first recordset – otherwise known as the…

Stats Header

Rows – is the number of rows in your table

Rows Sampled – this is how many rows were sampled to generate the statistics. SQL can generate or update statistics using sampling rather than reading all the rows. In this case you’ll see it did actually read the whole table.

Steps – If you imagine the statistics as a bar chart – this is the number of bars on the chart. Statistics objects have a maximum of 200 steps so if you have more distinct values in your column than that they will be grouped into steps.

Density – This is supposed to be the probability of a row having a particular value (calculated as 1 / Number of Distinct values in the column). According to Books Online, “This Density value is not used by the query optimizer and is displayed for backward compatibility with versions before SQL Server 2008.” I am using SQL 2012, and this number is just plain incorrect so don’t use it…

Recordset Number 2: The Density Vector

All Density – this is the accurate version of the Density statistic described above. So your probability of a given row having a specific value is about 0.0017. That’s a little less than one in 500. I happen to know there are 575 different Cities in the table so that makes sense (there’s a quick way to verify that below). Sometimes SQL will use this value to form a plan – if it knows you’re going to search this table for a specific City and it doesn’t know that City when it makes the plan, then it could guess that about 1/500th of the rows will match your criteria.

Average Length – Is what it says on the can. The average length of data in this column.

Columns – The names of any columns measured in this statistics object. You can have statistics across multiple columns but I’m not going to cover that in this post. In this case it tells us these statistics are based on the “City” column.
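
To verify that All Density figure by hand, you can just count the distinct values in the column:

--All Density is 1 / the number of distinct values in the column
SELECT 1.0 / COUNT(DISTINCT City) AS AllDensity --575 distinct cities gives roughly 0.0017
FROM Person.[Address];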

Recordset Number 3: The Histogram

This last recordset shows the distribution of the data, and is what you could effectively use to draw a graph of the relative frequencies of different groups of values. Each row represents a step – or bar on the bar chart – and as mentioned above there can be a maximum of 200 steps. So you can see, a statistics object is quite lightweight, even for a massive table.

RANGE_HI_KEY – The upper limit of each step; each step contains all the values bigger than the RANGE_HI_KEY of the previous step, right up to and including this value.

RANGE_ROWS – This is how many rows in the table fall in this range – not including the number that match the RANGE_HI_KEY itself.

EQ_ROWS – The number of rows equal to the HI_KEY

DISTINCT_RANGE_ROWS – The number of different values in the range that there is data for (excluding the HI_KEY).

AVG_RANGE_ROWS – The average number of rows for a given value within the range.

That’s a whistle-stop tour of the Statistics SQL Server holds on your data.

 

The algorithms that SQL then uses to calculate the number of rows for a given part of your query are pretty transparent when there’s just one column involved. If we look at the above example and say you want to look up the rows where the City is “Abingdon” – the statistics tell us there is 1 matching row, and that’s the figure SQL will use for cardinality estimation. Where a value falls within a range, it will use a calculation based on the AVG_RANGE_ROWS.
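
For instance, assuming “Abingdon” appears as a RANGE_HI_KEY in the histogram above, the estimated rows for the query below should simply be the corresponding EQ_ROWS value:

--With a literal value the optimizer can look straight in the histogram
SELECT * FROM Person.[Address]
WHERE City = 'Abingdon';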

When there are multiple columns involved it’s more complicated – there are various algorithms and assumptions that come into play. If you’re interested in digging deeper, one very good resource is the whitepaper on the 2014 Cardinality Estimator written by Joe Sack: https://www.sqlskills.com/blogs/joe/optimizing-your-query-plans-with-the-sql-server-2014-cardinality-estimator/

Conclusions

The main takeaway from this should just be to understand the extent – and limitations – of the information about the distribution of your data that SQL holds in the background.

If, when you’re tuning queries, you notice that the estimated row counts don’t match the actual, then this could be encouraging SQL to form a bad plan for the query. In these cases you might want to investigate what’s going on with the statistics.

Maybe your query is written in a way that it can’t use statistics effectively. One example of this is where you store constant values in variables, then query using that variable in a WHERE clause – SQL will then optimise based on the average, rather than on your actual value.
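
Here’s a sketch of that pattern. When the batch is compiled SQL doesn’t know what @City will hold, so it falls back on the density (average) information rather than the histogram:

DECLARE @City NVARCHAR(30) = 'Abingdon';

--Estimated as (All Density * number of rows) rather than from the histogram
SELECT * FROM Person.[Address]
WHERE City = @City;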

Maybe the plan is based on one data value that has a very different cardinality to the one currently being queried. For instance when you first run a stored procedure, the plan is formed based on the parameters passed. Those parameters could have a cardinality that is quite different to those used in later executions.

Maybe the statistics are out of date and need refreshing. Stats get updated when approximately 20% of the data in the table changes; for a large table this can be a big threshold, so current statistics may not always hold good information about the data you are querying for.

Or maybe SQL is doing as good a job as it can with the information it has at its disposal. You might need to work out how you can give it a little extra help.

This is only intended to be an introduction to Statistics objects. If you want to find out more, your favourite search engine is your friend. For the moment, I hope this makes it clear what Statistics objects are, what they are used for, and why they are important.

What’s an “Index” and how do they work?

What is an Index?

We often hear indexes explained using the analogy of an index in the back of a book. You want to find the information about “rabbits” for instance – and so you look that up in the back and find the list of pages that talk about rabbits.

That’s kind of a handy description but I find it’s not the best for understanding what’s really going on. I often describe an index as a subset or different representation of the data in your table, ordered to be of particular benefit to querying the data in a certain way. Maybe you want to query the data by Created Date – let’s have an index then in the order of that date to make it easy to find stuff.

But important to understand is…

Your table is itself an index
At least I hope so. When you create a table, hopefully you define a primary key. When you do so SQL Server will automatically define what’s called a Clustered Index based on whatever you’ve specified as the primary key. The Clustered Index (and you can only have one on a table for obvious reasons) defines the order that the data in the table will be stored on disk. The data itself is stored in a structure that is the same as any other index you define on the table (we call these non-clustered indexes) just with a few differences in exactly what is stored.
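
For example, a table definition like the sketch below (the names are made up purely for illustration) gets a clustered index on Id automatically because of the primary key:

CREATE TABLE dbo.People
(
    Id INT IDENTITY(1,1) NOT NULL,
    PersonName NVARCHAR(100) NOT NULL,
    CreatedDate DATETIME2 NOT NULL,
    CONSTRAINT PK_People PRIMARY KEY CLUSTERED (Id)
);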

So there is no separate “table” stored on the disk in some manner, there are just indexes – the Clustered one and any non-Clustered indexes you create.

I say I hope though – because you might have decided not to define a primary key, or have neglected to. Or you might have deliberately stopped SQL from creating a Clustered index. Then you are saying you don’t want to specify in what order the data is stored – “Just chuck it down there, I don’t care!” – and then you have what is known as a Heap. A Heap is just a pile of data in no specified order, and whenever you want to read from it you’re going to have to scan the whole thing to find whatever you want.

We don’t like Heaps!

You might think “well, that sounds like a quick way of writing data to disk, and I’m hardly ever going to read the data from this table but I write to it a lot, so a heap sounds great”.

True, there are some edge cases where a heap is appropriate, but they really are edge cases. If you ever think a heap is the way to go, then set up a scalability test to prove it achieves what you want.

In general when you’re creating a new table, you want to think carefully about what the primary key and clustered index should be (maybe they should be the same, but maybe they shouldn’t), based on how the data is going to be written and read.

What does an index look like?
So you’ve got a table, and its physical representation on disk is a clustered index. The picture below shows how that actually looks in logical terms:

Clustered_Index

This is something called a B-tree; the B is generally agreed to stand for “balanced”, the idea being that you have to traverse the same number of steps to find any piece of data. I don’t want to go into this in ultimate depth, but let’s cover what’s useful.

We have a Root Node and an Intermediate Level – these (like all data in SQL Server) are made up of 8KB pages. At the root and intermediate levels these just contain the Clustered Key ranges (as noted above this defaults to the Primary Key you defined) and pointers to where to find those Key values on the next level down. Then we have the leaf level, which contains the data itself: 8KB pages containing the complete rows of data. So if we’re looking for a particular record, e.g. Id = 12, we can route quickly through the top levels to find exactly the page we want. This is much quicker than having to scan all the Leaf pages.

A non-clustered index is basically the same structure:

NonClustered_Index

You’ll see that the Root and Intermediate levels contain both the Index Key and the Clustered key. The Index key is whatever you decided to define an index on – in the above example this might be Created Date, and assuming we’re searching based on that SQL will use this index to quickly find the leaf page containing the data we want.

The main difference is at the leaf level. Here we don’t hold the complete row of data. We hold the Index key and the clustered key. If your query just wants to grab the clustered key value (often the Id) then that is fine. If it wants additional data then it can use that key value to look up the data you require in the Clustered Index. So effectively your query uses this index to gather the Clustered Index value (or values) for the rows it wants and then uses the Clustered index to get any extra data it wants.

You’ll see that can mean we’re actually using two indexes to satisfy a query. That’s why we also have at the leaf level what you’ll see referred to as “Included Columns”. When you define an index you can choose extra columns to be stored in the leaf level pages (using the INCLUDE keyword). Let’s say I’m looking at a table called Users and I want to be able to grab the email address for a user called “mcgiffen”. I can create a non-clustered index on Username and tell SQL to include the column EmailAddress. Then the query will find all the data it needs in the non-clustered index and doesn’t need to refer back to the clustered one.
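
A sketch of that index – assuming a dbo.Users table with Username and EmailAddress columns – would look like this:

CREATE NONCLUSTERED INDEX IX_Users_Username
ON dbo.Users (Username)
INCLUDE (EmailAddress);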

When we create a non-clustered index that will serve all the needs of a particular query in this way (or set of queries) we call that a covering index for that query. A query is always going to be faster if it has a covering index.

But doesn’t that mean I’m writing the same data in multiple places?
You’ll see from the above why I refer to an index as a subset/different representation of the table itself (which hopefully is also an index). And yes – that does mean that when you’re writing to a table with multiple indexes you are actually writing some of the same data to multiple separate structures. I know… that sounds horribly inefficient.

Let’s think about a table with 100,000 rows containing information about people. Let’s say I sometimes access the data by Id value and that is the primary and clustered key. But I also access it based on the person’s name and that’s equally often so I’ve created an index on their name.

I also read from the table a lot more than I write to it, I want to access the data about a person far more often than I want to change it. Sure, when I write to it, I have to write some of the data twice, but if I didn’t have that extra index on their name, then every time I wanted to query by a person’s name I’d have to read all 100,000 rows. So I have to do one more write, but I save all those reads. That’s a pretty good trade off.

It does mean though that you want to be effective with your indexing. You don’t just want to throw every index you can think of at a table, and you want to try and make indexes that can satisfy the needs of multiple queries.

From SQL Server 2008 you can have up to a thousand indexes on a table. Generally though, up to 4 seems reasonable – and if you’ve got more than 15 then that’s raising a red flag.

However there’s no hard and fast answer about how many indexes you should have, it depends on the balance of reads and writes, and it depends on the different types of access you have to the table. In general, it is good to have an index on any column that acts as a Foreign Key, and columns that will be used as the main predicates when querying a table are also good candidates to be indexed.

But be careful when adding a new index, and monitor to make sure it hasn’t actually slowed things down – not just the query you’re working on, but any other existing queries that use the same table, as they may change the way they work too.

SQL Puzzle 1: Magic Squares

This is the first in what I hope will be a semi-regular series of recreational puzzles where SQL can be used to find the answer. I set these puzzles on occasional Fridays in my workplace, and as I now have quite an archive I thought I should start sharing them with a wider audience.

I came up with the idea for this puzzle involving Magic Squares by accident while trying to do something different – and getting it wrong! For a moment I thought I’d had an original idea but then it was “Hang on – I’m sure I must have seen this before.”

A bit of googling and that thought was proved right – I’m just not that original. In fact the first documented instance of these appears to be the Chinese legend of Lo Shu – possibly dating back as far as 650BC. I was only beaten by the best part of three millennia…

You may have seen these before at some point – but hopefully long enough ago that you don’t know all the answers! Possibly like me they ring a bell for you but you can’t remember exactly where from.

A Magic Square is a square grid e.g. 1×1, 2×2, 3×3, 4×4… or n x n, where numbers are placed in each square in the grid such that if you add up the numbers in each column, each row, or along either diagonal, they all add up to the same total.

Usually you use consecutive numbers starting from 1 (placing each number just once), so in the 2 x 2 grid you would place the numbers 1 to 4, in the 3 x 3 the numbers 1 to 9 – and so on.

Here’s the 1 x 1 grid:

MagicSquare1

I thought I’d be generous and give you that one for free 😉

So my first question is can you find a solution (with the numbers 1 to 4) to the 2 x 2 grid:

MagicSquare2

Having explored that possibility (not for too long I hope!) the meat of this puzzle is to find solutions to the 3 x 3 grid using the numbers 1 to 9:

MagicSquare3

In fact, using SQL, can you not just find one solution, but all the possible solutions? How many are there?

Share your attempts and answers in the comments.

Have fun 🙂

Working with Data in Always Encrypted

In this post we’re going to create some encrypted columns in a table in a test database and look at some of the practicalities and limitations of working with Always Encrypted (AE).

There are actually a fair few limitations, but not because there’s anything wrong with the technology – rather they naturally fall out of the logic of working with encrypted data.

Let’s just bash on with it and you’ll see what I mean.

Setting up a table with encrypted columns

I’ve got a database called AlwaysEncryptedTest, and it’s sitting on a remote instance – i.e. not my local machine. I’ve also got the Keys I created in the last post (Understanding Keys and Certificates with Always Encrypted). I’m now going to create a table with a couple of encrypted columns:

CREATE TABLE [dbo].[EncryptedTable](
    Id INT IDENTITY(1,1) CONSTRAINT PK_EncryptedTable PRIMARY KEY CLUSTERED,
    LastName [nvarchar](32) COLLATE Latin1_General_BIN2 ENCRYPTED WITH (
        COLUMN_ENCRYPTION_KEY = MyColumnKey,
        ENCRYPTION_TYPE = Deterministic,
        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
    ) NULL,
    FirstName [nvarchar](32) COLLATE Latin1_General_BIN2 ENCRYPTED WITH (
        COLUMN_ENCRYPTION_KEY = MyColumnKey,
        ENCRYPTION_TYPE = Randomized,
        ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
    ) NULL
);

There’s plenty to discuss there already. You might first of all ask why am I specifying a particular collation for those columns (Latin1_General_BIN2) and overriding my database collation settings. I hate people who do that! The answer is I don’t have a choice, text columns encrypted with AE must have a BIN2 collation. BIN2 collations are (apart from anything else) case-sensitive.

If you think about it, it makes sense that you can’t have case-insensitive comparisons on an encrypted column – and therefore those collations cannot be supported. When you perform a comparison based on encrypted data, what the engine is doing is comparing one encrypted value with another. To enable a case-insensitive comparison there would have to be some deterministic pattern so that you could tell that two different encrypted values differ only by case; that would be more complicated to implement, would weaken the encryption, and isn’t supported by the algorithm used by AE. The requirement to use a BIN2 collation is driven by what will happen in practice when you compare two encrypted values looking for an exact match.

So there’s something to think about. Any searches you do against encrypted text data are going to be case sensitive. So say your application allows you to search for people by name, and you’re looking for “McGiffen” (my name) – then you’re not going to find it if you type “Mcgiffen” or “mcgiffen”. For someone like myself who often sees mis-capitalization of their name – as little as I care – it may be a struggle to find me if my name (as it’s Personal Identifiable Information) has been encrypted.

So how do you deal with this? I don’t know – you’ll have to think about that – maybe maintain an upper case copy for searching against and transform search terms into upper case BEFORE sending them to the database.
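
One way that could look – and this is just a sketch of the idea, not anything prescribed by the feature – is an extra deterministically encrypted column that holds the upper-cased copy, with the application upper-casing both the stored value and the search term before they reach the database (remember SQL itself can’t call UPPER on the encrypted data):

ALTER TABLE dbo.EncryptedTable ADD LastNameSearch [nvarchar](32) COLLATE Latin1_General_BIN2 ENCRYPTED WITH (
    COLUMN_ENCRYPTION_KEY = MyColumnKey,
    ENCRYPTION_TYPE = Deterministic,
    ALGORITHM = 'AEAD_AES_256_CBC_HMAC_SHA_256'
) NULL;

--The application writes UPPER(LastName) into LastNameSearch and searches against
--that column using an upper-cased parameter.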

There are also a whole bunch of other data types that you can’t encrypt which I would imagine are to do with similar practicalities of the encryption implementation. Rather than list those here you can find that set and other limitations in this MSDN article:

https://msdn.microsoft.com/en-GB/library/mt163865.aspx?f=255&MSPPError=-2147217396

You can see in the SQL above I’ve specified the COLUMN_ENCRYPTION_KEY “MyColumnKey”, which I created in the previous blog post. But then for one column I’ve specified an ENCRYPTION_TYPE of “Deterministic” and for the other it is “Randomized”. What’s the difference?

Well, Deterministic means that every time you encrypt a given value with a given key, the encrypted result will be the same. Randomized means it will be different. With deterministic it might be possible for someone to perform statistical analysis based on the distribution of your data to infer what some common values are – with Randomized that’s not going to happen.

However, if your encryption type is Randomized then you can’t search on that column and you certainly can’t join on it (or group or index on it either).

This makes perfect sense if you realise that SQL only knows the encrypted value of your data, so if it is to return results by comparing one encrypted value with another then the values must match for anything to be returned – and for that to happen they must both have been encrypted using deterministic encryption.

Finally, in the column definition I’ve specified the encryption algorithm – the one you see is the only algorithm currently supported by AE, so just use the same one.

Inserting Encrypted Data

Now we’ve got the table let’s try inserting some data.

It should be no big surprise that if I want to run the following query from SSMS it doesn’t work:

INSERT INTO dbo.EncryptedTable (LastName, FirstName)
VALUES ('McGiffen', 'Matthew');

You get a longish error message that basically tells you you’re trying to stick unencrypted data in an encrypted column:

Msg 206, Level 16, State 2, Line 24
Operand type clash: varchar is incompatible with varchar(8000) encrypted with (encryption_type = 'DETERMINISTIC', encryption_algorithm_name = 'AEAD_AES_256_CBC_HMAC_SHA_256', column_encryption_key_name = 'MyColumnKey', column_encryption_key_database_name = 'AlwaysEncryptedTest') collation_name = 'Latin1_General_CI_AS'

To be able to insert data, there’s a few things you need to do.

First of all, this isn’t actually supported in SSMS 2016, so you need to download and install the vNext version, which has the new feature “Parameterization for Always Encrypted”. Prior to this version, if you wanted to test/play (beyond a few basics) with Always Encrypted data you had to write code in C# or another language.

Once you’ve got SSMS vNext, you can connect to your database. You have to enable Column Encryption for your connection – you’ll need to do this in your connection string whatever client you are querying your database from if your query might touch encrypted data – be that your application, SQLCMD or SSMS.

In SSMS you do that when you’re connecting to the instance in the “Connect to Server” dialog. Select “Options”, and go to “Additional Connection Parameters” and enter into the box “Column Encryption Setting = Enabled” – as per the image below:

AE_ConnectToServer

(One thing to note is that SSMS will then try to use this setting for all subsequent connections you make until you go in and remove it. As the setting is only supported on SQL 2016 onward you get an error connecting to older instances, which is a little bit of a pain…)

Now we’ll open a new query window against our database, and then we have to actually enable “Parameterization for Always Encrypted”. Right-click over the query and select “Query Options”, then “Advanced” and enable the option as shown below:

AE_Parametization.png

There’s an MSDN article about this feature here:

https://blogs.msdn.microsoft.com/sqlsecurity/2016/12/13/parameterization-for-always-encrypted-using-ssms-to-insert-into-update-and-filter-by-encrypted-columns/

In simple terms, it allows SSMS to perform additional parameterization on your query and encrypt any literal values that need encrypting before sending them to the database. In previous versions of SSMS this would have just been sent as plain text – and so the engine barks at you as in my error above.

The final step to get this to work is that you have to rewrite your original query to use variables rather than literal values in the insert. This helps SSMS to be able to manage the rest of the parameterization process. I’m going to show you a screen-shot of the updated code as it gets displayed in SSMS as there’s something else to notice:

AE_InsertSQL

See that my variable declarations and assignments are underlined in squiggly blue. If I hover over one of them, then I get this message:

AE_ToolTip

This shows me that the Parameterization is in action and working correctly.

If I run the insert now it works fine. Woo-hoo.

There are some implications of this if we think it through. You are only going to be able to do inserts based on values held in a variable or parameter. So you can only do one row at a time – and no, you can’t use a table variable for this. That means you can’t use value lists in an insert or merge statement, but logically, you probably don’t need to. Remember this is likely to be only sensitive personal information we’re encrypting – not reference data for which we may have scripts with lots of values. As such we would generally be inserting this one row at a time, and ideally through a stored procedure using parameters. So everything should be peachy.
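
So, sketching out what that might look like (the names here are purely illustrative) – a simple stored procedure taking the values as parameters, which a client connecting with the Column Encryption Setting enabled can call, letting the driver encrypt the parameter values on the way in:

CREATE PROCEDURE dbo.InsertEncryptedPerson
    @LastName NVARCHAR(32),
    @FirstName NVARCHAR(32)
AS
BEGIN
    INSERT INTO dbo.EncryptedTable (LastName, FirstName)
    VALUES (@LastName, @FirstName);
END;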

It’s also logical to realise you can’t insert from one table to another unless the values in both tables are encrypted (hopefully using the same keys!). This goes back to the fact that the engine can’t encrypt or decrypt values – that has to happen in the client. The above insert only works because SSMS (our client) is doing all the work.

Let’s just run a select and check that we can see the data:

SELECT * FROM dbo.EncryptedTable

AE_Select1

Cool, this works. Note this will also work in SSMS 2016 as the query is simple enough to not require parameterization.

Let’s just have another look at the same data, but this time I’m going to run the query directly on the server, where the certificate for the Column Master Key doesn’t exist:

AE_Select2

So here you can only see the encrypted version of the data.

Out of interest let’s look at the full value for the LastName field:

0x0180B908F995B86C64511991CDE834DD4888A058D1B482E5E14297FFDA9D3C8E0828E3B80F3C58F98197AC73F5867E2837DD2E6C266FFAA95E38A7B08C111AF7EBEEC476EE6BBDB9704F1AC007CE1F613B

That’s just a little bit longer than “McGiffen” – so you can see there is going to be some storage overhead on your encrypted columns.

I’m just going to chuck a few extra rows into the table (including a duplicate record for me) so I can demonstrate a couple of points mentioned earlier. Here’s what we have when I’m done:

AE_Select3

Now let’s look again at the encrypted version of that:

AE_Select4

Remember in my table definition, I set LastName to be Deterministic Encryption, but FirstName to be randomized. If you look at the encrypted values above we can see some repeated values in the LastName column – even if we didn’t have the Ids to compare with the first recordset it’s pretty straightforward to work out which is “McGiffen” and which is “Rubble”.

Then look at FirstName. We know that records 1 and 2 both have a first name of “Matthew”. If we check the encrypted values in the FirstName column we can see they are both different – this is Randomized encryption in practice.

Querying Encrypted Data With a Predicate

I’m going to go back to my encrypted (and parameterized) connection and try running some queries against those tables:

DECLARE @LastName NVARCHAR(32) = 'McGiffen';

SELECT * FROM dbo.EncryptedTable
WHERE LastName = @LastName;

Results:

AE_Select5

So that worked fine, let’s just try and isolate those first two duplicates at the beginning though:

DECLARE @LastName NVARCHAR(32) = 'McGiffen';
DECLARE @FirstName NVARCHAR(32) = 'Matthew';

SELECT * FROM dbo.EncryptedTable
WHERE LastName = @LastName
AND FirstName = @FirstName;

This time I have no luck. I get the following error message that is telling me in a long-winded manner that I can’t compare against a column with Randomized encryption:

Msg 33299, Level 16, State 2, Line 35
Encryption scheme mismatch for columns/variables '@FirstName', 'FirstName'. The encryption scheme for the columns/variables is (encryption_type = 'RANDOMIZED', encryption_algorithm_name = 'AEAD_AES_256_CBC_HMAC_SHA_256', column_encryption_key_name = 'MyColumnKey', column_encryption_key_database_name = 'AlwaysEncryptedTest') and the expression near line '8' expects it to be (encryption_type = 'DETERMINISTIC') (or weaker).

Msg 8180, Level 16, State 1, Procedure sp_describe_parameter_encryption, Line 1 [Batch Start Line 27]
Statement(s) could not be prepared.

An error occurred while executing batch. Error message is: Internal error. The format of the resultset returned by sp_describe_parameter_encryption is invalid. One of the resultsets is missing.

Of course this is exactly what we expected to happen.

What I want to get across is that this all makes perfect sense. If you understand what encryption is doing, then the limitations you face naturally arise from that. The key is understanding that the encryption occurs at the client side, so if you want to do anything in a query that requires the SQL engine to interact with an encrypted column then you are going to be limited. Mostly, all it can do is compare two encrypted values and verify they are the same – and only if they are using the Deterministic setting. So, if you’re implementing AE, just think carefully about what columns you need to encrypt and particularly about how you need to interact with those columns.

What’s occurring in the background?

In the last post about certificates I mentioned the extra steps that AE has to go through when querying your data. Now we’ve got a table to play with, let’s look at a Profiler trace so we can see exactly what that involves in terms of interactions with SQL. The actions that occur within the client side libraries are a black box to us, but there is still some interesting stuff to be observed at the SQL end.

I’ll just set Profiler going using a default trace and insert another record into my table.

What we see when we do this is that there is an extra call to the database before the query is executed:

EXEC sp_describe_parameter_encryption N'
DECLARE @LastName AS NVARCHAR (32) = @pce03080ba9844c4182a38c82216e4ad1;
DECLARE @FirstName AS NVARCHAR (32) = @pba60ad612001418aac8496e355e03892;
INSERT INTO dbo.EncryptedTable (LastName, FirstName)
VALUES (@LastName, @FirstName);
', N'@pce03080ba9844c4182a38c82216e4ad1 nvarchar(32),@pba60ad612001418aac8496e355e03892 nvarchar(32)'

This call is the client (SSMS) asking the server for details about any encryption that might be involved in the query. You’ll notice that SSMS has parameterized the query, replacing the literal values with dynamically named parameters.

The stored procedure returns two result-sets. The first contains a row for each Encryption Key involved in the query. In this case that’s just one, thus one record, so I’ve transposed it to make it easier to read:

AE_Desc_Param1

You can see this is sending back to the client the encrypted value of the Column Encryption Key – and also telling it where, on the local machine, to find the certificate needed to decrypt that key.

AE_Desc_Param2

The second set describes each parameter, which of the identified Keys (from the first set) it is using, and also tells us whether it is using Deterministic or Randomized encryption (this is the “column_encryption_type” value).

Armed with that information, the client can then perform any encryption required and execute the actual query:

EXEC sp_executesql N'DECLARE @LastName AS NVARCHAR (32) = @pce03080ba9844c4182a38c82216e4ad1;
DECLARE @FirstName AS NVARCHAR (32) = @pba60ad612001418aac8496e355e03892;
INSERT INTO dbo.EncryptedTable (LastName, FirstName)
VALUES (@LastName, @FirstName);
', N'@pce03080ba9844c4182a38c82216e4ad1 nvarchar(32),@pba60ad612001418aac8496e355e03892 nvarchar(32)'
, @pce03080ba9844c4182a38c82216e4ad1 = 0x01F2CD73FDD15216E2D4DC89EB6DC046EBE3FFC70A967BB6A3F822B57EA840F9D60410825455391AAFE7DF161CE78F4C3D4DE65ED8FC2435115C4E0F81E49AD820
, @pba60ad612001418aac8496e355e03892 = 0x01C4FBE035F600CBF61B2589028863DF732973166D9752B29CBBF1C7814DF6E8BDAD0D95897D79E28884E103F350506B5465A0914AD22EE8C4BB6630FF02F39DB4

You can see that the parameterized query has now been executed via a call to sp_executesql. You can also see that those parameters have been set to the long binary values which represent the encrypted versions of the data I specified – which in this case was “Smith” and “John”.

I covered this briefly in the last post, but it’s good to understand that the extra step before executing the actual query has to happen even when there is no encryption involved, as there is no way for the client to know which columns are encrypted and which are not. As such, once you are working with AE and have the Column Encryption Setting enabled for your connections, there will be some overhead on pretty much all your queries. Important to note though – if you are doing this via .NET code then all this extra processing just happens automatically for you in the background. From your point of view, you just code as normal.

Encrypting Existing Data

This post wouldn’t be complete unless I touched on this subject at least briefly. So that’s what I’ll do. Touch on it briefly.

So, how do you change an existing column containing data, so that it is encrypted?

If you’ve got this far then you can probably figure out for yourself that it’s not going to be as simple as an ALTER TABLE statement. Encryption is a client side function, so if you want to encrypt some data then that data’s got to go via a client.

There are various ways you can do that. You could hand-crank something and do it that way. Generally your best bet is going to be a process that involves creating a new table with the encryption set up as you want it, copying over the data from the old table into the new, then dropping the old table and renaming the new one. So no, this is not going to be a particularly online activity.

It can be a fairly straightforward one though as Microsoft has provided the tool to do that for you via SSIS using the SQL Server Import Export Wizard.

Here’s a nice MSDN post that takes you through that process

https://blogs.msdn.microsoft.com/sqlsecurity/2015/07/28/encrypting-existing-data-with-always-encrypted/