What I Love About My Job

December 7, 2023December 7, 2023 ~ Matthew McGiffen ~ 2 Comments

In June this year I took redundancy from the company I’d worked at for the previous 11 years. It was of course an uncertain time while I figured out what I wanted to do, having to consider the needs of my family as well as myself.

It was also a great opportunity. For many years I’d toyed with the idea of striking out on my own as an independent data consultant. I’d run my own popular blog on topics around Data architecture and SQL Server for a long time and had recently had a book published on database encryption.

I felt reasonably well known and respected in the larger data community. With a little money in the bank, I decided to go for it.

I haven’t looked back. Clients were slow coming in initially and marketing myself wasn’t my speciality but since then I’ve worked with a number of clients and every one has been an enjoyable and rewarding experience.

I hadn’t realised how much I missed working with clients directly, achieving satisfaction, building trust and relationships.

All of my clients so far have been small to medium enterprises without the resource to have an in-house data specialist. I’ve loved working with them because it’s where I can make the most difference. The majority have also, by random coincidence, been in the education sector, so it’s been great making a difference to organisations that are making a difference to so many young people lives.

I also love working with people to improve their understanding of the issues they face, and empowering them to be able to own things moving forward, rather than being reliant on myself or other consultants every time they have a problem.

Some might say that’s not the best model in terms of generating myself future work, but I strongly believe it is the best model in terms of building a strong relationship based on trust, and I believe it is the best model in terms of providing a high-quality service. I also love helping people reaching those light-bulb moments where suddenly they “get it”.

We’ve worked together on such things as:

Troubleshooting performance
Assessing scalability
Improving security posture – including encryption
Automation
Server migrations
Implementing DevOps
High-availability and disaster recovery
Optimizing database configuration

And the list goes on.

That’s been my last six months and I’m looking forward to the next 12 in a way I wasn’t as an employed individual.

If you have a problem or project where you feel you could benefit from some Data Architecture or SQL Server expertise, then please get in touch. You can check out the sort of services I offer here:

Services

And you can contact me through Linked In:

https://www.linkedin.com/in/matthew-mcgiffen-a8a6633/

My website:

Contact

Or Twitter / X:

Tweets by MattMcgiffen

Thanks for reading!

T-SQL Tuesday 167 – Encryption and Data Protection Roundup

October 17, 2023October 17, 2023 ~ Matthew McGiffen ~ 2 Comments

Huge thanks to everyone who responded to my invitation to blog on Encryption and Data Protection for this month’s T-SQL Tuesday.

I got what I hoped for, which was a wide(ish) interpretation of the topic. Posts from those who are clearly advocates of encryption, and from those who have scepticism around the encryption approaches that some people take.

We’ve also got discussions about the importance of basic data governance, and the protection of intellectual property.

We’ve got useful advice and warnings of gotchas.

In the order that the posts came in, here are the contributions:

Rod Edwards https://www.sqlrod.com/post/tsql-tuesday-tde-behaviour-and-status-reporting-a-curious-case
Rod demonstrates how the GUI and DMVs can report misleading information about the state of encryption when something has gone wrong, and how to interpret them correctly. He also shows how enhancements in SQL Server 2019 help us out.

Rob Farley https://lobsterpot.com.au/blog/2023/10/10/how-encrypted-is-encrypted/

Rob takes an overview of encryption, but makes the important point that nothing is foolproof. There is always a weak point. He also stresses the importance of having trustworthy staff – particularly those with high levels of access.

Jiri Dolezalek https://www.sqlkoala.com/2023/10/10/how-to-protect-your-intellectual-property-in-sql-server/

Jiri discusses how developers of Software products can protect their intellectual property when releasing their SQL Server code and data to run on customer’s infrastructure where you can’t control who can access code and data. As Jiri states, there is no perfect solution.

Andy Yun (better know to many as SQLBek) https://sqlbek.wordpress.com/2023/10/10/t-sql-tuesday-167-tde-and-data-protection/

Andy takes a hard-swipe at Security Theatre – where we do things to check boxes on security checklist, but achieve little or nothing in the process, I’ve certainly seen a lot of that in my time. He’s also not a fan of TDE! I agree with a lot of what he says, I still think there is a place for TDE – but the scenarios it protects you from are pretty limited. If you really want to protect your data you need to consider a more complete solution with at-rest encryption as just one layer.

Deborah Melkin https://debthedba.wordpress.com/2023/10/10/t-sql-tuesday-167-data-protection/

Deborah talks about the importance of knowing your data and I’d agree this comes above all else in protecting it. If you don’t know what you store or where you store it, how can you manage and protect it effectively? Knowing your data is the first step in good data governance.

Chad Callihan https://callihandata.com/2023/10/10/t-sql-tuesday-167-encryption-and-data-protection/

Chad reminds us of how TDE interacts with tempdb, and the impact that can have on servers where you have a mixture of encrypted and unencrypted databases.

Jeff Mlakar https://www.mlakartechtalk.com/t-sql-tuesday-167-encryption-data-protection/

Jeff talks about how good security is a layered approach and how the human factor is the biggest threat. He gives a good approach to protection which is to deal with the “low hanging fruit” and move up from there.

Rob Douglas http://gumption.co.nz/2023/10/demystifying-encryption/

Rob discusses how encryption can seem complex and daunting, but how in reality it is now easier than ever to encrypt your data. He refers to a talk he gave where he was able to demo all the main forms of encryption with practical examples and code in under an hour.

Thanks again everyone!

T-SQL Tuesday: Encryption and Data Protection

September 29, 2023September 29, 2023 ~ Matthew McGiffen ~ 15 Comments

I have the honour of hosting this month’s T-SQL Tuesday. In case there’s any doubt that’s for October 2023, I’m posting this invitation a little early as I got my dates clashed and I’ll be on holiday in Majorca next week!

For those of you who don’t know, T-SQL Tuesday is a monthly blog party where pople are invited to blog on a chosen subject related to SQL Server. You’re invited to join in whether you’re an experienced blogger, or have never blogged before. It’s a nice way to get started writing with a ready audience who’ll view your post.

Any of you who know me will know I have a keen interest in SQL Server Encryption and Data Protection so that’s the topic for this month’s blog party.

Any contributions relating to that topic are most welcome, and treat it as you see fit. Maybe you have a technical detail or discovery you want to share. Maybe you have a useful “how to”. Equally welcome are stories about encryption gone wrong – or incidents where data wasn’t protected sufficiently, and bad things happened. Or stories with a happy ending. It’s a deliberately wide topic, and I’d love a wide variety of contributions.

Whatever it is that comes to mind, I’d love your input and to hear your stories. I’ll review all the posts and aim to produce a roundup by the following Tuesday.

The usual rules apply:

Publish your post on Tuesday 10th October, or close enough that it can be vaguely jusified as happening on that date in a time zone somewhere.
Include the T-SQL Tuesday logo at the top of your post and link it back to this invitation. I should get a pingback so I know you’ve posted – but it’s worthwhile sharing me the link in the comments to this post too so I definitely don’t miss it.
If you’re on Twitter (I refuse to call it X), tweet your post using #tsql2sday
Have fun!

Manually updating Statistics in SQL Server

September 6, 2023September 6, 2023 ~ Matthew McGiffen ~ 1 Comment

Having up to date statistics is vital for getting the best performance out of your queries. Even though SQL Server automatically updates statistics in the background for you (When do statistics get updated?), you may find there are times when you want to manage updating them yourself.

You may have large tables and find that the interval between the automatic updates is too big and is resulting in sub-optimal query plans.
You might need timely updates to a specific object – maybe as part of an ETL process to make sure that statistics are up to date after a particular part of the process, perhaps after a daily load into a table.
You may find that the automatic updates look at too small a sample size and you need to scan more of the table to enable accurate estimates for your queries.

My previous post on the Ascending Key problem demonstrated a common issue where the first of these scenarios could be affecting you. We’ll look at examples of the other scenarios in subsequent posts.

For now though, let’s just look at how you go about updating statistics.

UPDATE STATISTICS

At the heart of all the methods we’ll look at is the UPDATE STATISTICS command. There are a lot of options for using this command, but we’ll just focus on the ones you’re most likely to use. For full documentation here is the official reference:
https://docs.microsoft.com/en-us/sql/t-sql/statements/update-statistics-transact-sql

You can use this at the table level in which case all statistics objects belonging to that table will be updated:

UPDATE STATISTICS dbo.Test;

Or you can specify the specific statistics object you want to update:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2;

FULL SCAN or SAMPLE

The other thing you may be likely to want to specify is whether the statistics should be updated using a full scan of the table, or just be looking at a sample of the rows. In the above examples we didn’t specify this so SQL Server will decide for us. In general sampling (as opposed to full scans) kicks in when we have about 8MB of data in the table (or about 1000 pages).

If you want to specify a full scan the syntax is as follows:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH FULLSCAN;

If you want the statistics update to use sampling (more on how this works in subsequent posts) then you can choose to specify a percentage of the total table to be sampled:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH SAMPLE 10 PERCENT;

Or you can specify a fixed number of rows:

UPDATE STATISTICS dbo.Test _WA_Sys_00000002_3AD6B8E2 WITH SAMPLE 10000 ROWS;

You might want to use a sample as once your tables get large full scans can take a little time. Equally though if you’re updating statistics in a quieter time (e.g. overnight) you may feel you can afford the extra time for the full scans. Here’s some comparison figures I produced on my local machine showing how long full scans take. Obviously this will change depending on your hardware and other factors including how much of the table is already in memory:

You can see however that the time taken pretty much scales linearly as the table size increases.

One thing to be aware of is parallelism. A full scan can run as a parallel operation on your server which can speed it up considerably. When you update statistics using a sample however this can only run single-threaded unless you’re on SQL Server 2016 (or higher). Sampling data to build statistics in parallel was one of the many excellent little improvements in 2016.

sp_UpdateStats

This is another method you might use for manually updating statistics (perhaps as part of a scheduled maintenance job). This system stored procedure can be used for updating all of the statistics objects in a database:

USE Test;
EXEC sp_UpdateStats;

This stored procedure iterates through your database using a WHILE loop and executes the UPDATE STATISTICS command as it goes. One nifty thing about using this procedure is that it only updates statistics objects where rows have changed, so you don’t have any overhead for refreshing statistics where the underlying data hasn’t been modified. You can see this from this extract of the output of the stored procedure:

Updating [dbo].[TestMemory]
[PK__TestMemo__3214EC070D799003], update is not necessary…
0 index(es)/statistic(s) have been updated, 1 did not require update.

Updating [dbo].[TestMemory2]
[PK__TestMemo__3214EC07D3DC52DE], update is not necessary…
0 index(es)/statistic(s) have been updated, 1 did not require update.

Updating [dbo].[Test]
[PK_Test] has been updated…
[IX_Test_TextValue] has been updated…
2 index(es)/statistic(s) have been updated, 0 did not require update.

Ola Hallengren

Of course, if you’re looking to implement statistics update as part of regular maintenance, then you should definitely be considering using Ola Hallengren’s maintenance solution. Ola maintains a great solution for managing database backups and integrity checks as well index and statistics maintenance, and he shares it with the SQL community for free.

You can find full details of Ola’s solution and download it here:
https://ola.hallengren.com/

Got a problem or embarking on a SQL Server project and want some help and advice? I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.

Statistics and the Ascending Key Problem

July 21, 2023September 6, 2023 ~ Matthew McGiffen ~ 2 Comments

I’ve mentioned previously how not having up to date statistics can cause problems in query performance. This post looks at something called the Ascending Key Problem which can badly affect your cardinality estimation in some cases and therefore your execution plans.

The Ascending Key Problem relates to the most recently inserted data in your table which is therefore also the data that may not have been sampled and included in the statistics histograms. This sort of issue is one of the reasons it can be critical to update your statistics more regularly than the built-in automatic thresholds.

We’ll look at the problem itself, but also some of the mitigations that you can take to deal with it within SQL Server.

The Problem

Imagine you have a table that stores a set of events. As new records are inserted they get stamped with the current date and time. You regularly query that table based on that EventDate looking to find recent events, let’s say just for the current day.

Even if you haven’t indexed the EventDate column (though why haven’t you?!), as long as you have AUTO CREATE STATISTICS and AUTO UPDATE STATISTICS on for your database you’ll have “reasonably” up to date statistics for that column.

But “reasonably” may not be good enough. When you’re querying against the most recent data it may not yet have been sampled by a statistics update and the range you are looking for may fall beyond the top of the histogram captured in the statistics object for EventDate. Imagine that statistics were last updated yesterday. When the Optimizer checks the statistics to estimate a rowcount for today’s date it finds that is above the top bound. So what should it guess?

Historically it would guess that there were zero rows, but as always the cardinality estimation gets set to the minimum of 1. If the real answer is a lot larger you might end up with a bad plan.

Let’s look at that in practice.

Staying true to the example above, I create a table called Events and I index the EventDate column:

CREATE TABLE dbo.Events
(
Id INT IDENTITY(1,1) CONSTRAINT PK_Events PRIMARY KEY CLUSTERED,
EventName VARCHAR(255) NOT NULL,
EventDate DATETIME CONSTRAINT DF_Events_EventDate DEFAULT (GETDATE())
);

CREATE INDEX IX_Events_EventDate ON dbo.Events(EventDate) include (EventName);

Then I insert records to represent events at one minute intervals for 100 days:

--Insert data for 100 days at minute intervals from the start of this year
DECLARE @StartDate DATETIME = '20170101 00:00.00';

INSERT INTO dbo.Events(EventName, EventDate)
SELECT
    'Event' + CAST(num.n AS VARCHAR(10)),
    DATEADD(minute,num.n,@StartDate)
FROM 
(
    SELECT TOP 144000 row_number() OVER(ORDER BY (SELECT NULL)) AS n 
    FROM sys.objects a, sys.objects b, sys.objects c
) num;

I’m going to query to check what date range was inserted. That should have the additional advantage of triggering a statistics update:

SELECT MIN(EventDate), MAX(EventDate)
FROM dbo.Events;

As a slight digression, it’s interesting to look at the execution plan here:

You can see two index scans. That sounds horrendous, scan the index twice to find the MIN and MAX? If you look at the properties though you can see it only read one row in each case:

An index scan doesn’t have to read all the records in the index, it can bail out once it is satisfied. For a MIN or MAX type query it makes perfect sense just to jump to one end of the index and start scanning.

The side lesson is that Scans aren’t always bad for performance.

Anyway, back to the topic in hand. Now let’s look at the statistics:

DBCC SHOW_STATISTICS('dbo.Events',IX_Events_EventDate);

You can see they’re up to date. They show 144,000 rows in total which is correct. Interestingly the Histogram (bottom result-set) only has a couple of steps. SQL has determined that the data is uniformly distributed so has bunched it altogether. Clever stuff!

Let’s insert data for another day:

--Insert one more day's data
DECLARE @StartDate DATETIME;

SELECT @StartDate = MAX(EventDate) FROM dbo.Events;

INSERT INTO dbo.Events(EventName, EventDate)
SELECT
    'Event' + CAST(num.n AS VARCHAR(10)),
    DATEADD(minute,num.n,@StartDate)
FROM 
(
    SELECT TOP 1440 row_number() OVER(ORDER BY (SELECT NULL)) AS n 
    FROM sys.objects a, sys.objects b, sys.objects c
) num;

Now I query to see the new events. I captured the MAX(EventDate) earlier so let’s use that to find the new records:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170411'
OPTION (RECOMPILE);

(Notice I’ve added the option to recompile so I get a fresh plan each time I run this, that will be important for testing)

Let’s not bother with the results, we all know there will 1,440 records that are basically the same. Here’s the execution plan:

The interesting bit comes when I look at the properties for the Index Seek:

Estimated number of rows = 1, Actual = 1,440. That’s quite a long way out. Of course here we have a trivial query so the massive underestimate isn’t affecting our plan. If we started joining to other tables though it would likely result in a massively inefficient plan – perhaps choosing a Nested Loops join over a Hash or Merge.

Note I’m using SQL Server 2012 for this test and I’m not using the Traceflag (2371) which reduces the threshold for statistics updates ( When do Statistics Get Updated? ):

So I’ve got nearly another 30,000 rows to insert before statistics get automatically updated and my estimates come into line. If I’m always querying for the current day then it’s going to be very rare that statistics are going to be able to help me with a good estimate.

So what’s the fix?

Before we get on to the methods that have been introduced to try and ameliorate this problem, if you face this sort of scenario you might want to consider whether you need to update your statistics objects more often than the auto-stats threshold. If you have a regular job to rebuild fragmented indexes then those indexes that get rebuilt will have their statistics refreshed – however that won’t cover the auto created statistics, and it won’t cover statistics for tables that get don’t get rebuilt.

So, if you don’t have a specific scheduled job to regularly update statistics that is definitely worth considering.

In terms of how SQL has changed to help us, from SQL Server 2005 SP1, the nature of columns began to be tracked, monitoring updates of statistics to understand how the data changes. This additional information can be seen if you enable traceflag 2388, then view the statistics. Let’s have a look at what’s gathered. First I’m going to add a couple more days of data, updating the statistics between each insert, then I run the following:

DBCC TRACEON(2388);
DBCC SHOW_STATISTICS('dbo.Events',IX_Events_EventDate);
DBCC TRACEOFF(2388);

What you see here is historical information about the updates to the statistics. This is undocumented stuff, but some of what we see we can work out the meaning for. In particular we can see how many rows were inserted since the last statistics update, and how many of those values were above the top of the old histogram. We also see a column “Leading Column Type” which has a value of “Unknown”.

Now I’m going to insert another day’s date and update the statistics once more, then we’ll look at this again:

You can see that now we have a Leading Column Type of “Ascending”. After three updates to the statistics where the Leading Value was only increasing each time, SQL Server will identify that it is an ascending column. It must be at least three updates before SQL will recognise this, and if that stops being the case (i.e. some lower values are inserted) then the next statistics update will reset this until we again get three consecutive updates with only increasing values.

This happens in the background and you don’t need the traceflag 2388 to make it happen –that is just so you can see what is going on.

The obvious question is, now SQL knows my column is ascending, has that affected the estimation for my query? Before we look I’ll insert another day of data so there is some data beyond the histogram, and then I’ll query again:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE);

And the properties from the execution plan:

So nope. Nothing has changed.

Yet…

To tell the query optimizer to take advantage of this extra information for ascending keys we have traceflag 2389. Let’s enable that and run the query again:

DBCC TRACEON(2389);

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE);

The properties:

Voila! SQL Server has now estimated my rowcount perfectly.

Now, be warned. This is a rather contrived example with a perfectly and artificially smooth distribution of data. The estimate is made by checking the current maximum value in the table, and using that combined with the information existing in the statistics and the value of your predicate to extrapolate a guess. If you’re data is evenly distributed as it is here then the guess will be pretty good, if it is fairly skewed then it may be bad.

In any case though it will probably be better that the fixed value of 1 that would have been used historically.

One thing to note is that traceflag 2389 is only going to have any affect if the leading column of the relevant statistics object has been marked as ascending. There is also traceflag 2390, and this will adopt a similar behaviour even if your column hasn’t been identified as ascending, i.e. it will check the maximum value in the table and if it is higher than the max value in the histogram, it will extrapolate to work out the cardinality estimate.

So should you turn on the traceflag(s) globally?

The Microsoft recommendation is not to enable traceflags such as these unless you are suffering from the specific problem they are aiming to resolve, and even then to make sure you test carefully to ensure they are achieving what you desire.

One issue can be that in more complex queries there are a number of cardinality estimates being made. It can be that two bad estimates within the same plan might cancel each other out and the query overall performs fine. If you then implement something that fixes one of them, you risk such queries going bad – a scenario known as plan regression.

This sort of scenario is one of the reasons why Microsoft have made very few core changes to the cardinality estimator since it came out.

So, use 2389 is you are specifically encountering this sort of ascending key problem, but also, if you are in the position to change the code then you might want to consider adding it as a query hint so it only affects the specific query you are targeting. For our example query above, that would simply look like:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170415'
OPTION (RECOMPILE, QUERYTRACEON 2389);

Welcome to SQL Server 2014 (and later)

In 2014 we received a substantial revamp of the Cardinality Estimator, the first since SQL Server 7.0. A bunch of assumptions and algorithms have been re-jigged based on the real-world data that Microsoft have seen in supporting their wide customer base.

Key to having a new version was the concept that, henceforth, optimizer fixes would be tied to the database compatibility version. This means that customers can upgrade their SQL Server version but if they find performance problems related to the upgrade they can downgrade their database’s compatibility level while the issues are resolved within their codebase.

One of the items specifically looked at in the new version was this Ascending Key problem. To see how things work in the latest versions I’m going to repeat many of the steps above using a database deployed on SQL Server 2016.

So I:

Create my table again
Populate with the 100 days data
Run a query to check the dates, which has the added benefit of updating statistics
Add one more day’s data

Then I’m ready to run my test query again:

SELECT *
FROM dbo.Events
WHERE EventDate > '20170411'
OPTION (RECOMPILE);

I get the same execution plan as ever so again I jump to the properties of the Index Seek operator to look at the estimates:

Now, this is interesting. I might have expected I would get either 1 row estimated (the old model) or 1,440 (the model with traceflag 2389). Instead I get 432 rows. It seems the new CE (Cardinality Estimator) uses a different algorithm.

Sometimes numbers stick in your head. I happen to know that where no statistics are available and you are querying with an inequality predicate (<, > , <=, >=) that the CE will estimate the number of rows to be 30% of the total number of rows in the table. This assumption seems to have originated in a 1979 research paper from IBM suggesting 1/3 was a good guess.

With 30% in my head I noticed that 432 is 30% of 1440. So it seems that the optimizer is recognising that we are querying for values above the histogram (where no statistics exist) with an inequality, it knows from somewhere that there have been 1440 rows inserted since the last statistics update, so it takes 30% of 1440 to produce the estimate (432).

To try validate that theory I thought I’d query with a later datetime in the predicate. Sure enough, if I add 12 hours I still get 432 rows estimated. If I add 23 hours, 432 rows. In fact if I query for any date in the future, even outside of the maximum value in the table, guess what – I get an estimate of 432 rows.

I have a fascination for the algorithms involved in distribution statistics. It satisfies the maths geek in me. As such it’s difficult to end a post like this, there’s always more things to test, to try and work out. For instance what happens if you query across an interval that starts within the current histogram, but then extends above it? I’ll admit I’ve had a play, but will leave that for another post.

As a very final point in this post, I thought I’d just check whether the 2389 traceflag makes any difference to this estimation with the 2014 CE. I’ll change my query to look way into the future, enable the traceflag and look at the estimate:

SELECT *
FROM dbo.Events
WHERE EventDate > '99991231 23:59:59'
OPTION (RECOMPILE, QUERYTRACEON 2389);

Guess what? Still 432 rows… so no, the traceflag doesn’t appear to still give us any benefit.

Though when we get to that date it’ll be someone else’s problem to sort out!

When do Statistics get updated?

June 28, 2023June 28, 2023 ~ Matthew McGiffen ~ 5 Comments

Statistics objects are important to us for allowing SQL to make good estimates of the row-counts involved in different parts of a given query and to allow the SQL Optimiser to form efficient execution plans to delivery those query results.

Statistics get updated automatically when you rebuild (or re-organise) an index they are based on – but we only tend to rebuild indexes that are fragmented, and we don’t need fragmentation for statistics to be stale. We also may have many auto-created statistics objects that are not related to an index at all.

It’s generally recommended to have the database level setting AUTO_UPDATE_STATISTICS turned on, so that SQL can manage the process of keeping statistics up to date for us. The only excuse to turn it off is that you are managing the updates to stats yourself in a different manner. And you can always turn the auto update off at an individual table or statistics level if you need to, rather than for the whole database.

SQL Server has had the ability to automatically update statistics since version 7.0. Nonetheless for a long part of my career working with SQL Server, whenever a performance issue raised its head everyone’s knee-jerk response would be “Update Statistics!” In most cases though the people shouting that didn’t really understand what the “Statistics” were, or what mechanisms might already be in place for keeping them up to date.

Of course SQL Server isn’t perfect and sometimes it is helpful for human intelligence to intervene. But to provide intelligent intervention one has to understand how things work.

So how does the automatic updating of statistics work?

In the background SQL maintains a count of changes to tables that might affect statistics. This can be updates, inserts or deletes. So if I inserted 100 records, updated 100 records and then deleted 100 records, I would have made 300 changes.

When SQL forms an execution plan for a query it references various distribution statistics objects to estimate row-counts and to use that to try find the best plan. The statistics objects it looks at are referred to as being “interesting” in the context of the query.

Before using values from the statistics, the Optimizer will check to see if the statistics are “stale”, i.e. the modification counter exceeds a given threshold. If it does, SQL will trigger a resampling of the statistics before going on to form an execution plan. This means that the plan will be formed against up to date statistics for the table.

For subsequent executions of the query, the existing plan will be loaded from the plan cache. Within the plan, the Optimiser can see a list of the statistics objects that were deemed “interesting” in the first place. Once again it will check each of them to see if they are “stale”. If they are, an auto-update of the statistics object(s) will be triggered and once that is complete the plan will be recompiled, in case the updated statistics might suggest a better way of executing the query. Equally, if any of the statistics objects have been updated since the last execution then the plan will also be recompiled.

One important caveat to this is the database level setting AUTO_UPDATE_STATS_ASYNC (Asynchronously). Generally it is best to have this turned off, in which case the above behaviour is observed. If you turn it on however, in the case of stale stats the query execution will not wait for the stats to be updated, but will start them updating in the background while the query executes. The plan will only recompile to be based on the new stats at the next execution.

From SQL Server2008 R2 SP2 and SQL Server 2012 SP1 we had a new DMF (Dynamic Management Function) sys.dm_db_stats_properties that allows us to see how many row modifications have been captured against a given statistics object as well as when it was last refreshed, how many rows were sampled etc. Modifications are captured on a per column basis (though when statistics were originally introduced in SQL Server it was per table) so the counter will only be affected if the leading column for the statistics object has been affected by a given operation.

SELECT s.name AS StatsName, sp.* FROM sys.stats s CROSS apply sys.dm_db_stats_properties(s.OBJECT_ID, s.stats_id) sp WHERE s.name = 'IX_Test_TextValue'

Results:

So what are the thresholds?

For a long time the thresholds were as follows. Statistics were considered stale if one of the following was true:

The table size has gone from 0 rows to more than 0 rows
The table had 500 rows or less when the statistics were last sampled and has since had more than 500 modifications
The table had more than 500 rows when the statistics were last sampled and the number of modifications is more than 500 + 20% of the row-count when the statistics were last sampled (when talking about tables with larger row-counts a lot of the documentation just describes this as 20% as the additional 500 becomes less and less relevant the larger the number you are dealing with).

Those thresholds did mean that when a table had a large number of rows, Statistics might not get updated that often. A table with a million rows would only have stats updated if about 200,000 rows changed. Depending on the distribution of the data and how it is being queried this could be a problem.

So, in SQL 2008 R2 SP2 Microsoft introduced Traceflag 2371 which when set would reduce the stale statistics threshold for larger tables. From SQL 2016 this is the default functionality.

That adds the following test for statistics being stale:

If the number of rows (R) when the statistics were last sampled is 25,000 or more and the number of modifications is more than the square root of R x 1000:

Now, I’m just going to correct myself here, the documentation I’ve found SAYS the threshold is 25,000 but when I started to have a play that didn’t seem to be the case at all.

What actually seems to happen is that whichever of the two estimates is smaller gets used i.e

Either:

Or:

Whichever is smaller.

I don’t know if this means that both get evaluated and the smaller is used, or if the threshold between the two rules is simply defined at the point where the second formula gives the smaller result – which is after 19,682 rows. I discovered that threshold by solving where the two equations above would give the same result – then by experimenting to prove it in practice.

I think this incorrect stating of 25,000 as the threshold probably comes from confusion, taking an approximation (20%) as the actual figure. Remember I mentioned that people often generalise to say that statistics are stale after 20% of the rows change, and forget about the extra 500 rows. If that was true and it was exactly 20%, then the threshold would be 25,000 as that would be the point that both equations are equal.

Anyway it’s not even vaguely important to know that. I just found it interesting! Note that the tests above were carried out on SQL Server 2012 SP3 so could well be different on later versions.

To more visually understand the above rules, here’s a table showing the thresholds for some example table sizes under both the Old algorithm (without the traceflag) and the New algorithm (with the traceflag or on SQL 2016 or later).

R is the number of rows when the statistics were last sampled and T is the number of modifications for statistics to be considered stale:

You can see for the larger table sizes there is a massive difference. If you’ve got large tables you’re querying against and are having to update the statistics manually to keep them fresh then you may find implementing the traceflag is a help.

For large tables statistics are sampled when being updated rather than the whole table being necessarily being read. I have details on that in this post:

Automatic Sample Sizes for Statistics Updates

What are Statistics in SQL Server?

June 21, 2023June 21, 2023 ~ Matthew McGiffen ~ 1 Comment

Statistics are vitally important in allowing SQL Server to find the most efficient way to execute your queries. In this post we learn more about them, what they are and how they are used.

Cardinality Estimation

Cardinality is a term originally from Mathematics, generally defined as “The number of objects in a given set or grouping”. In SQL we’re continually dealing with sets so this becomes a very relevant topic, which in our context is just the “number of rows”.

When you have a query across multiple tables there any many ways in which SQL Server could decide to physically go about getting you the results you want. It could query and join the tables in any order and it could use different methods for matching records from the various tables you have joined together. It also needs to know how much memory to allocate for the operation – and to do that it needs to have an idea of the amount of data generated at each stage of processing.

A lot of this requires cardinality estimation, and SQL Server uses something called Statistics objects to perform that calculation.

Let’s look at a simple example:SELECT * FROM Person.Person p INNER JOIN Person.[Address] a ON p.AddressId = a.AddressId WHERE p.LastName = 'Smith' AND a.City = 'Bristol'

When it comes to gathering the results for this query there are a number of ways the database engine could go about it. For instance:

a) It could find all the records in the Person table with a LastName of Smith, look each of their addresses up and return only the ones who live in Bristol.
b) It could find all the Addresses in Bristol, look up the people associated with each address and return only the ones called Smith.
c) It could grab the set of people called Smith from the People table, grab all the addresses in Bristol, and only finally match up the records between those two sets.

Which of those operations is going to be most efficient depends very much on the number of records returned from each table. Let’s say that we have a million people called Smith, but there’s only one address in our whole database that is in Bristol (and let’s say that address does actually belong to someone called Smith).

In the first method above I would grab the million Smiths and then look their address up one by one in the address table until I found the one that lived in Bristol.

If I used method b) though, I would find the one matching address first, and then I would simply look up the owner of that address. Clearly in this rather contrived example, that’s going to be a lot quicker. So if SQL knows ahead of time roughly how many records to expect from each part of the query, hopefully it can make a good decision about how to get the data.

But how can it work out how many rows will be returned without actually running the query?

Statistics

That’s where statistics objects come in. SQL Server maintains in the background data that equates to a histogram showing the distribution of the data in certain columns within a table. It does this any time you create an index – statistics will be generated on the columns the index is defined against, but it also does it any time it determines that it would be useful. So if SQL encounters a Where clause on Person.LastName – and that column isn’t involved in a useful index, SQL is likely to generate a statistics object to tell it about the distribution of data in that column.

I say “likely to” because it actually depends on the settings of your SQL instance. Server configuration is beyond the scope of this post but suffice to say you can let SQL automatically create Statistics objects – or not. You can let it automatically update them when the data has changed by more than a given threshold – or not. And you can specify whether updates to statistics should happen asynchronously or synchronously – i.e. in the latter case if your query determines that statistics needs updating then it will kick that off and wait until the update is complete before processing the query.

It’s generally recommended that auto creation and updating is on, and async updating is off.

Viewing Statistics Objects
Let’s have a look at some actual statistics and see what they hold. There are a couple of ways of doing this, the first is through SSMS. If you look under a table in the object browser you will see a Statistics folder which holds any statistics objects relating to that table:

In the above example you can see some that have friendly names, these are Statistics that are related to an actual index that has been defined on the table and they have the same name as the Index – e.g. IX_Address_StateProvinceId.

You’ll also see we have some prefixed _WA_Sys and then with some random numbers following. These are statistics objects that SQL has created automatically on columns that aren’t indexed – or at least they weren’t indexed at the time the Statistics objects were created.

You can open these up with a double-click and see what’s inside:

This is the General tab you open up to. You’ll see it tells you what table the Statistics are for and what column(s). There are options for adding columns and changing the order – but you never really need to do this – as well as information to tell you when the statistics were last updated, and a check box if you want to update them now.

In the details tab there’s a lot more info:

I don’t find this the easiest display format to work with though, so rather than delving into what everything means here let’s look at the other way you can view statistics, which is by running the following command:

DBCC SHOW_STATISTICS('Person.Address', '_WA_Sys_00000004_164452B1')

The format is straightforward, you just specify the table you are interested in the Statistics for, and the actual name of the Statistics object you want. You can see the same information as if you double-clicked on it, but the results are output in the results pane like any other query and are (I think) a lot easier to read. Allegedly there will soon be a third way in SQL Server to view Statistics as DBCC commands are considered a bit “clunky” – but we don’t know what that will look like yet.

The command outputs three resultsets:

This post is just an introduction to statistics – and generally you don’t need to know that much, it’s just handy to understand the basics. So let’s just run over the key bits of information you can see above:

First of all in the first recordset – otherwise know as the…

Stats Header

Rows – is the number of rows in your table

Rows Sampled – this is how many rows were sampled to generate the statistics. SQL can generate or update statsitics using sampling rather than reading all the rows. In this case you’ll see it did actually read the whole table.

Steps – If you imagine the statistics as a bar chart – this is the number of bars on the chart. Statistics objects have a maximum of 200 steps so if you have more distinct values in your column than that they will be grouped into steps.

Density – This is supposed to be the probability of a row having a particular value (calculated as 1 / Number of Distinct values in column). According to books online “This Density value is not used by the query optimizer and is displayed for backward compatibility with versions before SQL Server 2008.” I am using SQL 2012, and this number is just plain incorrect so don’t use it…

Recordset Number 2: The Density Vector

All Density – this is the accurate version of the Density statistic described above. So your probability of a given row having a specific value is about 0.0017. That’s a little less than one in 500. I happen to know there are 575 different Cities in the table so that makes sense. Sometimes SQL will use this value to form a plan – if it knows you’re going to search this table for a specific City and it doesn’t know that City when it makes the plan, then it could guess that about 1/500th of the rows will match your criteria.

Average Length – Is what it says on the can. The average length of data in this column.

Columns – The names of any column measured in this statistics objects. You can have statistics across multiple columns but I’m not going to cover that in this post. In this case it tells us these statistics are based on the “City” column.

Recordset Number 3: The Histogram

This last recordset shows the distribution of the data, and is what you could effectively use to to draw a graph of the relative frequencies of different groups of values. Each row represents a step – or bar on the bar chart – and as mentioned above there can be a maximum of 200 steps. So you can see, a statistics object is quite lightweight, even for a massive table.

RANGE_HI_KEY – This upper limit of each step, so each step contains all the values bigger than the RANGE_HI_KEY of the last step, right up to and including this value.

RANGE_ROWS – This is how many rows in the table fall in this range – not including the number that match the RANGE_HI_KEY itself.

EQ_ROWS – The number of rows equal to the HI_KEY

DISTINCT_RANGE_ROWS – The number of different values in the range that there is data for (excluding the HI_KEY).

AVERAGE_RANGE_ROWS – The average number of rows for a given value within the range.

That’s a whistle-stop tour of the Statistics SQL Server holds on your data.

The algorithms that SQL then uses to calculate the number of rows for a given part of your query are pretty transparent when there’s just one column involved. If we look at the above example and let’s say you wanted to look up the rows where the City is “Abingdon” – the statistics tell us there is 1 matching row and that’s the figure SQL will use for cardinality estimation. Where a value is within a range then it will use a calculation based on the AVERAGE_RANGE_ROWS.

When there’s multiple columns involved it’s more complicated there are various algorithms and assumptions that come into play. If you’re interested in digging deeper, one very good resource is the Whitepaper on the 2014 Cardinality Estimator written by Joe Sack: https://www.sqlskills.com/blogs/joe/optimizing-your-query-plans-with-the-sql-server-2014-cardinality-estimator/

Conclusions

The main takeaway from this should just be the understand the extent – and limitations – of the information about the distribution of your data that SQL holds in the background.

If, when you’re tuning queries, you notice that the estimated row counts don’t match the actual, then this could be encouraging SQL to form a bad plan for the query. In these cases you might want to investigate what’s going on with the statistics.

Maybe your query is written in a way that it can’t use statistics effectively, one example of this can be where you store constant values in variables, then query using that variable in a WHERE clause. SQL will then optimise based on the average, rather than on your actual value.

Maybe the plan is based on one data value that has a very different cardinality to the one currently being queried. For instance when you first run a stored procedure, the plan is formed based on the parameters passed. Those parameters could have a cardinality that is quite different to those used in later executions.

Maybe the statistics are out of date and need refreshing. Stats get updated when approximately 20% of the data in the table changes, for a large table this can be a big threshold, so current statistics may not always hold good information about the data you are querying for.

Or maybe SQL is doing as good a job as it can with the information it has at its disposal. You might need to work out how you can give it a little extra help.

Here are some other recent posts about Statistics you may find useful:

When do Statistics get updated?

Rowcount estimates when there are no Statistics

Indexes and Statistics with Always Encrypted

June 20, 2023June 20, 2023 ~ Matthew McGiffen ~ 1 Comment

In a previous post we looked at executing queries against columns encrypted using Always Encrypted. In this short post we look at the ability to be able to have (or not) indexes on those columns. This information is relevant to the case where you are using Always Encrypted without enclaves, we’ll look at working with enclaves later on.

Where we want to query with a predicate, it is often useful to have an index on the column being searched. You can only have indexes on columns using deterministic encryption – not randomized. This makes sense as a column with randomized encryption is going to have different values stored even when the underlying plaintext value is the same, so an index would be meaningless. You also can’t perform comparisons on such columns so an index wouldn’t be useful. An index on a column deterministically encrypted with Always Encrypted will simply be an index of the encrypted values and as such is only useful for equality comparisons, not for sorting or range queries – though neither of these functions are possible against our deterministically encrypted column anyway.

To create an index on a column using deterministic encryption is the same as if the column is not encrypted. Here is an example using the database and table created in the previous posts Setting Up Always Encrypted and Executing Queries Using Always Encrypted. In our table dbo.EncryptedTable we have the columns LastName (which is encrypted using deterministic encryption), and FirstName (which is encrypted using randomized encryption):

CREATE NONCLUSTERED INDEX IX_LastName ON dbo.EncryptedTable(LastName);

While we can’t index on our columns using randomized encryption, we can include them in the leaf pages of an index so that their values do not have to be retrieved using a Key Lookup operation. So the following SQL would also work:

CREATE NONCLUSTERED INDEX IX_LastName_Include_FirstName ON dbo.EncryptedTable(LastName) INCLUDE(FirstName);

In common with any other index, a statistics object is created that maintains information about the distribution of data in the index. We can view the data held in the statistics object with the following command:

DBCC SHOW_STATISTICS('dbo.EncryptedTable','IX_LastName');

Here we see the output:

We only have a couple of rows in our table, but if you’re familiar with statistics objects, you’ll see this is pretty much the same as we would see where there is no encryption. What is interesting is to look at the last result set. Normally this would show actual data values in the RANGE_HI_KEY column; here we can see that we have an encrypted value. From this, we can understand that the statistics are based on the encrypted values. This underlines the point already made, that our index is an index of the encrypted values of our data. That makes sense as SQL Server is never aware of the unencrypted values, so it wouldn’t be possible for it to create or maintain indexes (or statistics) of the plaintext values.

This is different when using always encrypted with enclaves where indexes on columns with randomized encryption are allowed. In that case, those indexes, and associated statistics objects, are actually based on the plaintext values. As mentioned, we’ll talk about how that works in a later post.

This post is part of a comprehensive series on SQL Server Encryption. Subscribe to my blog for updates as new posts are published or you can buy my book through the link in the sidebar to get it all in one go.

And if you’re embarking on an encryption project (or anything else to do with SQL Server) and want some help, I’m available for consulting – please get in touch or check out my services page to find out what I can do for you.

Under Promise and Overdeliver

June 19, 2023June 19, 2023 ~ Matthew McGiffen ~ Leave a comment

There’s a monthly “blog party” called T-SQL Tuesday that invites blog posts on a given topic related to SQL Server and data. This month Gethyn Ellis has asked people to share the best piece of advice they’ve received in their careers.

For me that’s an easy answer. “Under promise and overdeliver” is a practice that has held me in great stead all through my IT career. I first came across the phrase in my very first development job, working at BT back in the 90s. Our team had a manager, Mike Taylor, who would say this to us on a regular basis.

If you under promise and overdeliver then you will always have happy clients.

Let me give an example. If I’m asked to look at a poorly performing SQL Server instance for a client, and I know that it’s not had much love from an optimization point of view, then I can be pretty confident I can knock the CPU consumption in half and to be honest I’m going to be disappointed if I can’t reduce it by between 75 and 90%. The client however just wants it back running within acceptable parameters. If it’s running close to 100% CPU and causing timeouts in the client, they just want that under control. They will be happy if I get that back down to 70% and the timeouts go away – so that’s the expectation I will set as a minimum target. They will be very satisfied when I achieve that, very happy when I exceed that and get things down to 50% and ecstatic if it goes down to 25% or less – especially once I mention that they have the potential to reduce the size of kit the instance is running on and save a lot of money.

If I start out by saying I’ll reduce it by 75% but only reduce it by half then they will be satisfied, but I’ve given them a disappointing outcome compared to what I promised.

Another example is in timescales when promising delivery of work. It’s usually much better to say that a fix will take a day and deliver it early or on time, and to a high quality, than to say you can knock it out in a couple of hours and then having to keep pushing that back due to unforeseen problems. Worse, when you over-promise, you end up having to spend time on update calls and on resetting expectations, which means the task ends up taking you longer. You are also rushed and it’s much more likely that poor quality will creep in.

Related to timescales, another piece of advice I’ve come across was a rule of thumb for estimation. Basically, when trying to estimate how long a piece of work might take, give it your best estimate, then double it. You might be surprised how often that’s much more accurate than the original figure. Partly because we generally want to please and tend to be over-optimistic when trying to figure out how long something will take us.

We’ve all been on the other side of this. How frustrating is it when you have a problem with a service you use and contact support to be told it will be fixed in a given timeframe and that doesn’t happen. Or that you will be contacted by a certain time, but you don’t get a call or a message. More of your time gets wasted chasing for an update – which also wastes time for the organisation the promise came from. Much better to be given a realistic expectation at the outset – even if that’s disappointing. You can at least then look at how you’re going to work around the issue in the short-term.

It’s sometimes difficult, because you have to be firm about the expectation you set at the outset, but once you have delivered a solution, a fix, or other deliverable that exceeds that expectation, people are always happy and that it what they remember after the event. If you have over-promised, that it also likely to be what they remember.

So that’s the advice I was given and I pass it on to you – “Under promise and overdeliver”.

Got a problem and could do with some assistance? Get in touch today to see how I can help.

What Happens in the Background when Executing Queries with Always Encrypted

June 15, 2023June 15, 2023 ~ Matthew McGiffen ~ 1 Comment

In previous posts we’ve looked at a number of aspects of Always Encrypted and how it works. I think it’s very useful to understand how it actually works in practice, so in this post we’re going to look at what happens in the background when you execute a query.

I’m continuing on from the examples in the last few posts on this subject. So if you want to repeat the examples for yourself you’ll need to have followed steps from Setting Up Always Encrypted and Executing Queries Using Always Encrypted.

To capture what happens in the background we’re going to run an XEvent Profiler trace so we can see what actually gets sent to the database. We start an XEvent Profiler trace from SSMS. You generally find it as the bottom item for your server in the Object Explorer. We’ll use the TSQL option, so right-click on that and select Launch Session:

What Happens with an Insert Query

Once the session has opened, we re-run our insert query from the last post and see what is captured. Here is the SQL:

DECLARE @LastName nvarchar(50) = 'McGiffen'; DECLARE @FirstName nvarchar(50) = 'Matthew'; INSERT INTO dbo.EncryptedTable (LastName, FirstName) VALUES (@LastName, @FirstName);

You can see the results below

We can see there are two queries captured – normally we would just expect to see the one we executed. What we’re seeing here is what was described in What is Always Encrypted and how does it work? when we looked at the steps needed to execute a query over a connection with Always Encrypted enabled. Before executing the query itself, the client must issue a command to retrieve the encryption metadata. It needs to understand if any of the columns targeted by parameters are encrypted, and if so, it needs information about the keys. Let’s look at the query in full that is executed:

exec sp_describe_parameter_encryption N'DECLARE @LastName AS NVARCHAR (50) = @pacc87acf4618488b80bc61f6ac68114f; DECLARE @FirstName AS NVARCHAR (50) = @p4113aa748f2e4ff585556f8eaa618f0d; INSERT INTO dbo.EncryptedTable (LastName, FirstName) VALUES (@LastName, @FirstName); ',N'@pacc87acf4618488b80bc61f6ac68114f nvarchar(50),@p4113aa748f2e4ff585556f8eaa618f0d nvarchar(50)'

Retrieving the encryption metadata is achieved by this call to the sp_describe_parameter_encryption stored procedure. SSMS has parameterized our query and then sent that text over to be parsed by the engine. We can execute the same query call to sp_describe_parameter_encryption ourselves and see what is returned. We get two result sets. There’s a fair amount of information returned so I’ll display them as tables rather than screenshots to make things more readable. The first result set gives us information about any Always Encrypted keys we will need to execute our query. In this case there is only one key involved in our query so we just get one row returned, but if there were multiple keys required, we would get a row for each.

There are a few key columns here:

column_encryption_key_ordinal is an assigned ordinal value used to identify the key and is useful when multiple keys are involved in the query.
column_encryption_key_encrypted_value is the encrypted value of the CEK as stored in the database.
column_master_key_store_provider_name tells the client where the CMK is stored.
column_master_key_path identifies the path to the actual certificate used as the CMK.

Armed with this information, the client driver can obtain the local copy of the CMK and use that to decrypt the encrypted CEK value to obtain the actual encryption key.

The second result set gives details of which parameters target encrypted columns:

Here we have two parameters as we are inserting a value into each of the two columns. The results are fairly clear, but let’s just go over a few of the values and what they mean:

parameter_ordinal is an ordinal value for the parameter.
parameter_name is the name of the parameter as supplied to the sp_describe_parameter_encryption stored procedure.
column_encryption_type describes whether encryption is deterministic or randomized. 1 means randomized and 2 means deterministic. You might also see the value of 0 which means plaintext where a parameter targets an unencrypted column.
column_encryption_key_ordinal denotes which key in the first result set (where there are more than one) should be used.

Armed with all this information and with the CEK now decrypted, the client driver can encrypt the parameters as required, and finally issue the query to SQL Server. This is the second query we saw in our XEvent Profiler capture:

exec sp_executesql N'DECLARE @LastName AS NVARCHAR (50) = @pacc87acf4618488b80bc61f6ac68114f; DECLARE @FirstName AS NVARCHAR (50) = @p4113aa748f2e4ff585556f8eaa618f0d; INSERT INTO dbo.EncryptedTable (LastName, FirstName) VALUES (@LastName, @FirstName); ',N'@pacc87acf4618488b80bc61f6ac68114f nvarchar(50),@p4113aa748f2e4ff585556f8eaa618f0d nvarchar(50)',@pacc87acf4618488b80bc61f6ac68114f=0x01F82323F52B604A838ABC880ECDEB6CDD26ED47813F507A2EAA78FA1EE10FF47B2ED7C73C1A76580B6C0753A95DF5C944C5E590C2ED7E0AF59F1B4054317018584A9B8E3B4B0D9C4341B32DE2990E22C1,@p4113aa748f2e4ff585556f8eaa618f0d=0x012AC8899AACB8F1DDCEF4F6B2EB090F5E56687FDBB67D237E0E3D6D91C7F96C29F39396C633FB27DD92C7F2FABC18600D154FE1D426000CDB401ECD8BFD04AAC3

Here the parameterized version of our query has been issued to SQL Server with the plaintext parameters encrypted ready for insertion into the table. The long binary values at the end of the query represent the values “Matthew” and “McGiffen” encrypted using the correct CEK.

Before we move on to other queries, let’s pause and think about what this means for us in practice. The key thing to understand is that this call to sp_describe_parameter_encryption and the encryption of our parameters within the client driver is core to how Always Encrypted works. Plaintext values that target encrypted columns in the database must be sent via parameters. If your application only interacts through the database via parameterized queries – which is good practice – then you are in a good starting place in terms of implementing Always Encrypted. Otherwise, you are likely to have significant refactoring to do.

Another item worth discussing is the extra call that executing your query entails. It might seem this is going to add performance overhead. In reality the call is very lightweight with CPU consumption of less than a millisecond. On top of that, the results get cached locally so you don’t often see repeated calls to get the metadata for the same query. As a consequence of those two points, the performance overhead is generally negligible.

What Happens with a Select Query

Let’s repeat the exercise above and look at what happens when we execute our select query. We’ll start a new XE Profiler session and look at what is captured when we run the following select query:

SELECT * FROM dbo.EncryptedTable;

Here we see the XE Profiler output:

In this case we’re not seeing the extra call to get the metadata which can seem slightly confusing. Surely the client driver needs that information to be able to decrypt the encrypt columns that are returned. The reality is that that metadata does get sent back, but it doesn’t need to be asked for. The fact that our connection has encryption enabled means that SQL Server knows it needs to supply that information back to the client, so it does so automatically. In the former case with an insert, we needed the metadata before we could issue the query which is what necessitated the extra call.

For the select query, the results are returned to the client encrypted, along with the metadata. The client driver then decrypts any columns if required to do so before passing the results back to the client application.

Summary

We can see it is a fairly simple mechanism that supports executing queries against encrypted columns. A combination of automated actions in the client driver with calls to sp_describe_parameter_encryption to retrieve encryption metadata – such as which columns are encrypted, the encrypted Column Encryption Key (CEK) and the location of the Column Master Key (CMK).

It’s not strictly necessary to understand this – but I think it really helps in making clear why there are restrictions in place on how you can interact with encrypted columns, as well as to understand why certain things you may want to attempt will or wll not work.

Matthew McGiffen Data

Author: Matthew McGiffen

What I Love About My Job

T-SQL Tuesday 167 – Encryption and Data Protection Roundup

T-SQL Tuesday: Encryption and Data Protection

Manually updating Statistics in SQL Server

UPDATE STATISTICS

FULL SCAN or SAMPLE

sp_UpdateStats

Ola Hallengren

Statistics and the Ascending Key Problem

The Problem

So what’s the fix?

So should you turn on the traceflag(s) globally?

Welcome to SQL Server 2014 (and later)

When do Statistics get updated?

What are Statistics in SQL Server?

Cardinality Estimation

Statistics

Stats Header

Recordset Number 2: The Density Vector

Recordset Number 3: The Histogram

Conclusions

Indexes and Statistics with Always Encrypted

Under Promise and Overdeliver

What Happens in the Background when Executing Queries with Always Encrypted

What Happens with an Insert Query

What Happens with a Select Query

Summary