Saturday, April 23, 2022

Investigation and fix: A SQL Server stored procedure slowdown

What happened?

At work this past week, I was alerted to a problem where a background job which processes incoming messages from a particular queue had slowed down dramatically. As a result, a significant delay had developed in providing certain event notifications to our customers.  

Our monitoring tooling showed that the culprit was a particular SQL Server stored procedure -- one that gets invoked as a part of processing each message from the queue. Execution time of that sproc had suddenly increased from near-instant to an average of 4 seconds, with spikes up to around 20 seconds. (The queue processing job is multi-threaded, so we were still processing about one message per second, but that was no longer keeping up with the incoming volume of messages during the busy hours of each day.)

No code deployments or changes to the environment had been made around the time of the initial slowdown; this problem had started occurring without an obvious root cause. 

Our investigation

The stored procedure in question comprises about a dozen individual SQL statements and queries. It wasn’t immediately clear which individual statement in the sproc was responsible for the slowness.

After a few initial investigative paths didn’t yield results, I decided to run a short trace on the live production database. The trace revealed that one particular SELECT query was consistently taking a long time to run during each run of the stored procedure.

Interestingly: When I ran that same SELECT query SQL manually, it completed near-instantly. However, when I ran some minor variations on that SELECT query, they did take a long time to execute.

Execution plans and indexes

SQL Server internally generates an “execution plan” for each query that it runs, whereby it decides how it will go about performing the operation that the query's SQL calls for. The SQL Server Management Studio tool is able to display these execution plans in a nice graphical format.

When the problematic SELECT SQL was run manually, the execution plan showed that it was – correctly – leveraging an existing index on the target table. That index had been originally added to support this specific query. It includes fields needed by the query’s lookup criteria.

Looking back at the results of the trace, the problematics SELECT queries being run during execution of the stored procedure were being run using a different execution plan – not leveraging the custom-designed index at all! A full table scan was being run by SQL Server to gather the query results. As that  table currently sits at a total row count of over 150 million, those queries were taking a while to run!

The specific SQL being run (simplified a bit here for readability) was:

SELECT TOP 1 ID
FROM myTable t
WHERE t.CreationDate > DATEADD(day, -30, GetDate())
  AND (t.Email = @Email OR t.PhoneNumber = @PhoneNumber)
  AND t.CompanyID = @CompanyID)
ORDER BY t.ID ASC

The custom index includes the following fields:
  •     CompanyID
  •     Email
  •     CreationDate
  •     PhoneNumber

Can you spot the problem?



Take a look at the ID field. That field isn’t part of the index; but it is being used in the SQL’s ORDER BY clause! And that clause, as it turns out, is used by the database as one of the factors in how it seeks for the single (TOP 1) row to return.

Through its own inscrutable internal algorithms, prior to April, SQL Server had been using an execution plan that did leverage the custom index for this query, despite the presence of the ID field in the ORDER BY clause. After April, some tipping point in the database’s characteristics was evidently hit that made SQL Server decide to no longer use that index in its execution plan during runs of the stored procedure.

The fix was to adjust the SELECT query to ORDER BY CreationDate, rather than by ID. This had the same logical effect with regard to selection of the returned row; and it got SQL Server to once again use our custom index while running the stored procedure, resulting in the resumption of speedy performance in processing of messages from the queue.

How did we verify that the fix worked?

  • We observed that new records continued to be inserted correctly (and now, a lot more quickly!) into the database table.
  • Execution times for the stored procedure went way down. We could see this both from the database itself, and from the our queue's web interface – the green line here is the rate of messages being processed off the queue, with the fix having been applied just after the 10:20 mark:

So what are the lessons here?

  • Having the ability to determine which specific SQL statement was the root cause of the downstream slowness was key to being able to devise a fix. (The fact that our monitoring instrumentation only reported to the granularity of which stored procedure was running slowly -- but not which individual SQL statement within the sproc was slow -- made this trickier.)
  • The availability of multiple weeks of monitoring data from our instrumentation did make it quick and obvious to confirm that there had been a dramatic change in performance from one of our stored procedures, which was the cause of the problem.
  • When a particular SELECT statement is slow, ensuring that a properly-matching database index is in place – and that index is being applied to live queries in the expected manner – is a very good thing to check!

No comments:

Post a Comment

Non-spammers: Thanks for visiting! Please go ahead and leave a comment; I read them all!

Attention SPAMMERS: I review all comments before they get posted, and I REPORT 100% of spam comments to Google as spam! Why not avoid getting your account banned as quickly -- and save us both a little time -- by skipping this comment form and moving on to the next one on your list? Thanks, and I hope you have a great day!