Thursday, August 13, 2020

Incrementally troubleshooting production server issues as sculpting marble

Near the beginning of Steve McConnell's classic book on software construction, Code Complete -- it's the one where Jeff Atwood's "Coding Horror" icon image originally came from -- several analogies for the process of creating software are considered, including penmanship, farming, and oyster pearl harvesting, before McConnell finally settles on "construction."

At my job, although I'm primarily a software developer, I do also occasionally put on the hats of DevOps specialist and DBA. In particular, this happens when a legacy server that I'm responsible for experiences production issues, and I'm one of the few with the knowledge and experience to figure out what's wrong, and get it up and running smoothly again.

This morning, while making another pass at troubleshooting a stubborn partial outage, another analogy occurred to me, this one for production software maintenance: sculpting! The outage had recently started cropping up on a near-nightly basis, degrading the performance of the production website and/or database for an hour or two in the early morning.

Michelangelo's "David"
 
In troubleshooting this outage, my colleague and I have spent the past several days making a single change that might resolve the issue, then waiting to see whether the issue recurs. If it does, we make another change.
 
It occurred to me that this is not so dissimilar from the idea attributed to the sculptor Michelangelo: when beginning a sculpture, the desired figure is already present in the block, and the excess marble that is not part of the statue just needs to be chiseled away.
 
By way of example, here are some of the incremental changes we attempted, in working towards resolving our persistent nightly outage: 
  • Monitoring the performance of the database during the outage, we observed that a particular stored procedure was accounting for much of the runtime. We optimized that stored procedure by improving a couple of the existing table indexes used in the sproc's query plan, to reduce the overall number of lookups the stored procedure needed to perform.
  • We altered one of our nightly scheduled jobs that was running concurrently with the outages to avoid calling the possibly-problematic stored procedure at all; instead, we fulfilled the job's data needs by adding some output fields to another query the job was already making.
  • Examining the raw server logs of incoming HTTP requests, we observed that a number of poorly-behaved bots (no user-agent strings; no reasonable rate limiting) were hitting our site very aggressively during the outage period -- and that the pages being hit were triggering calls to that same stored procedure. As at least an interim fix, we put rules in place to prevent the pages being hit from being rendered in response to those particular requests.
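
The log check in that last item could be sketched roughly as follows. This is a minimal Python illustration rather than the actual tooling involved: it assumes the logs are in the common "combined" access log format, and the names parse_line, find_suspect_clients, should_block, and the max_requests threshold are all hypothetical. It counts total requests per client across whatever lines it is given, which is only a crude stand-in for real rate limiting.

```python
import re
from collections import Counter

# Assumption: "combined" access log format. The real server's format may differ.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return the named fields of one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def find_suspect_clients(lines, max_requests=100):
    """Map each suspicious client IP to the reasons it was flagged.

    A client is flagged if it sent any request with no user-agent string,
    or sent more than max_requests requests in the supplied lines.
    """
    counts = Counter()
    blank_agent = set()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines that are not in the assumed format
        counts[entry["ip"]] += 1
        if entry["agent"] in ("", "-"):
            blank_agent.add(entry["ip"])

    suspects = {}
    for ip, total in counts.items():
        reasons = []
        if ip in blank_agent:
            reasons.append("no user-agent")
        if total > max_requests:
            reasons.append(f"{total} requests")
        if reasons:
            suspects[ip] = reasons
    return suspects


def should_block(entry, suspects):
    """Interim rule: refuse to render pages for any flagged client."""
    return entry["ip"] in suspects


# Made-up sample lines, using documentation-reserved IP addresses.
sample = [
    '203.0.113.5 - - [13/Aug/2020:03:10:01 +0000] "GET /report?id=1 HTTP/1.1" 200 512 "-" "-"',
    '203.0.113.5 - - [13/Aug/2020:03:10:01 +0000] "GET /report?id=2 HTTP/1.1" 200 512 "-" "-"',
    '198.51.100.7 - - [13/Aug/2020:03:10:02 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
suspects = find_suspect_clients(sample, max_requests=1)
print(suspects)
```

In practice the threshold would need tuning against normal traffic, and a rule like should_block would more likely live in the web server or firewall configuration than in application code.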

Each of these steps of trimming off poorly-performing code and problematic incoming requests has felt a bit like chiseling away some stone, hopefully in the end arriving at a nice statue -- or in our case, a stable production system!
