Debugging Outside Your Comfort Zone: Diving Beneath a Trusted Abstraction

Key Takeaways

  • Even unfamiliar parts of our software stacks are ultimately just software that we can turn our debugging skills to – there’s no magic.
  • Simple config which works fine in normal situations can lead to disaster in the presence of coordinated failures.
  • Automated orchestration is a useful and necessary part of running production systems, but does reduce our understanding of how those systems work.
  • When you’re trying to reproduce an issue in any kind of test environment, always remember to question how well that environment matches production.
  • There are some lessons better learned from field reports than textbooks, so please share your debugging stories!

Today we’re going to take a deep dive through a complex outage in the main database cluster of a payments company. We’ll focus on the aftermath of the incident – the process of understanding what went wrong, recreating the outage in a test cluster, and coming up with a way to stop it from happening again. Along the way, we’ll dive deep into the internals of Postgres, and learn a little about how it stores data on disk.

What happened

This outage wasn’t subtle. The database that powered all the core parts of our product – a payments platform – was down. We very quickly went from everything working to nothing working, and the pager was about as loud as you’d expect.

Knowing what was broken was easy. An incident like this produces plenty of alerts that point you in the right direction. When we logged onto the primary node in our Postgres cluster, we saw that the RAID array which held the data directory had failed. Not great, but it shouldn’t have caused an outage like this, as we were running a three-node cluster of Postgres servers managed by Pacemaker.

What was more difficult to track down was why Pacemaker hadn’t failed over to one of the replicas like we expected. We spent some time looking through its logs and using its CLI to try and get it to retry the failover, but we couldn’t manage to recover it.

Restoring the system to a working state

We spent about an hour trying to get the cluster automation working again, which is already a long outage for a payments platform. At that point, we had to change tack, and decided to ditch the Pacemaker cluster automation and promote one of the replicas to primary by hand.

When I say “by hand”, I mean it. We SSH’ed into the two remaining replicas, switched off the Pacemaker cluster, and promoted one of them to be a new primary. We reconfigured our application servers to connect to that server, which brought the site back up.

With the product working again, we could at least briefly breathe a sigh of relief, but we knew that we were going to have to figure out what happened with the cluster automation and how to fix it.

Investigating and recreating the outage

This ended up being the most involved outage of my career so far – hopefully it stays that way.

All in all, we spent around two weeks figuring out what went wrong and how to fix it. Since this was such a high impact issue, we were really keen to understand exactly what happened and fix it.

We split our Core Infrastructure team of five into two sub-teams: one focused on recreating the incident in an isolated test setup, so that we could make changes to prevent it from happening in future, and the other focused on safely restoring us from our now hand-managed cluster back to a setup managed by the clustering software.

One thing that made it tricky was that we needed a test environment which would let us rapidly perform experiments to test different hypotheses. We did get slightly lucky there: only a few months before, we’d been working on improvements to our cluster automation setup, and as part of that we’d put together a Docker Compose version of the cluster setup.

We settled on using that containerised setup to run our experiments. It made it easy to spin up a fresh environment, run and then inject different combinations of faults into it to try and reproduce the outage, then tear it down in a matter of a minute or so.

From there, we gathered as much information as we could from the production logs we captured on the day of the incident. Within them, we found five potential avenues of investigation:

  1. The RAID 10 array on the primary node simultaneously losing 3 of its 4 disks, and becoming unusable
  2. The kernel setting the database’s filesystem read-only on that node
  3. The clustering software (Pacemaker) detecting the failure of the primary, but not failing over to a replica
  4. A crash in one of the subprocesses of the synchronous replica, which caused it to restart
  5. A suspicious log line about replication data received by the synchronous replica

From there it was a case of writing scripts to inject the same faults into our test cluster. While those scripts weren’t written with the intention of being easy to run by people outside of the team, you can see a slightly tidied up version of that script here if you’re curious exactly what we did.

It took us a lot of trial-and-error, but by making small adjustments to the faults we were injecting, we eventually arrived at that script and were able to reproduce the exact same failure we’d seen in production.

The starting point: low-hanging fruit

When you’re presented with several different leads in an issue you’re debugging, you need some way of narrowing down and focusing on the ones that are most likely to get you to an answer in a reasonable amount of time.

There are plenty of more subtle factors involved, but I find that a good rule of thumb for evaluating an avenue of debugging is to consider how likely it is for that avenue to be a contributing factor in an issue, and how much effort it will take to investigate.

The best leads to pursue are high likelihood and low effort. Even if they end up being wrong, or just insufficient in isolation, you don’t waste much time looking into them. Conversely, you’ll want to move anything with a low likelihood of contributing to the incident and high effort to recreate lower down the list. Everything else lives somewhere in the middle, and prioritising those is slightly more down to gut feeling.

In our case, we believed it was highly unlikely that the precise way in which the primary failed (the RAID array breaking) was a necessary condition for automated failover not to work in our Pacemaker setup. It was also high effort to recreate – potentially well beyond any reasonable effort (it could have been down to a firmware bug in a RAID controller, or a brief dip in power in the machine). That left us with a list that looked like this:

  1. The RAID 10 array on the primary node simultaneously losing 3 of its 4 disks, and becoming unusable
  2. The kernel setting the database’s filesystem read-only on that node
  3. The clustering software (Pacemaker) detecting the failure of the primary, but not failing over to a replica
  4. A crash in one of the subprocesses of the synchronous replica, which caused it to restart
  5. A suspicious log line about replication data received by the synchronous replica

Avenues 3-5 all seemed relatively likely contributors, but 5 was going to be high effort, so we started with 3 and 4.

The first version of the script we wrote to try and reproduce the outage did just two things after setting up the database cluster:

# on primary - forceful kill of main process
# (very rough approximation of hard failure)
kill -SIGKILL 

# on synchronous replica - subprocess crash
# (same signal as seen in production logs)
kill -SIGABRT 

Sadly, it wasn’t going to be that easy – those two actions weren’t enough to reproduce the outage. We’d need to turn our attention to lead number 5 – the suspicious message we’d seen about a replication issue.

The logs that had us scratching our heads

There was a pair of log lines that really jumped out at us when we were reviewing all the logs we’d pulled from production on the day of the incident, and had us stumped for a while:

2023-02-24 17:23:01 GMT LOG: restored log file "000000020000000000000003" from archive

2023-02-24 17:23:02 GMT LOG: invalid record length at 0/3000180

One of the Postgres replicas was complaining about some data it had received via replication (specifically through the WAL archive/restore mechanism, which we were using alongside streaming replication). We thought that might be part of what caused the clustering software not to fail over to a replica when the primary failed, and we wanted to understand exactly what was causing this error to be logged on the replica.

Figuring out exactly what caused it meant diving deep into the Postgres internals, way below the abstractions we’d taken for granted in many years of running it in production. Specifically, we needed to figure out how to take a chunk of the write-ahead logs (WALs), break it in a very specific way, and inject it into the synchronous replica. If we saw the same log line, we’d know we’d done it right.

After a mixture of looking through the Postgres source code, and staring at WAL files from our test cluster in a hex editor, we were able to piece together enough information to know what we needed to add to our reproduction script. Before we look at the script itself, let’s take a quick run through what we found.

First off, we found the part of the Postgres source code where that log line was being emitted:

    /* XXX: more validation should be done here */
    if (total_len < SizeOfXLogRecord)
        report_invalid_record(state, "invalid record length at %X/%X",
                              (uint32) (RecPtr >> 32), (uint32) RecPtr);
        goto err;
    gotheader = false;

Postgres source – tag REL9_4_26 – src/backend/access/transam/xlogreader.c:291-300

It’s a reasonably straightforward check for the validity of the XLog – another name for the transaction log or write-ahead log.

To reach the error handling code which produced the log line we were interested in, we’d need to make the if (total_len < SizeOfXLogRecord) conditional evaluate to true. That total_len field is initialised here by retrieving the first field of the XLogRecord struct.

record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
total_len = record->xl_tot_len;

Postgres source - tag REL9_4_26 - src/backend/access/transam/xlogreader.c:272-273

typedef struct XLogRecord
    uint32        xl_tot_len;        /* total len of entire record */
    TransactionId xl_xid;        /* xact id */

Postgres source - tag REL9_4_26 - src/include/access/xlog.h:41-44

We knew that this struct was being initialised with data from the WAL files on disk, so we turned there to try and figure out how we could deliberately inject a fault that reached this error handler.

We started out by looking at a WAL file generated by inserting some example data to a database.

[Click on the image to view full-size]

We could see that our data was there in plain-text, surrounded by the various other fields from the XLogRecord struct, encoded in a binary format. We knew the length field would be in there somewhere, but we weren’t sure where to find it.

That was when we had our "aha!" moment. If we inserted data which increased in length by one character at a time, we could look for part of the binary stream that was incrementing at the same time. Once we did that, the length field was staring us right in the face:

[Click on the image to view full-size]

All we needed to do was change that part of the replication data on-the-fly in our test cluster, and we’d be able to recreate point 5 from our list above.

With our new understanding of the on-disk data structure, we were able to write a small script to inject the fault we needed. The code was pretty janky, and used a regular expression to replace the correct bytes, but it did the job. This wasn’t code that would be needed for long, and it was never going anywhere near a production system.

Now we had everything we needed to test another one of our debugging leads.

Into the depths: reflections on diving through abstraction layers

I know that right now you’re desperate to hear how it went, but I want to quickly talk about something that’s easy to miss in all the technical detail: how it feels to work on an incident like this.

I didn’t think about it that way at the time, but in hindsight it struck me just how unusual an experience it was. It’s weird looking deep into the internals of a system you’ve used for years - in this case, seeing how exactly your SQL INSERT statement gets transformed into data written to disk.

Up until this incident, I’d been able to treat the WALs as an abstraction I never needed to look beneath. Sure, I’d configured Postgres to use them in various ways, but I’d never needed to know the exact format of the data stored inside them.

It’s also a strange feeling to work on something that’s really technically interesting, but with a strong background sense of urgency. Ultimately, we were trying to understand an incident that had a really serious business impact, and we needed to stop it from happening again.

You want to go off exploring every new thing you read about - especially when I found a Postgres internals ebook with a detailed chapter on write-ahead logging - but you have to rein that curiosity in and focus on making sure that outage doesn’t happen again.

The missing 6th lead: Pacemaker config

Right, back over to our work to recreate the outage.

With the extra break-wal section added to our scripts, we ran the whole thing again, and to our dismay, the Pacemaker cluster in the test setup successfully failed over to the synchronous replica. Clearly, we were still missing something.

After a lot more head scratching, one of our teammates noticed a difference between our local test cluster (running in Docker Compose) and the clusters we ran in staging and production (provisioned by Chef). We’d recently added a virtual IP address, managed by Pacemaker, which was set to run on either of the two Postgres replicas (i.e. not on the Primary). The reason we’d introduced it was so that we could take backups without adding extra load on the primary. The way you express that in Pacemaker config is through a colocation constraint:

colocation -inf: BackupVIP Primary

That constraint can roughly be translated into English as "never run the BackupVIP on the same node as the Primary Postgres instance". It turned out that adding that VIP with its colocation constraint to the test cluster was enough to reproduce the outage.

Once we were able to reliably reproduce the outage we’d seen in production on our containerised test setup, we were able to use that to test different hypotheses about how we’d configured Pacemaker.

Eventually, we found that it was the combination of the way we’d written that colocation constraint with another setting which we’d had in the cluster for much longer:

default-resource-stickiness = 100

The combination of those two pieces of config, plus a co-ordinated failure of the primary Postgres instance and crash on the synchronous replica was enough to put the cluster into a state where it would never successfully promote the synchronous replica, because Pacemaker’s constraint resolution couldn’t figure out where everything should run.

Through more experimentation, we were able to find a looser colocation constraint which didn’t confound Pacemaker’s constraint resolution code in the face of the same process crashes:

colocation -1000: BackupVIP Primary

Loosely translating this into English, it says "avoid running the BackupVIP on the Primary Postgres instance, but don’t worry if it’s there briefly".

We spent a bit more time validating that new configuration, and then copied it over to our config management system and rolled it out to production.

Narrowing down: finding the minimal set of faults

At this point, the reproduction script that had got us to our answer was simulating four conditions from the outage:

  1. The RAID 10 array on the primary node simultaneously losing 3 of its 4 disks, and becoming unusable
  2. The kernel setting the database’s filesystem read-only on that node
  3. The clustering software (Pacemaker) detecting the failure of the primary, but not failing over to a replica
  4. A crash in one of the subprocesses of the synchronous replica, which caused it to restart
  5. A suspicious log line about replication data received by the synchronous replica
  6. The BackupVIP running on the synchronous replica

We were curious which of those conditions were strictly necessary to recreate the outage, and if any of them were incidental. After selectively removing them one at a time, it turned out that the error in the replication data was a complete red herring. A minimal recreation of the incident needed only three conditions:

  • The clustering software (Pacemaker) detecting the failure of the primary, but not failing over to a replica
  • A crash in one of the subprocesses of the synchronous replica, which caused it to restart
  • The BackupVIP running on the synchronous replica

We could only laugh about how much time we’d spent in the weeds of the Postgres source code and staring at a hex editor to learn how to inject a fault that wasn’t even necessary to cause the outage. The consolation prize was that we’d learned a lot about how the database at the core of our most important services worked, which was valuable in itself.

What we learned from the outage

Naturally, the team learned a whole bunch about Postgres internals. That was cool, and very interesting to me personally as databases and distributed systems are my favourite topics in computing, but there were also some broader lessons. We could all be slightly more confident in our ability to run it well, and debug it more effectively when things went wrong.

Something I found really striking was how seemingly simple configuration can lead to complex behaviour when it interacts with other configuration. Those two pieces of Pacemaker config which worked perfectly fine by themselves - and even worked perfectly fine under simple failure conditions - seemed innocuous until we were more than a week into the investigation. Yet with just the right combination of simultaneous failures and the BackupVIP being on the synchronous replica (rather than the asynchronous one, which the constraint would also allow), they led to complete disaster.

The other one, which scares me to this day, is just how much automation can erode your knowledge. We spent the first hour of the outage trying to get the cluster automation working again, and I think a big part of that was that we’d relied on that automation for years - we hadn’t stood Postgres up by hand for a long time. If we’d been more comfortable doing that, perhaps we would have switched up our approach earlier in the incident.

It’s tricky though, because it’s somewhat inherent to building self-healing systems. The best suggestion I’ve got is to occasionally run game days to improve your team’s ability to respond to issues in production. They’re something we’d started doing about a year before this outage, but there’s only so much ground you can cover, and production systems will always find new ways to surprise you.

With the benefit of hindsight, a game day that involved standing Postgres back up without the cluster automation would have been great preparation for an outage like this, but it’s hard to predict the future.

Beyond databases: how to debug complex issues

The big thing you need is persistence. You’re going to go down many dead ends before you get to an answer, and getting there is going to mean working outside of your comfort zone. Debugging requires a different set of muscles to writing software, and it can be hard to stay motivated for long periods of time when a system is doing something that doesn’t make sense. It’s definitely something that gets better with practice.

It’s so easy to see a write-up of an issue and think the team working on it solved it in a few hours. The hardest issues are far more complex, and can take days or even weeks to understand.

I’d also like to call out something really important when working in a test environment: keep questioning whether it actually matches the production environment where the failure happened. It’s so easy to spend a bunch of time going in the wrong direction because of a difference like that.

If the issue you’re trying to debug is amenable to it, you might be better off debugging in production. You can go a long way by adding instrumentation into your running application, and can minimise risk to users with techniques like traffic shadowing and feature flags (e.g. by only enabling a buggy feature for a small handful of users). Sadly, that wasn’t an option in this case as we were trying to diagnose an issue that caused a full database outage.

Tell me more

If you’ve enjoyed reading about this outage investigation, and you want even more details, there are two places you can go:

The talk in particular covers some extra background on Postgres’ replication mechanisms, and spends longer stepping through the trial-and-error process of figuring out how to inject a fault into the replication data.

How we can get better together

Something I’m really passionate about is learning in the open. The kind of things you learn when someone talks about how they debugged a weird issue are different from what you tend to learn from more formal materials (books, courses, etc) - not to say the latter aren’t valuable too! People directly sharing their experience tend to focus on very concrete actions they took, and surround them in the context that they were working in.

I get it. Sometimes there are legal considerations when sharing work, or even just practical constraints on how much time we have to do so. There’s probably some way you can talk about a problem you solved that tonnes of people would find interesting.

It doesn’t have to be as big a commitment as writing a presentation, or even a short blog post. It can be as small as a social media post about a bug you investigated which adds some commentary.

So if you’ve made it this far, that’s what I’ll ask of you. Next time you run into a weird problem that was interesting to debug, share it somewhere. You may be surprised how many people learn something from it.

I’m going to shamelessly take an idea from the Elixir community, where people regularly use the #MyElixirStatus hashtag to talk about what they’re doing with the language. If you do post about something interesting you debugged on social media because of this article, tag it with #MyDebuggingStatus. I’ll try to keep an eye out.