Debugging Production: eBPF Chaos

Key Takeaways

  • eBPF provides access to observability data in microservice container environments that is otherwise hard to obtain.
  • Developers benefit from auto-instrumentation for performance monitoring, profiling, and tracing.
  • Tools and platforms need to be verified: breaking production in a controlled way with chaos engineering shows whether eBPF-enabled workflows actually work.
  • Security observability with eBPF is a cat-and-mouse game, with malicious actors learning to circumvent security policies.
  • Chaos engineering itself can benefit from using eBPF probes to manipulate behavior.

This article shares insights into learning eBPF as a new cloud-native technology that aims to improve observability and security workflows. The entry barriers can feel huge, and the steps to using eBPF tools for debugging in production can be many. Learn how to practice with existing tools and tackle challenges around observability storage, alerts and dashboards. The best tools are similar to a backup – if they are not verified to work, they are useless. You’ll learn how chaos engineering can help, and get an insight into eBPF-based observability and security use cases. Breaking them in a professional way also inspires new ideas for chaos engineering itself. Remaining work and risks are discussed too, followed by a wishlist of future improvements for debugging in production.

Getting started with eBPF

There are different ways to start using eBPF programs and tools, and it can feel overwhelming with the many articles and suggestions. As a first step, define a use case and a problem to solve. Would it be helpful to get more low-level monitoring metrics to troubleshoot production incidents faster? Maybe there is uncertainty about the security of a Kubernetes cluster; are there ways to observe and mitigate malicious behavior? Last but not least, consider microservice container environments, where eBPF offers a good way to inspect behavior and data – gaining access there is often complicated compared to reading files on a virtual machine running a monolithic application.

Debug and troubleshoot production: eBPF use cases

Let’s look into practical use cases and tools to get inspired for debugging situations, and also figure out how to verify that they are working properly.

Observability can benefit from eBPF through additional data collection that gets converted into metrics and traces. Low-level kernel metrics can be collected with a custom Prometheus exporter, for example. A more developer-focused approach is auto-instrumentation of source code to gain performance insights into applications; Pixie provides this functionality through a scriptable language for developers. Coroot implements Kubernetes service maps using eBPF, tracking the traffic between containers, and provides production health insights for Ops/SRE. Continuous profiling with Parca is made possible by function symbol unwinding techniques that trace calls and performance inside application code.

Security is another key area where eBPF can help: detecting unauthorized syscalls, connections that call home, or even syscalls hooked by malicious actors such as Bitcoin miners or rootkits. A few examples: Cilium provides security observability and prevention with Tetragon. Tracee from Aqua Security can be used on the CLI and in CI/CD to detect malicious behavior. Falco is a Kubernetes threat detection engine, and is likely the most mature solution in the cloud-native ecosystem. A special use case was created by the GitLab security team to scan package dependency install scripts (package.json with NodeJS, for example) for malicious behavior to help prevent supply chain attacks.

SRE/DevOps teams require the right tools to debug fast and see root causes from a more global view. Toolsets like Inspektor Gadget help in Kubernetes to trace outgoing connections, debug DNS requests, and more. Caretta can be used to visualize service dependency maps. Distributing eBPF programs in a safe way also requires new ideas: Bumblebee makes it easier by leveraging the OCI container format to package eBPF programs for distribution.

Observability challenges

There are different observability data types that help debug production incidents, analyze application performance, answer known questions, and uncover potential unknown unknowns. This requires different storage backends, leading to DIY fatigue. It gets even more complex with the additional types and sources introduced by eBPF probe data, which cannot always be converted into existing data formats. Unified observability datastores will be needed that scale to store large amounts of data.

What data do I really need to solve this incident? Troubleshoot a software regression? Finding the best retention period is hard. Self-hosting the storage backend is hard, too, but SaaS might be too expensive. Additionally, cost efficiency and capacity planning will be needed. This can also help estimate the storage growth for Observability data in the future. The GitLab infrastructure team built the open-source project Tamland to help with capacity planning and forecasting.

Gaining more benefits and insights from eBPF data also requires integration into alerts and dashboards. How can we reduce the number of public-facing incidents, and detect and fix problems with the additional insights available? The overall health state, cross-references to incident management data, and “all production data” are needed to run anomaly detection, forecasts and trend calculations – because in the world of microservices, there is no simple answer to “Is my service OK?” anymore.

A green dashboard with all thresholds showing “OK” does not prove that alerts are working, or that critical situations can be resolved fast enough from dashboard insights. The data collection itself might be affected or broken too: an eBPF program not behaving as intended, or a quirky kernel bug hitting production. This brings new and old ideas together to simulate production problems: chaos engineering. Break things in a controlled way, then verify the service level objectives (SLOs), alerts and dashboards. The chaos frameworks can be extended with your own experiments, for example Chaos Mesh for cloud-native chaos engineering, or the Chaos Toolkit integrated into CI/CD workflows. The patterns from chaos engineering can also be used to inject unexpected behavior and run security tests, because everything behaves differently under random chaos, even security exploits.
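
As a minimal sketch of wiring such an experiment into a CI/CD job, the Chaos Toolkit CLI can be driven from a plain shell step. The experiment file below is a placeholder you would define yourself, typically a steady-state probe plus a fault action:

# Install the Chaos Toolkit CLI in the job environment
pip install chaostoolkit

# Run the experiment; a violated steady-state hypothesis should make the
# command exit non-zero and fail the pipeline job
chaos run experiment.json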

Let’s break eBPF Observability

Tools and platforms based on eBPF provide great insights and help debug production incidents. These tools and platforms need to prove their strengths and unveil their weaknesses, for example by attempting to break or attack the infrastructure environment and observing how the tool or platform behaves. As a first step, let’s focus on observability and chaos engineering. The Golden Signals (latency, traffic, errors, saturation) can be verified using existing chaos experiments that inject CPU/memory stress tests, TCP delays, DNS random responses, etc.
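
For a first round of experiments on a test system, plain stress-ng runs are often enough to exercise the saturation-related signals. A rough sketch, with workers and durations to be tuned to your environment:

# CPU saturation: four busy workers for 60 seconds
stress-ng --cpu 4 --timeout 60s

# Memory pressure: two workers allocating 1 GiB each
stress-ng --vm 2 --vm-bytes 1G --timeout 60s

# Disk I/O pressure
stress-ng --io 4 --timeout 60s

While these run, the latency, errors and saturation panels and their alerts should visibly react; if they do not, that is a finding in itself.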

Another practical example is the collection of low-level system metrics covering CPU, IO and memory. This can be done with an exporter that turns eBPF events into Prometheus metrics. To verify the received metrics, existing chaos experiments for each type (CPU, IO, memory stress, delays) help show how the system behaves and whether the collected data is valid.

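A quick way to check that the exporter reflects injected load is to compare its metrics endpoint before and during a stress run. This sketch assumes an exporter such as Cloudflare’s ebpf_exporter listening on its default port; adjust the port and the metric names to your configuration:

# Baseline scrape of the exporter's Prometheus endpoint
curl -s http://localhost:9435/metrics | head

# Inject CPU load, then scrape again and compare the relevant counters
stress-ng --cpu 4 --timeout 30s
curl -s http://localhost:9435/metrics | head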

Developers can benefit from Pixie, which provides auto-instrumentation of application code and also creates service maps inside a Kubernetes cluster. To verify that the maps show correct data and the traces show performance bottlenecks, add chaos experiments that run stress tests and network attacks. It then becomes possible to see specifically how the service maps and traces change over time, and to act on identified problematic behavior before a production incident with user-facing problems unveils it.

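One way to make such changes visible in the service map and traces is to add delay inside a target workload with tc. This is a sketch only: the namespace and deployment name are placeholders, the container image must ship the tc binary, and the pod needs the NET_ADMIN capability:

# Add 200ms of delay to all egress traffic of the target workload
kubectl exec -n demo deploy/checkout -- tc qdisc add dev eth0 root netem delay 200ms

# ... watch how the service map and traces change in Pixie ...

# Remove the delay again
kubectl exec -n demo deploy/checkout -- tc qdisc del dev eth0 root netem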

For SREs, Kubernetes troubleshooting can be helped by installing the Inspektor Gadget tool collection to overcome the limitations of container runtime black boxes. Inspektor Gadget uses eBPF to collect events and metrics from container communication paths, and maps low-level Linux resources to high-level Kubernetes concepts. There are plenty of gadgets available for DNS, network access, and out-of-memory analysis. The tool’s website categorizes them into advise, audit, profile, snapshot, top and trace gadgets. You can even visualize the usage and performance of other running eBPF programs using the “top ebpf” gadget. The recommended way of testing their functionality is to isolate a tool or command and run a matching chaos experiment, for example a DNS chaos experiment that returns random or no responses to DNS requests.
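
A sketch of pairing one gadget with a matching fault, assuming the kubectl-gadget plugin is installed (gadget names can vary slightly between Inspektor Gadget versions), and using a crude node-level DNS drop as the fault:

# Watch DNS requests across the cluster with the dns trace gadget
kubectl gadget trace dns

# In another shell, on a test node (never in production), drop outgoing DNS
sudo iptables -A OUTPUT -p udp --dport 53 -j DROP

# ... confirm the gadget shows the failing lookups, then clean up ...
sudo iptables -D OUTPUT -p udp --dport 53 -j DROP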

More visual Kubernetes troubleshooting and observability can be achieved by installing Coroot and using its service map auto-discovery feature. The visual service map uses eBPF to trace container network connections, and aggregates individual containers into applications using metrics from the kube-state-metrics Prometheus exporter. The service map in Coroot is a great target for chaos experiments – consider simulating broken or delayed TCP connections that make services vanish from the service map, or increase the bandwidth with a network attack simulation and verify the dashboards under incident conditions. OOM kills can be detected by Coroot too, using the underlying Prometheus monitoring metrics – a perfect candidate for applications that leak memory. A demo application that leaks memory, but only when DNS fails, is available in my “Confidence with Chaos for your Kubernetes Observability” talk to test this scenario specifically.

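To give Coroot an OOM kill to detect, a throwaway pod that allocates more memory than its limit is enough. A minimal sketch, assuming a container image that has stress-ng installed (the image name is a placeholder):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
spec:
  restartPolicy: Never
  containers:
  - name: leak
    image: your-registry/stress-ng:latest   # placeholder image with stress-ng
    command: ["stress-ng", "--vm", "1", "--vm-bytes", "512M", "--timeout", "120s"]
    resources:
      limits:
        memory: "256Mi"
EOF

# The kernel OOM-kills the container once it exceeds its 256Mi limit;
# the kill should then show up in Coroot's dashboards and metrics
kubectl get pod oom-demo -w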

Continuous Profiling with Parca uses eBPF to auto-instrument code, so that developers don’t need to modify the code to add profiling calls, helping them to focus. The Parca agent generates profiling data insights into callstacks, function call times, and generally helps to identify performance bottlenecks in applications. Adding CPU/Memory stress tests influences the application behavior, can unveil race conditions and deadlocks, and helps to get an idea of what we are actually trying to optimize.

OpenTelemetry specifies metrics as a data format alongside traces and logs. There is a new project that provides eBPF collectors gathering data from the kernel, inside a Kubernetes cluster, or on a hyperscaler cloud. The different collectors send the metric events to the Reducer (data ingestor), which either exposes the metrics as a scrape endpoint for Prometheus or sends them via gRPC to an OpenTelemetry collector endpoint. Chaos experiments can be added in the known ways: stress-test the systems to see how the metrics change over time.


Last but not least – some use cases involve custom DNS servers running as eBPF programs in high-performance networks. Breaking DNS requests can help shed light on their behavior too – it is always DNS.

Changing sides: breaking eBPF security

Let’s change sides and try to break the eBPF security tools and methods. One way is to inject behavioral data that simulates privilege escalation, and observe how the tools react. Another idea involves exploiting multi-tenancy environments that require data separation, by simulating unwanted access.

Impersonating the attacker is often hard, and when someone mentioned a “tracing syscalls, hunting rootkits” event, it got my immediate attention. There are a few results when searching for Linux rootkits, and it can be helpful to understand their methods to build potential attack impersonation scenarios. Searching the internet for syscall hooking leads to more resources, including a talk by the Tracee maintainers about hunting rootkits with Tracee, with practical insights using the Diamorphine rootkit.

Before you continue reading and trying the examples, don’t try this in production. Create an isolated test VM, download and build the rootkit, and then load the kernel module. It will hide itself and do everything to compromise the system. Delete the VM after tests.

Running the Tracee CLI was able to detect the syscall hooking. The following command runs Tracee in Docker, in a privileged container that requires a few mapped paths and variables as well as the event to trace, 'hooked_syscalls':

$ docker run \
  --name tracee --rm -it \
  --pid=host --cgroupns=host --privileged \
  -v /etc/os-release:/etc/os-release-host:ro \
  -v /sys/kernel/security:/sys/kernel/security:ro \
  -v /boot/config-`uname -r`:/boot/config-`uname -r`:ro \
  -e LIBBPFGO_OSRELEASE_FILE=/etc/os-release-host \
  -e TRACEE_EBPF_ONLY=1 \
  aquasec/tracee:0.10.0 \
  --trace event=hooked_syscalls

The question is how to create a chaos experiment from a rootkit. Installing the rootkit itself is not a reliable chaos test for production environments, but the getdents syscall hooking method used by the Diamorphine rootkit could be simulated in production to verify whether alarms are triggered.

Cilium Tetragon works in a similar way, detecting malicious behavior with the help of eBPF, and it revealed new insights into the rootkit’s behavior. The detection and rules engine was able to show that the rootkit’s overnight activities expanded to spawning processes with random names on a given port.

$ docker run --name tetragon \
   --rm -it -d --pid=host \
   --cgroupns=host --privileged \
   -v /sys/kernel/btf/vmlinux:/var/lib/tetragon/btf \
   quay.io/cilium/tetragon:v0.8.3 \
   bash -c "/usr/bin/tetragon"


Let’s imagine a more practical scenario: bitcoin miner malware that runs in cloud VMs, in Kubernetes clusters in production, and in CI/CD deployment environments for testing, staging, etc. Detecting these patterns is one part of the problem – intrusion prevention is another, as yet unanswered, question. Installing the rootkit as a production chaos experiment is still not recommended – but mimicking the syscall loading overrides in an eBPF program can help with testing.

This brings me to a new proposal: create chaos test rootkits that do nothing but simulate. For example, hook into the getdents syscalls used for directory listings, and then verify that all security tools detect the simulated security issue. If possible, simulate more hooking attempts from previously learned attacks. This could also be an interesting use case for training AI/ML models, and provide additional simulated attacks to verify eBPF security tools and platforms.
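
A harmless stand-in for “something is attached to getdents” can be as small as a bpftrace one-liner that only observes the syscall. It hides nothing and changes nothing, but it should show up in tooling that lists active eBPF programs, such as the “top ebpf” gadget mentioned earlier:

# Read-only probe on getdents64: counts directory listings per process.
# A simulation aid for detection tests, not a rootkit - nothing is hidden.
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_getdents64 { @calls[comm] = count(); }'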

How Chaos engineering can benefit from eBPF

While working on my QCon London talk, I thought of eBPF as a way to collect and inject data for chaos experiments. If eBPF allows us to access low-level in-kernel information, we can also change the data and simulate production incidents. There is a research paper, “Maximizing error injection realism for chaos engineering with system calls”, introducing the Phoebe project. It captures and overrides system calls with the help of eBPF.

Existing chaos experiments happen at the user level. DNS chaos in Chaos Mesh, for example, is injected into CoreDNS, which handles all DNS requests in a Kubernetes cluster. What if there was an eBPF program sitting at the kernel level that hooks into DNS requests before they reach user space? It could perform DNS request analysis and inject chaos by returning wrong responses to resolver requests. Some work has already been done with the Xpress DNS project, an experimental DNS server written in BPF for high-throughput, low-latency DNS responses. The user-space application can add or change DNS records in a BPF map which is read by the kernel eBPF program. This can be an entry point for a new chaos experiment with DNS and eBPF.
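
If such an experiment is built around a BPF map shared between user space and the kernel program, bpftool can be used to find and inspect that map while the experiment runs. The map id below is a placeholder for whatever the list command shows on your system:

# List all BPF maps currently loaded in the kernel
sudo bpftool map list

# Dump the contents of a specific map (replace 42 with the id from the list)
sudo bpftool map dump id 42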

Summarizing the ideas around eBPF chaos injection: new chaos experiments can simulate rootkit behavior and call-home traffic to verify security observability and enforcement. Intercepting traffic to cause delays or wrong answers for TCP/UDP and DNS, together with CPU stress, can help verify reliability and observability.

Chaos eBPF: we’ve got work to do

The advantages and benefits of eBPF sound clear, but what is missing, and where do we need to invest in the future? Which risks should we be aware of?

One focus area is DevSecOps and the SDLC: treating eBPF programs like any other code that needs to be compiled, tested, validated with code quality checks and security scanning, and checked for potential performance problems. We also need to avoid potential supply chain attacks. Given the complex nature of eBPF, users will follow installation guides and may apply curl | bash command patterns without verifying what will happen on a production system.
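
A small habit that helps: download the installer first, verify a published checksum, and skim the script before executing it, instead of piping curl straight into bash. The URLs below are placeholders:

# Download the install script and its published checksum (placeholder URLs)
curl -fsSLO https://example.com/tool/install.sh
curl -fsSLO https://example.com/tool/install.sh.sha256

# Verify integrity and skim the script before running it
sha256sum -c install.sh.sha256
less install.sh
bash install.sh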

Testing eBPF programs automatically in CI/CD pipelines is tricky, because the kernel verifies eBPF programs at load time and rejects potentially unsafe programs. There are attempts to move the eBPF verifier outside of the kernel and allow testing eBPF programs in CI/CD.

There are risks with eBPF, and one is clear: it grants root-level access to everything in the kernel. You can hook into TLS-encrypted traffic just after the TLS library function calls, where the raw plaintext is available. There are also real-world exploits, rootkits and vulnerabilities that use eBPF to bypass eBPF-based defenses. Some research has been conducted into special programming techniques for exploits and data access that go undetected by eBPF security enforcement tools. The cat-and-mouse game will continue …

A wishlist for eBPF includes:

  • Sleepable eBPF programs, to pause the context, and continue at a later point (called “Fiber” in various programming languages).
  • eBPF programs that observe other eBPF programs for malicious behavior, similar to monitor-the-monitor problems in Ops life.
  • More getting started guides, learning resources, and also platform abstraction. This reduces the entry barrier into this new technology so that everyone can contribute.

Conclusion

eBPF is a new way to collect observability data; it helps with network insights as well as security observability and enforcement, and it benefits debugging of production incidents. Chaos engineering helps verify observability and eBPF programs, and new ideas for eBPF probes in chaos experiments will allow the field to iterate faster. Additionally, we benefit from more data sources beyond traditional metrics monitoring – to correlate, verify and observe production environments. This helps the path to DataOps and MLOps/AIOps – AllOps.

Developers can benefit from auto-instrumentation for Observability driven development; DevOps/SRE teams verify reliability with chaos engineering, and DevSecOps will see more cloud-native security defaults. eBPF program testing and verification in CI/CD is a big to-do, next to bringing all ideas upstream and lowering the entry barrier to using and contributing to eBPF open-source projects.

Thinking Deductively to Understand Complex Software Systems

Key Takeaways

  • We use more than one kind of logic in everyday life and when writing code.
  • Certain kinds of logic are unintuitive in abstract situations, meaning we can miss opportunities to reason logically about code.
  • Much of the power of tests is that they let us apply logical reasoning automatically, even in situations that are too abstract to think intuitively about.
  • We can use tests to analyse code dynamically (by running it), which can be quicker and more effective than a static code analysis.
  • These techniques are valuable for understanding software with a high degree of accumulated complexity.

This article discusses a number of debugging techniques as well as some of the theories involved in the logic of some common and less common software testing techniques. The main goal is to think through the role of tests in helping you understand complex code, especially in cases where you are starting from a position of unfamiliarity with the code base.

I think most of us would agree that tests allow us to automate the process of answering a question like “Is my software working right now?”. Since the need to answer this question comes up all the time, at least as frequently as you deploy, it makes sense to spend time automating the process of answering it. However, even a large test suite can be a poor proxy for this question since it can only ever really answer the question “Do all my tests pass?”. Fortunately, tests can be useful in helping us answer a larger range of questions. In some cases they allow us to dynamically analyse code, enabling us to glean a genuine understanding of how complex systems operate that might otherwise be hard won.

We will look at some code examples shortly, but first, we need to think about a simple, but famous, puzzle.

Here we have some cards with numbers on one side but the backs are different colours for some reason; maybe I mixed up the decks. I tell you that all the even-numbered cards are red on the back. How can you check that? This puzzle was devised by Peter Wason in his paper, Reasoning about a Rule.

When presented with this puzzle pretty much everyone spots that it would be a good idea to turn over the even-numbered card to see if it’s red on the back. The trick that a lot of people miss, however, is that turning over the card with the brown side showing to check that it has an odd number on the other side is just as valid. In other words, you should check the negative case as well as the more intuitive positive case.

The interesting thing that the original study showed is that one reason people often miss the trick is that our reasoning relies a lot on real-world intuition. Few people miss the negative cases if the puzzle is posed using everyday items in a way that puts money or fairness at stake. That tells us something that I think should be relevant to all programmers: When thinking in the abstract, as is usual with software, it’s easy for the logic of the situation to get lost, especially when dealing with negative cases.

Another lesson from Wason’s selection puzzle is that there are two quite different ways in which people can think. One way, called inductive thinking, looks for positive cases with something in common, such as several even-numbered cards with red backs, and attempts to draw general conclusions – often the conjecture that the pattern applies in all cases. The alternative is deductive thinking, which in some cases can be used to argue by contradiction, e.g. if there is a brown card with an even number on the other side, then not all the even-numbered cards are red on the back. This kind of logic can sometimes be unintuitive.

As programmers, I think we’ll take any kind of inference that works for us. We don’t need to be academic about it, but if we’re missing the cases where we could argue by contradiction we may be missing some opportunities to use deduction to better understand our code. How to apply deductive arguments by contradiction is the thing I want us to think through in this article. This takes us straight back to tests because the logic of tests often has the form of an argument by contradiction. Time for an example.

If you’ll excuse the pseudocode, I’m trying here to display the name of a user on a web page, let’s say it’s their profile page.

String formatUserName(User user) {
    return user.firstName.formatName() +
        user.lastName.formatName();
}

This code works fine, but now we’re asked to support users with middle names so I change the code to handle that case.

String formatUserName(User user) {
    return user.firstName.formatName() +
        user.middleName.formatName() +
        user.lastName.formatName();
}

I’m a diligent professional so before I ship my change I run up the app in a test environment and check that the middle name now appears on the user profile page. In this case, I see the name John Peter McIntyre on the profile page, so job done, ship it.

Unfortunately, while the page works for the user I tested, it now returns 500 for 90% of other users. These are the ones who don’t have a middle name, so for them user.middleName is null, and trying to format it throws an exception. I don’t mean to suggest that you are silly enough to make this mistake, though I think most of us would admit to having made similar ones in the past, but the thought process we follow to spot this mistake before we make it is an argument by contradiction. The case of someone without a middle name invalidates the general rule that the code asserts.

Fortunately, I wasn’t actually so cavalier with my client’s website. I have some tests from when I wrote the original code and when I ran them before deploying the code this one failed:

testFormattingNameOfUserWithFirstNameAndLastName() {
    User user = new User("John", null, "McIntyre");
    String name = user.formatUserName();

    assertEqual("John McIntyre", name);
}

We didn’t have to argue by contradiction, our test did it for us. The test suite as a whole argued by contradiction in hundreds of different cases, saving us a lot of effort and Wason’s selection puzzle tells us why this is so valuable; we’re not very good at thinking deductively in abstract cases, but our tests can do that for us.

We’ve just looked at a simple example, but we can apply similar ideas when trying to understand much more complex code. Let’s suppose that we get a bug report saying that on the profile page, the middle name and the last name are appearing in the wrong order, so we actually get John McIntyre Peter. We’ve just fixed the formatting code — and we know it’s tested — so the problem must be elsewhere. It might be time to start digging through code, potentially starting with the input data and following it all the way through to the UI. It’s looking like a long day of debugging ahead of us. However, there might be a simpler way; if we can just find an appropriate test that processes the user name data, it could help us understand the situation. Tests can be useful in this kind of situation, even if they’re currently passing, because you can see the inputs and outputs of the relevant part of the system and they execute a subset of the code so the problem is often a narrower one.

So it seems we just need to look through the tests, which might not take too long. Of course, we’re assuming that the tests for data input are with the data input code, the tests for persistence are with the persistence code, and the tests for business logic are with the business-level code, but before you get too confident I’m going to be mean and introduce a bit more reality into the scenario.

How do you organise your tests? Perhaps you read a book about software development and it told you to organise your tests as a pyramid, like this:

This implies that while most of the tests you write are unit tests, you also have a fair number of integration tests that test more than one part of the system together, and your high-level tests that run the whole system are restricted to a few that focus on particular areas of value. If you follow this pattern then it’s probably not too hard for you to look through your tests to find the one we need to help us fix our bug. Most of your tests will probably be in predictable places and most of them will be simple enough to quickly decide if they’re relevant.

However in reality this pattern is often not followed. I don’t want to be too normative about it since there are reasonable people on the internet who think the right-side-up pyramid is not optimal and some even advocate an upside-down pyramid. Let’s just acknowledge that you may well be dealing with a situation like this:

In this case, the test you’re looking for might not be close to the code it’s testing and a quick inspection might not be enough to understand each test. That’s the problem in front of us. We’d like to find a test that will help us track down our bug, but we’re dealing with a complex system that no single person understands completely. Or, we might be relatively new to the project so we don’t have much understanding of the system yet, and trying to understand it is exactly the problem we’re trying to solve.

Here’s the trick: we’re going to break the code, just slightly. Let’s rewrite our formatting function:

String formatUserName(User user) {
    return user.firstName.formatName();
}

We know our unit test won’t pass now, but that’s ok. What’s really going to give us insight into the system is running the entire suite of tests. If those tests form an upside-down pyramid, that could take a few minutes, but it’s still much faster than reading all the tests. When the tests have finished running, some of them will have failed because we broke the code. Those tests will be the ones that process user name data, which is great because we know there’s a bug in the user name code. It turns out this integration test failed:

testSavingAndRetrievingUser() {

    SaveUserRequest request = new SaveUserRequest("{
        "id": 1234,
        "user": {
            "firstName": "John",
            "lastName": "McIntyre"
        }
    }");

    UserSystem system = new UserSystem(
        new Logging(),
        new UserPersistence(),
        new BusinessLogic(),
        new FakeExternalDependencyClient());

    system.handleSaveUserRequest(request);

    ReadUserResponse response = system.handleReadUserRequest(
        new ReadUserRequest("{ "id": 1234 }")
    );

    User user = User.parseUser(response.body);

    assertEqual("John McIntyre", user.formatUserName());
}

It’s not the most readable test — I’ve seen much worse in practice — but it shows us all the components that go into the process. Let’s look at UserPersistence.

class UserPersistence {

    DataSource dataSource = new DataSource();

    saveUser(User user) {
        dataSource.execute("
            INSERT INTO user SET
                user.firstName,
                user.lastName,
                user.middleName;
        ")
    }
}

And we found the bug! The middle name and the last name are persisted in the wrong order. We can see now that we don’t have a test that tests the persistence of the middle name so we can make sure that’s tested now. Also don’t forget to put back the formatting code we broke!

So breaking the code slightly brought us straight to a test that helped us understand the buggy part of the system and fix the problem. The tests helped us do more than inspect the code statically; we performed a dynamic analysis that identified relevant parts of the code. I’m not sure how well-known this technique is, but I’ve explained it to enough experienced developers to believe it’s underused. It’s also a great example of using negative cases to understand complex software systems.
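
If it helps to see the mechanics, the whole technique is a couple of shell commands wrapped around whatever build tool the project uses. Maven appears here purely as an example:

# 1. Deliberately break the code path under investigation, then confirm
#    the break is the only change in the working tree
git diff

# 2. Run the whole test suite and capture the output
#    (substitute your build tool: gradle test, npm test, pytest, ...)
mvn test | tee test-output.txt

# 3. The failing tests are the ones that exercise the broken code path
grep -i "FAIL" test-output.txt

# 4. Restore the original code when you're done
git checkout -- .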

I’m not suggesting that we all abandon inductive thinking and rely solely on deduction. We’re not machines and inductive thinking comes naturally to us even if, as we saw in the first example, it can sometimes lead to disaster. But understanding where we can take advantage of the logic of arguing by contradiction can really help when developing software.

As we saw in the first example, a test that was originally written to confirm the correctness of a particular feature becomes a negative test for subsequent features that relate to the same code, potentially telling us if our new code doesn’t satisfy the original requirement. Repeatedly subjecting our codebase to attempts to falsify our implementation, by running the suite of all previously written tests, gives us more confidence in our implementation than we would get if each test only applied to the case that it was written for.

Software is complex, often to the extent that it’s impossible to understand in totality, so having the ability to dynamically analyse code using tests can lead us to answers much more quickly than a static inspection of the code itself. Few of us study logic on the way to becoming great programmers, but a little understanding can help elucidate some common and less common programming techniques.

Debugging Outside Your Comfort Zone: Diving Beneath a Trusted Abstraction

Key Takeaways

  • Even unfamiliar parts of our software stacks are ultimately just software that we can turn our debugging skills to – there’s no magic.
  • Simple config which works fine in normal situations can lead to disaster in the presence of coordinated failures.
  • Automated orchestration is a useful and necessary part of running production systems, but does reduce our understanding of how those systems work.
  • When you’re trying to reproduce an issue in any kind of test environment, always remember to question how well that environment matches production.
  • There are some lessons better learned from field reports than textbooks, so please share your debugging stories!

Today we’re going to take a deep dive through a complex outage in the main database cluster of a payments company. We’ll focus on the aftermath of the incident – the process of understanding what went wrong, recreating the outage in a test cluster, and coming up with a way to stop it from happening again. Along the way, we’ll dive deep into the internals of Postgres, and learn a little about how it stores data on disk.

What happened

This outage wasn’t subtle. The database that powered all the core parts of our product – a payments platform – was down. We very quickly went from everything working to nothing working, and the pager was about as loud as you’d expect.

Knowing what was broken was easy. An incident like this produces plenty of alerts that point you in the right direction. When we logged onto the primary node in our Postgres cluster, we saw that the RAID array which held the data directory had failed. Not great, but it shouldn’t have caused an outage like this, as we were running a three-node cluster of Postgres servers managed by Pacemaker.

What was more difficult to track down was why Pacemaker hadn’t failed over to one of the replicas like we expected. We spent some time looking through its logs and using its CLI to try and get it to retry the failover, but we couldn’t manage to recover it.

Restoring the system to a working state

We spent about an hour trying to get the cluster automation working again, which is already a long outage for a payments platform. At that point, we had to change tack, and decided to ditch the Pacemaker cluster automation and promote one of the replicas to primary by hand.

When I say “by hand”, I mean it. We SSH’ed into the two remaining replicas, switched off the Pacemaker cluster, and promoted one of them to be a new primary. We reconfigured our application servers to connect to that server, which brought the site back up.

With the product working again, we could at least briefly breathe a sigh of relief, but we knew that we were going to have to figure out what happened with the cluster automation and how to fix it.

Investigating and recreating the outage

This ended up being the most involved outage of my career so far – hopefully it stays that way.

All in all, we spent around two weeks figuring out what went wrong and how to fix it. Since this was such a high impact issue, we were really keen to understand exactly what happened and fix it.

We split our Core Infrastructure team of five into two sub-teams: one focused on recreating the incident in an isolated test setup, so that we could make changes to prevent it from happening in future, and the other focused on safely restoring us from our now hand-managed cluster back to a setup managed by the clustering software.

One thing that made it tricky was that we needed a test environment which would let us rapidly perform experiments to test different hypotheses. We did get slightly lucky there: only a few months before, we’d been working on improvements to our cluster automation setup, and as part of that we’d put together a Docker Compose version of the cluster setup.

We settled on using that containerised setup to run our experiments. It made it easy to spin up a fresh environment, inject different combinations of faults into it to try and reproduce the outage, and then tear it all down again in a matter of a minute or so.
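
Conceptually, each experiment run looked something like the following sketch; the service names and the fault-injection script stand in for our internal setup:

# Bring up a fresh cluster as defined in docker-compose.yml
docker compose up -d

# Inject a combination of faults (placeholder for our internal scripts)
./inject-faults.sh

# Check whether Pacemaker promoted a replica, and collect the logs
docker compose logs postgres-primary postgres-sync-replica

# Throw the whole environment away and start over
docker compose down -v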

From there, we gathered as much information as we could from the production logs we captured on the day of the incident. Within them, we found five potential avenues of investigation:

  1. The RAID 10 array on the primary node simultaneously losing 3 of its 4 disks, and becoming unusable
  2. The kernel setting the database’s filesystem read-only on that node
  3. The clustering software (Pacemaker) detecting the failure of the primary, but not failing over to a replica
  4. A crash in one of the subprocesses of the synchronous replica, which caused it to restart
  5. A suspicious log line about replication data received by the synchronous replica

From there it was a case of writing scripts to inject the same faults into our test cluster. While those scripts weren’t written with the intention of being easy to run by people outside of the team, you can see a slightly tidied up version of that script here if you’re curious exactly what we did.

It took us a lot of trial-and-error, but by making small adjustments to the faults we were injecting, we eventually arrived at that script and were able to reproduce the exact same failure we’d seen in production.

The starting point: low-hanging fruit

When you’re presented with several different leads in an issue you’re debugging, you need some way of narrowing down and focusing on the ones that are most likely to get you to an answer in a reasonable amount of time.

There are plenty of more subtle factors involved, but I find that a good rule of thumb for evaluating an avenue of debugging is to consider how likely it is for that avenue to be a contributing factor in an issue, and how much effort it will take to investigate.

The best leads to pursue are high likelihood and low effort. Even if they end up being wrong, or just insufficient in isolation, you don’t waste much time looking into them. Conversely, you’ll want to move anything with a low likelihood of contributing to the incident and high effort to recreate lower down the list. Everything else lives somewhere in the middle, and prioritising those is slightly more down to gut feeling.

In our case, we believed it was highly unlikely that the precise way in which the primary failed (the RAID array breaking) was a necessary condition for automated failover not to work in our Pacemaker setup. It was also high effort to recreate – potentially well beyond any reasonable effort (it could have been down to a firmware bug in a RAID controller, or a brief dip in power in the machine). That left us with a list that looked like this:

  1. The RAID 10 array on the primary node simultaneously losing 3 of its 4 disks, and becoming unusable
  2. The kernel setting the database’s filesystem read-only on that node
  3. The clustering software (Pacemaker) detecting the failure of the primary, but not failing over to a replica
  4. A crash in one of the subprocesses of the synchronous replica, which caused it to restart
  5. A suspicious log line about replication data received by the synchronous replica

Avenues 3-5 all seemed relatively likely contributors, but 5 was going to be high effort, so we started with 3 and 4.

The first version of the script we wrote to try and reproduce the outage did just two things after setting up the database cluster:

# on primary - forceful kill of main process
# (very rough approximation of hard failure)
kill -SIGKILL 

# on synchronous replica - subprocess crash
# (same signal as seen in production logs)
kill -SIGABRT 

Sadly, it wasn’t going to be that easy – those two actions weren’t enough to reproduce the outage. We’d need to turn our attention to lead number 5 – the suspicious message we’d seen about a replication issue.

The logs that had us scratching our heads

There was a pair of log lines that really jumped out at us when we were reviewing all the logs we’d pulled from production on the day of the incident, and had us stumped for a while:

2023-02-24 17:23:01 GMT LOG: restored log file "000000020000000000000003" from archive

2023-02-24 17:23:02 GMT LOG: invalid record length at 0/3000180


One of the Postgres replicas was complaining about some data it had received via replication (specifically through the WAL archive/restore mechanism, which we were using alongside streaming replication). We thought that might be part of what caused the clustering software not to fail over to a replica when the primary failed, and we wanted to understand exactly what was causing this error to be logged on the replica.

Figuring out exactly what caused it meant diving deep into the Postgres internals, way below the abstractions we’d taken for granted in many years of running it in production. Specifically, we needed to figure out how to take a chunk of the write-ahead logs (WALs), break it in a very specific way, and inject it into the synchronous replica. If we saw the same log line, we’d know we’d done it right.

After a mixture of looking through the Postgres source code, and staring at WAL files from our test cluster in a hex editor, we were able to piece together enough information to know what we needed to add to our reproduction script. Before we look at the script itself, let’s take a quick run through what we found.

First off, we found the part of the Postgres source code where that log line was being emitted:

{
    /* XXX: more validation should be done here */
    if (total_len < SizeOfXLogRecord)
    {
        report_invalid_record(state, "invalid record length at %X/%X",
                              (uint32) (RecPtr >> 32), (uint32) RecPtr);
        goto err;
    }
    gotheader = false;
}

Postgres source – tag REL9_4_26 – src/backend/access/transam/xlogreader.c:291-300

It’s a reasonably straightforward check for the validity of the XLog – another name for the transaction log or write-ahead log.

To reach the error handling code which produced the log line we were interested in, we’d need to make the if (total_len < SizeOfXLogRecord) conditional evaluate to true. That total_len field is initialised here by retrieving the first field of the XLogRecord struct.

record = (XLogRecord *) (state->readBuf + RecPtr % XLOG_BLCKSZ);
total_len = record->xl_tot_len;

Postgres source - tag REL9_4_26 - src/backend/access/transam/xlogreader.c:272-273

typedef struct XLogRecord
{
    uint32        xl_tot_len;        /* total len of entire record */
    TransactionId xl_xid;        /* xact id */
    …

Postgres source - tag REL9_4_26 - src/include/access/xlog.h:41-44

We knew that this struct was being initialised with data from the WAL files on disk, so we turned there to try and figure out how we could deliberately inject a fault that reached this error handler.

We started out by looking at a WAL file generated by inserting some example data into a database.

[Figure: hex dump of a WAL file after inserting example data into a test database]

We could see that our data was there in plain-text, surrounded by the various other fields from the XLogRecord struct, encoded in a binary format. We knew the length field would be in there somewhere, but we weren’t sure where to find it.

That was when we had our "aha!" moment. If we inserted data which increased in length by one character at a time, we could look for part of the binary stream that was incrementing at the same time. Once we did that, the length field was staring us right in the face:

[Figure: hex dump of WAL records in which the length field increments in step with the inserted data]
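
If you want to reproduce this kind of exploration on a scratch instance, the incremental-insert trick needs nothing more than psql and a hex dump. A sketch, assuming a Postgres 9.x layout (the WAL directory is pg_xlog there, pg_wal on version 10 and later) and a throwaway database and table:

# Insert values that grow by one character at a time
for i in 1 2 3 4 5; do
  psql -d scratch -c "INSERT INTO t (v) VALUES ('$(printf 'x%.0s' $(seq 1 $i))');"
done

# Dump the current WAL segment and look for the field that increments in
# step with the growing payload (the segment name will differ on your system)
hexdump -C "$PGDATA/pg_xlog/000000010000000000000001" | less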

All we needed to do was change that part of the replication data on-the-fly in our test cluster, and we’d be able to recreate point 5 from our list above.

With our new understanding of the on-disk data structure, we were able to write a small script to inject the fault we needed. The code was pretty janky, and used a regular expression to replace the correct bytes, but it did the job. This wasn’t code that would be needed for long, and it was never going anywhere near a production system.

Now we had everything we needed to test another one of our debugging leads.

Into the depths: reflections on diving through abstraction layers

I know that right now you’re desperate to hear how it went, but I want to quickly talk about something that’s easy to miss in all the technical detail: how it feels to work on an incident like this.

I didn’t think about it that way at the time, but in hindsight it struck me just how unusual an experience it was. It’s weird looking deep into the internals of a system you’ve used for years - in this case, seeing how exactly your SQL INSERT statement gets transformed into data written to disk.

Up until this incident, I’d been able to treat the WALs as an abstraction I never needed to look beneath. Sure, I’d configured Postgres to use them in various ways, but I’d never needed to know the exact format of the data stored inside them.

It’s also a strange feeling to work on something that’s really technically interesting, but with a strong background sense of urgency. Ultimately, we were trying to understand an incident that had a really serious business impact, and we needed to stop it from happening again.

You want to go off exploring every new thing you read about - especially when I found a Postgres internals ebook with a detailed chapter on write-ahead logging - but you have to rein that curiosity in and focus on making sure that outage doesn’t happen again.

The missing 6th lead: Pacemaker config

Right, back over to our work to recreate the outage.

With the extra break-wal section added to our scripts, we ran the whole thing again, and to our dismay, the Pacemaker cluster in the test setup successfully failed over to the synchronous replica. Clearly, we were still missing something.

After a lot more head scratching, one of our teammates noticed a difference between our local test cluster (running in Docker Compose) and the clusters we ran in staging and production (provisioned by Chef). We’d recently added a virtual IP address, managed by Pacemaker, which was set to run on either of the two Postgres replicas (i.e. not on the Primary). The reason we’d introduced it was so that we could take backups without adding extra load on the primary. The way you express that in Pacemaker config is through a colocation constraint:

colocation -inf: BackupVIP Primary

That constraint can roughly be translated into English as "never run the BackupVIP on the same node as the Primary Postgres instance". It turned out that adding that VIP with its colocation constraint to the test cluster was enough to reproduce the outage.

Once we were able to reliably reproduce the outage we’d seen in production on our containerised test setup, we were able to use that to test different hypotheses about how we’d configured Pacemaker.

Eventually, we found that it was the combination of the way we’d written that colocation constraint with another setting which we’d had in the cluster for much longer:

default-resource-stickiness = 100

The combination of those two pieces of config, plus a co-ordinated failure of the primary Postgres instance and crash on the synchronous replica was enough to put the cluster into a state where it would never successfully promote the synchronous replica, because Pacemaker’s constraint resolution couldn’t figure out where everything should run.

Through more experimentation, we were able to find a looser colocation constraint which didn’t confound Pacemaker’s constraint resolution code in the face of the same process crashes:

colocation -1000: BackupVIP Primary

Loosely translating this into English, it says "avoid running the BackupVIP on the Primary Postgres instance, but don’t worry if it’s there briefly".

We spent a bit more time validating that new configuration, and then copied it over to our config management system and rolled it out to production.

Narrowing down: finding the minimal set of faults

At this point, the reproduction script that had got us to our answer was simulating four conditions from the outage:

  1. The RAID 10 array on the primary node simultaneously losing 3 of its 4 disks, and becoming unusable
  2. The kernel setting the database’s filesystem read-only on that node
  3. The clustering software (Pacemaker) detecting the failure of the primary, but not failing over to a replica
  4. A crash in one of the subprocesses of the synchronous replica, which caused it to restart
  5. A suspicious log line about replication data received by the synchronous replica
  6. The BackupVIP running on the synchronous replica

We were curious which of those conditions were strictly necessary to recreate the outage, and if any of them were incidental. After selectively removing them one at a time, it turned out that the error in the replication data was a complete red herring. A minimal recreation of the incident needed only three conditions:

  • The clustering software (Pacemaker) detecting the failure of the primary, but not failing over to a replica
  • A crash in one of the subprocesses of the synchronous replica, which caused it to restart
  • The BackupVIP running on the synchronous replica

We could only laugh about how much time we’d spent in the weeds of the Postgres source code and staring at a hex editor to learn how to inject a fault that wasn’t even necessary to cause the outage. The consolation prize was that we’d learned a lot about how the database at the core of our most important services worked, which was valuable in itself.

What we learned from the outage

Naturally, the team learned a whole bunch about Postgres internals, and we could all be slightly more confident in our ability to run it well and debug it more effectively when things go wrong. That was cool, and very interesting to me personally, as databases and distributed systems are my favourite topics in computing, but there were also some broader lessons.

Something I found really striking was how seemingly simple configuration can lead to complex behaviour when it interacts with other configuration. Those two pieces of Pacemaker config which worked perfectly fine by themselves - and even worked perfectly fine under simple failure conditions - seemed innocuous until we were more than a week into the investigation. Yet with just the right combination of simultaneous failures and the BackupVIP being on the synchronous replica (rather than the asynchronous one, which the constraint would also allow), they led to complete disaster.

The other one, which scares me to this day, is just how much automation can erode your knowledge. We spent the first hour of the outage trying to get the cluster automation working again, and I think a big part of that was that we’d relied on that automation for years - we hadn’t stood Postgres up by hand for a long time. If we’d been more comfortable doing that, perhaps we would have switched up our approach earlier in the incident.

It’s tricky though, because it’s somewhat inherent to building self-healing systems. The best suggestion I’ve got is to occasionally run game days to improve your team’s ability to respond to issues in production. They’re something we’d started doing about a year before this outage, but there’s only so much ground you can cover, and production systems will always find new ways to surprise you.

With the benefit of hindsight, a game day that involved standing Postgres back up without the cluster automation would have been great preparation for an outage like this, but it’s hard to predict the future.

Beyond databases: how to debug complex issues

The big thing you need is persistence. You’re going to go down many dead ends before you get to an answer, and getting there is going to mean working outside of your comfort zone. Debugging requires a different set of muscles to writing software, and it can be hard to stay motivated for long periods of time when a system is doing something that doesn’t make sense. It’s definitely something that gets better with practice.

It’s so easy to see a write-up of an issue and think the team working on it solved it in a few hours. The hardest issues are far more complex, and can take days or even weeks to understand.

I’d also like to call out something really important when working in a test environment: keep questioning whether it actually matches the production environment where the failure happened. It’s so easy to spend a bunch of time going in the wrong direction because of a difference like that.

If the issue you’re trying to debug is amenable to it, you might be better off debugging in production. You can go a long way by adding instrumentation into your running application, and can minimise risk to users with techniques like traffic shadowing and feature flags (e.g. by only enabling a buggy feature for a small handful of users). Sadly, that wasn’t an option in this case as we were trying to diagnose an issue that caused a full database outage.

Tell me more

If you’ve enjoyed reading about this outage investigation, and you want even more details, there are two places you can go:

The talk in particular covers some extra background on Postgres’ replication mechanisms, and spends longer stepping through the trial-and-error process of figuring out how to inject a fault into the replication data.

How we can get better together

Something I’m really passionate about is learning in the open. The kind of things you learn when someone talks about how they debugged a weird issue are different from what you tend to learn from more formal materials (books, courses, etc) - not to say the latter aren’t valuable too! People directly sharing their experience tend to focus on very concrete actions they took, and surround them with the context they were working in.

I get it - sometimes there are legal considerations when sharing work, or even just practical constraints on how much time we have to do so. But there’s probably still some way you can talk about a problem you solved that tonnes of people would find interesting.

It doesn’t have to be as big a commitment as writing a presentation, or even a short blog post. It can be as small as a social media post about a bug you investigated which adds some commentary.

So if you’ve made it this far, that’s what I’ll ask of you. Next time you run into a weird problem that was interesting to debug, share it somewhere. You may be surprised how many people learn something from it.

I’m going to shamelessly take an idea from the Elixir community, where people regularly use the #MyElixirStatus hashtag to talk about what they’re doing with the language. If you do post about something interesting you debugged on social media because of this article, tag it with #MyDebuggingStatus. I’ll try to keep an eye out.

Data-Driven Decision Making – Software Delivery Performance Indicators at Different Granularities

Key Takeaways

  • Engaging teams, middle management (product managers, development managers, operations managers) and upper management (“head of” roles) in data-driven decision-making enables holistic continuous improvement of a software delivery organization
  • Optimizing a software delivery organization holistically can be supported by establishing software delivery performance indicators at different granularities
  • For each measurement dimension selected by the organization for optimization, indicators at three granularities can be created: team-level indicators, product-level indicators, and product line-level indicators
  • The three indicator granularities have been used to facilitate data-driven decision-making at different levels of organization: teams, middle management and upper management
  • A dedicated process is required at each organizational level to regularly look at the indicators, analyze the data and take action

The Data-Driven Decision Making Series provides an overview of how the three main activities in the software delivery – Product Management, Development and Operations – can be supported by data-driven decision making.

Introduction

Optimizing a software delivery organization is not a straightforward process standardized in the software industry. A myriad of parameters can be measured, which would produce lots of measurement data. However, getting the organization to analyze the data and act on it is a difficult undertaking. It is also the most impactful step in the whole optimization process: if people in the organization do not act on the measurement data, no data-driven optimization of the organization takes place.

To advance the state of the art in this area, the “Data-Driven Decision Making” article series from 2020 provided a framework for how the three main activities in software delivery – Product Management, Development and Operations – can be supported by data-driven decision making. Since then, the 25-team-strong organization running the Siemens Healthineers teamplay digital health platform and applications has gained several years of experience in optimizing itself using the framework. In this article, we present our new and deeper insights into how a socio-technical framework for optimizing a software delivery organization has been set up and brought to the point of regular use.

Selecting measurement dimensions

In a software delivery organization, a myriad of things can be measured. These range from purely mathematical measurements that are easy to count, such as the number of bugs in the bug tracking system for a time period, to socio-technical process measurements that are difficult to quantify, such as outage duration (when does the outage really start and stop?).

Whatever the measurement, it is only effective if it leads to people taking regular action on it. Without taking action to periodically analyze the measurement data and, based on the analysis results, making a change in the organization, the measurement process amounts to avoidable waste.

In a given organization, it might be difficult to agree on the measurement dimensions to optimize the software delivery. A guiding principle should be to only measure what the organization is potentially willing to act upon. Acting includes topic prioritization, process changes, organizational changes, tool changes and capital investments.

In recent years, some measurements have become popular. For example, measuring stability and speed of a software delivery process using the DORA metrics enjoys popularity as it is rooted in a multi-year scientific study of what drives the software delivery performance.

Another example is measuring reliability. Here, SRE is getting popular as a methodology employed by a growing number of organizations to run services at scale. It enables reliability measurements based on the defined SLIs, SLOs and corresponding error budget consumption tracking.
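To make the error budget idea concrete, here is a minimal sketch of the underlying arithmetic; the SLO target, request counts and failure counts are invented for illustration and are not figures from any specific organization.

```python
# Minimal sketch of SLO error budget tracking (all numbers are illustrative).

def error_budget_report(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Compute error budget consumption for one SLO period."""
    allowed_failures = (1.0 - slo) * total_requests  # the error budget, expressed in events
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "slo": slo,
        "allowed_failures": round(allowed_failures),
        "failed_requests": failed_requests,
        "budget_consumed_pct": round(consumed * 100, 1),
    }

# Example: a 99.9% availability SLO over a 30-day period with 5,000,000 requests.
print(error_budget_report(slo=0.999, total_requests=5_000_000, failed_requests=2_300))
# -> 2,300 failures against an allowance of 5,000 means 46.0% of the budget is consumed.
```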

In terms of measuring value, there are movements in the software industry to establish leading indicators of value that the teams can act upon, as opposed to the lagging indicators of value represented by the revenue streams. Some organizations use hypothesis-driven development or the North Star framework for this. So far, no industry-leading framework has emerged in this area.

As described in the article “Data-Driven Decision Making – Optimizing the Product Delivery Organization” from 2020, our organization running the Siemens Healthineers digital health platform and applications decided to measure value, speed, stability and reliability. In 2022, we added another measurement dimension: cloud cost. Why did we decide to measure these, and not other, dimensions? The answers are summarized below.

 

Measurement dimension: Value
  • Reasoning for selecting it: We sell subscriptions to digital services in the healthcare domain. This is a new market. Being able to measure the value of subscriptions to customers (not just the cost) enables us to steer product development using the new value trend feedback loop.
  • Willingness to act on the data: High willingness of the product owners to increase the value of subscriptions to customers in order to decrease subscription churn and encourage new subscriptions.
  • Measurement framework: North Star framework

Measurement dimension: Cloud cost
  • Reasoning for selecting it: The cost side of our business case depends on the recurring cost generated by using the Microsoft Azure cloud. Knowing the cloud cost by team, by product, by product line, by business unit and by cost center provides cloud cost transparency to various stakeholders on a daily basis, enabling well-informed decision-making.
  • Willingness to act on the data: High willingness of the budget owners and finance department to allocate the cloud cost properly to the involved product lines and cost centers. Further, high willingness to stay within the allocated budgets and not let the cloud cost inflate without prior agreement.
  • Measurement framework: Homegrown FinOps

Measurement dimension: Speed
  • Reasoning for selecting it: Speeding up feature delivery reduces the inventory cost, feature time to market, test automation debt and deployment automation gaps.
  • Willingness to act on the data: High willingness of the product owners and development teams to be able to release features every 2-4 weeks.
  • Measurement framework: DORA framework

Measurement dimension: Stability
  • Reasoning for selecting it: Deployment stability is important when increasing the speed of feature delivery to production. Higher frequency of feature delivery should lead to higher deployment stability (and not the other way around) because the size of deployed features is reduced.
  • Willingness to act on the data: High willingness of the development teams to deploy features in a way unnoticeable by customers, using zero downtime deployments and other techniques.
  • Measurement framework: DORA framework

Measurement dimension: Reliability
  • Reasoning for selecting it: Providing a digital service that is reliable from the user point of view is fundamental to fostering long-term subscribers to digital health services. Quantifying important reliability aspects provides transparency into the overall service reliability in all production environments.
  • Willingness to act on the data: High willingness of the development and operations teams to improve reliability.
  • Measurement framework: SRE framework

 

So far, we decided not to measure any other dimensions. There are several reasons for this:

  1. We can optimize the organization in a way that is impactful to customers and the business using the current set of five measurement dimensions:
    • a. Optimizing the subscription value makes our subscriptions more valuable to customers. The feature prioritization process becomes more evidently value-driven.
    • b. Optimizing the cloud cost makes our business case stronger. The teams produce software architectures and designs that take the cloud cost into consideration from the beginning.
    • c. Optimizing deployment speed contributes to the ability to find a feature-market fit fast. The teams invest in proper feature breakdown, architecture with loose coupling, test automation, deployment automation etc.
    • d. Optimizing deployment stability contributes to a good user experience on frequent updates. The teams invest in zero downtime deployments, API backward compatibility and other contributing technical capabilities.
    • e. Optimizing reliability contributes to a consistently good user experience throughout the subscription duration. The teams invest in effective monitoring and incident response.
  2. We have seen that optimizing the current set of measurement dimensions implicitly drives other organizational improvements such as:
    • a. Teams invest heavily in loose coupling, test automation and deployment automation by default. What is more, they constantly look for ways to optimize their deployment pipelines in these areas. This drives a very healthy engineering culture in the software delivery organization.
    • b. The organization invests in continuous compliance capability by implementing a set of tools automating regulatory document creation. The tools run automatically on each deployment pipeline run.
    • c. Teams very regularly update their 3rd-party dependencies on frameworks and libraries.
    • d. We have not seen the impact of the measurements on security and data privacy practices yet except for the teams having security penetration tests run more frequently.

All in all, while we could introduce other measurement dimensions, the current ones seem to be sufficient to drive a reasonably holistic continuous improvement using a rather small optimization framework.

Setting up a measurement system

The measurement system we set up engages the three major organizational levels in data-driven decision-making: teams, middle management and upper management.

For each measurement dimension chosen above, we set up three indicator granularities: team-level indicators, product-level indicators and product line-level indicators. This is illustrated in the figure below.

[Figure: team-level, product-level and product line-level indicators for each measurement dimension]

Following this, the data generated for each measurement dimension can be thought of as a cube spanning three axes:

  • X axis: indicator granularity
    • Team-level indicator
    • Product-level indicator
    • Product line-level indicator
  • Y axis: the measurement dimension itself
    • Value, or
    • Cloud cost, or
    • Stability, or
    • Speed, or
    • Reliability
  • Z axis: organizational level
    • Teams
    • Middle management (product managers, development managers, ops managers)
    • Upper management (‘head of’ roles)

This enables the entire product delivery organization to work with the data in the granularity that is right for the question at hand. Schematically, the data cubes are illustrated in the figure below.

[Figure: data cubes spanning indicator granularity, measurement dimension and organizational level]

For example, a product manager analyzing speed (the blue cube above) can look at the delivery speed data across all relevant products using product-level indicators. They can go further and look at the speed data in aggregate using product-line level indicators. If the product manager has a technical background, they can look at the speed data in greater detail using the team-level indicators. The result of the analysis might lead to prioritization of technical features speeding up the delivery for products where the increased speed to market holds the promise of accelerating the search for a product-market fit.

Likewise, a team architect analyzing reliability (the red cube above) can look at the service reliability for all services owned by their team (team-level reliability indicators). They can then proceed by looking at the service reliability for services they depend upon that are owned by other teams (also team-level reliability indicators). When considering whether to use a new product, they can initially look at the aggregated product-level reliability data from the past (product-level reliability indicators). If the reliability at the product level seems reasonable, they can drill down to the reliability of the individual services making up the product (team-level reliability indicators). The result of the analysis might lead to data-driven conversations with the product owners about the prioritization of reliability features over customer-facing features.

Similarly, the leadership team consisting of a head of product, head of development and head of operations may analyze the cloud cost data. They can start by looking at the cost data at the product-line level analyzing the cost trends and correlating them with the corresponding value trends. For product lines where the cost trends correlate inversely with the value trends, the leadership team can drill down into the cost and value data at the product level. The result of the analysis might be conversations with the respective product managers about the projected breakeven timepoints of newer products and revenue trends of mature products.
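As a rough illustration of how such a cube could be represented and queried, the sketch below indexes hypothetical indicator values by measurement dimension, granularity and entity; all names and numbers are invented for the example and do not come from our organization.

```python
# Hypothetical indicator store: dimension -> granularity -> entity -> value.
# All entities and values below are invented for illustration.
indicators = {
    "speed": {
        "team":         {"team-a": 12.0, "team-b": 30.0},    # lead time in days
        "product":      {"product-x": 21.0},
        "product_line": {"line-1": 25.0},
    },
    "reliability": {
        "team":         {"team-a": 99.95, "team-b": 99.80},  # availability in %
        "product":      {"product-x": 99.90},
        "product_line": {"line-1": 99.85},
    },
}

def lookup(dimension: str, granularity: str, entity: str) -> float:
    """Slice the cube: one measurement dimension, one granularity, one entity."""
    return indicators[dimension][granularity][entity]

# A product manager starts at product granularity, then drills down to a team.
print(lookup("speed", "product", "product-x"))
print(lookup("speed", "team", "team-a"))
```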

Setting up processes to act on data

Acting on the measurement data above across the organizational silos and levels requires dedicated processes to be set up. The processes are different for teams, middle management (product managers, development managers, operations managers) and upper management (“head of” roles). We found the following cadences and approaches useful in our organization.

 

Organizational level: Teams
  • Data analysis cadence: Generally every three months, sometimes more frequently
  • Process description: A dedicated one-hour meeting where the team looks at all available measurement dimensions (e.g. reliability, speed, stability, cloud cost and value), performs the data analysis together and derives action items
  • Indicator granularities used: Team-level and product-level indicators

Organizational level: Product management
  • Data analysis cadence: Roughly every three months
  • Process description: Together with the teams, see the Teams entry above
  • Indicator granularities used: Product-level indicators

Organizational level: Development management
  • Data analysis cadence: Roughly every two months
  • Process description: In scrum of scrums or a similar forum, the speed and stability data is analyzed, associated process changes are discussed and improvement conversations with other roles are prepared
  • Indicator granularities used: Product-level indicators

Organizational level: Operations management
  • Data analysis cadence: Roughly every month, aligned with SRE error budget periods
  • Process description: In an operations review or a similar forum, aggregated reliability data is analyzed, services with consistently low reliability in relevant SLIs (availability, latency, freshness etc.) are identified and improvement conversations with the service owner teams are prepared
  • Indicator granularities used: Product-level indicators and team-level indicators

Organizational level: Upper management
  • Data analysis cadence: Every 4-6 months in a formal setting; often in ongoing conversations and management standups
  • Process description: Relevant data points are used in ongoing conversations and management standups. Portfolio discussions happen using the relevant data. Management offsites contain presentations using the relevant data.
  • Indicator granularities used: Product line-level and product-level indicators

A note on the data analysis cadence at the team level: although many teams chose to look at the indicator data every 3 months, some teams look at some of the indicators more frequently. Specifically, the cloud cost data is sometimes watched on a daily basis, especially at times when a team is optimizing the architecture or design to reduce cloud cost. Additionally, build and deployment stability data is sometimes watched on a weekly basis when the team is working on stabilization after a bigger redesign.

Optimizing the organization

In this section, we present an example of how we managed to optimize the organization in terms of speed using the data-driven decision-making occurring simultaneously at different organizational levels.

Speed is a very relevant measurement dimension as everybody wants to deliver features faster. When we started, the appetite to speed up was strong across all our teams. At some point, the speed indicators were introduced and brought to maturity. This enabled the following data-driven workflows throughout the organization:

 

At the team level

The lead times between the pipeline environments became apparent. For example, in a pipeline made up of the following environments

Build agent → Accept1 → Accept2 → Sandbox → Production

the four mean lead times between the five environments could be seen at a glance:

Build agent →time→ Accept1 →time→ Accept2 →time→ Sandbox →time→ Production

This gave the team owning the pipeline a new visual feedback loop into the speed of code movements between the environments. The team could optimize the speed by reducing the test suite runtimes.
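A minimal sketch of how such per-environment lead times could be computed from deployment timestamps follows; the environment names match the example pipeline above, while the timestamps and function names are assumptions made purely for illustration.

```python
from datetime import datetime
from statistics import mean

STAGES = ["build_agent", "accept1", "accept2", "sandbox", "production"]

# Hypothetical arrival timestamps of two changes in each environment.
changes = [
    {"build_agent": "2023-03-01T10:00", "accept1": "2023-03-01T11:30",
     "accept2": "2023-03-01T15:00", "sandbox": "2023-03-02T09:00",
     "production": "2023-03-03T10:00"},
    {"build_agent": "2023-03-06T09:00", "accept1": "2023-03-06T09:45",
     "accept2": "2023-03-06T13:00", "sandbox": "2023-03-07T08:00",
     "production": "2023-03-08T12:00"},
]

def mean_lead_times(changes: list) -> dict:
    """Mean hours between consecutive environments, across all changes."""
    result = {}
    for src, dst in zip(STAGES, STAGES[1:]):
        hours = [
            (datetime.fromisoformat(c[dst]) - datetime.fromisoformat(c[src])).total_seconds() / 3600
            for c in changes
        ]
        result[f"{src} -> {dst}"] = round(mean(hours), 1)
    return result

print(mean_lead_times(changes))
# e.g. {'build_agent -> accept1': 1.1, 'accept1 -> accept2': 3.4, ...}
```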

At the middle management level

The product release lead times became apparent. For example, for a product X, the following graphs were drawn:

2021: Release1 →time→ Release2 →time→ Release3
2022: Release1 →time→ Release2 →time→ Release3 →time→ Release4

The product and development managers could draw the following conclusions from the graphs’ analysis:

  • What are the product management reasons for product X not being released more frequently? E.g. the feature breakdown undertaken so far had not been done to the granularity necessary for a team to be able to implement and release small parts of the features incrementally. The features were too big for this, requiring larger and less frequent releases.
  • What are the technical reasons for product X not being released more frequently? E.g. tightly coupled architecture and manual test suites.
  • What are the project reasons for product X not being released more frequently? E.g. the release cadence set by project management in agreement with the wider organization simply did not foresee more releases in a year.

At the upper management level

The product line and product release lead times became apparent. The head of product and head of development saw the connection between the release cadences and the regulatory burden entailed with each release. They initiated a project between the product lines and the regulatory department to re-assess the regulatory framework in the organization.

It turned out that the production of documentation required by regulations was largely manual. Semi-automation was possible. This insight led to an organization-wide decision to invest in the area.

A couple of years in, the manual burden of producing release documentation required by the regulatory framework was greatly reduced. This paved the way to accelerating releases throughout the organization.

Summary

Optimizing a software delivery organization holistically is a complex endeavor. It can be supported well by providing measurement indicators at three granularities: team-level, product-level and product line-level indicators. This enables data-driven decision-making for the teams, middle management and upper management of the organization. Creating dedicated processes for analyzing the data and taking action at these three organizational levels facilitates holistic continuous improvement in the organization.

The Data-Driven Decision Making Series provides an overview of how the three main activities in the software delivery – Product Management, Development and Operations – can be supported by data-driven decision making.

DevEx, a New Metrics Framework From the Authors of SPACE

Key Takeaways

  • Researchers Abi Noda, Dr. Nicole Forsgren, Dr. Margaret-Anne Storey, and Dr. Michaela Greiler have published a paper that provides a practical path for improving productivity, which is focused on Developer Experience (DevEx).
  • Developer experience focuses on the lived experience of developers and the points of friction they encounter in their everyday work. The authors assert that focusing on developer experience is the key to maximizing engineering effectiveness, and introduce a framework for measuring and improving DevEx.
  • Organizations can improve the developer experience by identifying the top points of friction that developers encounter, and then investing in improving areas that will increase the capacity or satisfaction of developers.
  • The DevEx framework distills the factors affecting developer experience into three dimensions: feedback loops, cognitive load, and flow state. Leaders can select metrics within these three dimensions in order to measure and identify areas to focus on that would ultimately drive productivity improvements.
  • Surveys provide a practical starting point for capturing a holistic set of measures to fully understand the developer experience. Effective survey programs require attention to survey design, as well as the ability to break down results by persona and team and compare results against internal and external benchmarks.

A recently published research paper outlines a new framework for measuring and improving developer productivity.

This framework, referred to as the DevEx framework, was authored by Abi Noda, Dr. Margaret-Anne Storey, Dr. Nicole Forsgren, and Dr. Michaela Greiler.

Leaders have long sought to improve the productivity of their engineering organizations in order to help their businesses move faster, build new products, and tap into new and emerging trends.

However, knowing what to focus on in order to achieve this goal has remained elusive, despite recent methods such as DORA and SPACE. This new framework aims to address this gap.

Drawing on their extensive research and experience, the authors assert that focusing on developer experience is the key to maximizing engineering effectiveness.

Their paper presents frameworks which distill developer experience into its three core dimensions and provide a methodology for measuring it.

This article includes a summary of the paper’s key points along with commentary from the lead author, Abi Noda. Here’s also a link to the full paper.

Measuring productivity is hard

In a recent article, Google researchers Ciera Jaspan and Collin Green suggest two reasons why measuring developer productivity is so challenging: software engineering is not repeatable, and a developer’s productivity is highly affected by outside forces.

As for the latter, an outside force could be the complexity of the work (and whether it is necessarily that complex), the interactions with others to get the job done, or the organizational design. There are also factors that specifically affect developers, including flaky tests, build speeds, and technical debt.

The other reason why measuring productivity is difficult is that software development is a creative endeavor: it is not about the production of uniform, interchangeable outputs. “Attempts to quantify work productivity by borrowing methods from operating machinery are not suited to software engineering.”

“We have to remember that we’re working with a fundamentally human system. To understand how the system can be improved, we need to find out from human beings what they are experiencing.”

Developer experience offers a new lens

Developer experience provides a new way of understanding developer productivity: from the perspective of developers themselves. Developer experience encompasses how developers “feel about, think about, and value their work,” and focuses on the everyday realities and friction that developers face while performing their work.

Prior research has identified numerous factors that affect developer experience: for example, interruptions, unrealistic deadlines, and friction in development tools negatively impact how developers feel about their work. Having clear tasks, well-organized code, and pain-free releases improve developer experience.

Organizations can improve developer experience by identifying the top points of friction that developers encounter, and then investing in improving areas that will increase the capacity or satisfaction of developers. For example, an organization can focus on reducing friction in development tools in order to allow developers to complete tasks more seamlessly. Even a small reduction in wasted time, when multiplied across an engineering organization, can have a greater impact on productivity than hiring additional engineers.

More companies are focusing on developer experience

A recent study from Gartner revealed that 78% of surveyed organizations have a formal developer experience initiative either established or planned, while a similar study from Forrester showed that 75% of enterprise leaders regarded developer experience as crucial to executing business strategy. These findings indicate a growing recognition of the substantial benefits that can be derived from investing in developer experience programs.

Related studies from McKinsey and Stripe have further validated the business impact of optimizing work environments for developers, and consequently there is an increasing number of organizations establishing C-level initiatives around developer experience.

The three dimensions of developer experience

In their paper, the authors distill the previously identified factors affecting developer experience into three core dimensions: feedback loops, cognitive load, and flow state.

“This framework was informed by our prior research and experience, and seeing the gaps in how organizations approach developer productivity and experience. Our goal was to create a practical framework that would be easy for people to understand and apply, and capture the most important aspects of developer experience.”

To summarize each of the dimensions:

Feedback loops refer to the speed and quality of responses relative to actions performed. Fast feedback loops are a critical component of efficient development processes, as they enable developers to complete their work expeditiously and with minimal friction. On the other hand, slow feedback loops can lead to disruptions in the development cycle, causing developers to become frustrated and delays to occur. Hence, organizations must endeavor to shorten feedback loops by identifying areas where development tools can be accelerated and human hand-off processes can be optimized, such as build and test processes or development environment setup.

Cognitive load encompasses the amount of mental processing required for a developer to perform a task. High cognitive load can result from challenges like poorly documented code or systems, forcing developers to devote extra time and effort to completing their work and avoiding mistakes. To improve the developer experience, teams and organizations should aim to alleviate cognitive load by removing any unnecessary hurdles in the development process.

Flow state refers to the mental state of being fully absorbed and energized while engaged in an activity, characterized by intense focus and enjoyment. This is often referred to as “being in the zone.” Experiencing flow state frequently at work can lead to greater productivity, innovation, and employee development. Similarly, research has shown that developers who derive satisfaction from their work tend to produce higher quality products. Therefore, teams and organizations should focus on creating optimal conditions that promote the flow state to foster employee well-being and performance.

Taken together, these three dimensions encapsulate the full range of friction types encountered by developers. Although developer experience is complex and nuanced, teams and organizations can take steps toward improvement by focusing on these three key areas.

Leaders can surface opportunities to improve productivity by selecting metrics within the three dimensions.

“The DevEx framework provides a way to improve developer productivity in a systematic and developer-centric way. We encourage readers to capture metrics within each of the three dimensions in order to illuminate areas of friction as well as effectively prioritize the areas that will have the biggest impact on the organization’s intended outcomes.”

What to measure

The first task for organizations looking to improve their developer experience is to measure where friction exists across the three previously described dimensions. The authors recommend selecting topics within each dimension to measure, capturing both perceptual and workflow metrics for each topic, and also capturing KPIs to stay aligned with the intended higher-level outcomes.

Measure topics across the three dimensions. For example, an organization may choose to measure test efficiency (Feedback loops), codebase complexity (Cognitive load), the balance of tech debt (Cognitive load), and time for deep work (Flow state). Some topics may map onto more than one dimension.

“We advocate that leaders opt for metrics that span each of the three dimensions to acquire a comprehensive understanding of the developer experience. For instance, a topic that could be evaluated within the Feedback Loops dimension is Test Efficiency, while Codebase Complexity could be measured within the Cognitive Load dimension.”

Capture both perceptual and workflow measures for each topic. Measuring developer experience requires capturing developers’ perceptions – their attitudes, feelings, and opinions – in addition to objective data about engineering systems and processes. This is because neither perceptual nor workflow measures alone can tell the full story.

For instance, a seemingly fast build process may feel disruptive to developers if it regularly interrupts their work progress. Conversely, even if developers feel content with their build processes, using objective measures such as build time may reveal that feedback loops are slower than they could be and development workflows less streamlined than they might be. Hence, analyzing both perceptual and workflow measures is necessary to gain a complete understanding of the points of friction that developers encounter in their everyday work.  
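As a loose illustration of pairing the two kinds of measures, the sketch below combines a hypothetical survey score with a hypothetical workflow metric for the same topic; the team names, scales and thresholds are assumptions for the example, not recommendations from the paper.

```python
# Hypothetical paired measures for one DevEx topic ("build process") per team.
# perceptual: survey agreement with "Builds rarely disrupt my work" (1-5 scale)
# workflow:   median CI build time in minutes, taken from the build system
measures = [
    {"team": "payments", "perceptual": 2.1, "workflow_min": 9.0},
    {"team": "search",   "perceptual": 4.3, "workflow_min": 22.0},
]

def friction_signal(row: dict, survey_floor: float = 3.0, build_ceiling: float = 15.0) -> str:
    """Neither signal alone tells the full story; look at both."""
    slow = row["workflow_min"] > build_ceiling
    unhappy = row["perceptual"] < survey_floor
    if unhappy and not slow:
        return "builds look fast on paper but still feel disruptive: investigate interruptions"
    if slow and not unhappy:
        return "developers seem content, but the workflow data shows room to improve"
    if slow and unhappy:
        return "clear friction on both signals"
    return "no strong friction signal"

for row in measures:
    print(row["team"], "->", friction_signal(row))
```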

Measure KPIs to stay focused on driving business outcomes that matter. KPIs serve as north-star metrics for DevEx initiatives. Well-designed KPIs should measure the outcomes that the business seeks to drive, including improvements to productivity, satisfaction, engagement, and retention.

How to measure

The authors recommend starting with surveys to capture the above metrics. Surveys provide the advantage of being able to capture all aspects of the developer experience, including KPIs, perceptual measures, and workflow measures.

“Companies like Google, Microsoft, and Spotify have relied on survey-based developer productivity metrics for years. However, designing and administering surveys can be difficult, so we hope our framework provides a good starting point for leaders to follow.”

Given the importance of surveys, the authors outline several important considerations for survey programs to be successful:

  • Design surveys carefully. Poorly designed survey questions lead to inaccurate and unreliable results. At minimum, the authors say that survey questions should be based on well-defined constructs, and rigorously tested in interviews for consistent interpretation.  
  • Break down results by team and persona. A common mistake made by organizational leaders is to focus on company-wide results instead of data broken down by team and persona (e.g. role, tenure, seniority). Focusing only on aggregate results can lead to overlooking problems that affect small but important populations within the company (see the sketch after this list).
  • Compare results against benchmarks. Comparative analysis can help contextualize data and help drive action. For example, developer sentiment toward tech debt is commonly negative, making it difficult to identify problems or gauge their scale. Benchmarks allow leaders to see when teams have lower sentiment scores than their peers, and when organizations have lower scores than their industry competitors. These signals flag notable opportunities for improvement.
  • Mix in transactional surveys. In addition to periodic surveys, organizations can use transactional surveys to collect feedback based on specific touchpoints. For example, Platform teams can use transactional surveys to prompt developers for feedback when a specific error occurs during the installation of a CLI tool. Transactional surveys provide a continuous stream of feedback and can generate higher quality responses due to the timeliness of their posed questions.
  • Watch out for survey fatigue. Many organizations struggle to sustain high participation rates in surveys over time. A lack of follow-up action commonly causes developers to feel that repeatedly responding to surveys is not a worthwhile exercise. It is therefore critical that leaders and teams follow up on surveys.
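As a small sketch of why broken-down results matter more than the company-wide average, the snippet below groups invented survey scores by persona; the personas, question and scores are purely illustrative.

```python
from collections import defaultdict
from statistics import mean

# Invented survey responses: agreement with "I can stay in flow most days" (1-5 scale).
responses = [
    {"team": "payments", "persona": "senior",     "score": 4.2},
    {"team": "payments", "persona": "new joiner", "score": 2.1},
    {"team": "search",   "persona": "senior",     "score": 4.0},
    {"team": "search",   "persona": "new joiner", "score": 2.4},
]

def breakdown(responses: list, key: str) -> dict:
    """Average score per group (e.g. per persona or per team)."""
    groups = defaultdict(list)
    for r in responses:
        groups[r[key]].append(r["score"])
    return {group: round(mean(scores), 2) for group, scores in groups.items()}

print("company-wide:", round(mean(r["score"] for r in responses), 2))  # 3.18 hides the gap
print("by persona:  ", breakdown(responses, "persona"))                # new joiners score far lower
```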

Conclusion

The DevEx framework provides a practical lens for understanding developer experience, while the accompanying measurement approaches help systematically guide improvement. Organizations should begin measuring developer experience now, even if they have not yet established or planned a formal DevEx investment.

AIOps: Site Reliability Engineering at Scale

Key Takeaways

  • AIOps can simplify and streamline processes which can reduce the mental burden on employees
  • Another benefit is improved communication and collaboration between departments leading to more efficient use of resources and reduced budget overhead
  • AIOps can simplify implementing measures to minimize downtime, such as improving maintenance schedules or upgrading equipment
  • AIOps can improve customer satisfaction and enhance customer trust while reducing service disruptions.

It was in the 20th century when software began eating the world. In today’s 21st-century environment, its appetite has turned to humans.

Whether it is financial systems, governmental software, or business-to-business applications, one thing remains: these systems are critical to revenue, and in some cases, to human safety. They must remain highly available in the face of technological, natural, and human-made adversity. Enter the Site Reliability Engineer or SRE.

The SRE model was born out of Google when Ben Treynor Sloss established the first team in 2003:

Fundamentally, it’s what happens when you ask a software engineer to design an operations function … So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.[1]

Since its inception, engineering organizations have adopted this model in various ways, yet the fact remains the same. These engineers support revenue and business-critical operations 24x7x365.

It is challenging to locate, hire, and train SREs. In an ever-changing landscape of infrastructure and buzzwords, this raises the question of how to scale these teams sustainably to ensure the well-being of the team and the continuity of operations. Enter AIOps.

AIOps, or artificial intelligence for IT operations, is a set of technologies and practices that use artificial intelligence, machine learning, and big data analytics to improve the reliability of software systems. AIOps enables cognitive stress reduction, increased cross-functional collaboration, decreased downtime, increased customer satisfaction, and reduced cost overhead.

Reducing Cognitive Overload

The on-call engineer’s cognitive stress problem comes in two forms: separating signal from alert noise, and information retrieval.

For anyone who has ever held the proverbial pager (we don’t still use real pagers, do we?), the noise versus signal problem immediately comes to mind when considering cognitive stress factors. This problem concerns the balance between actionable alerts and alerts that are too sensitive or noisy. The imbalance creates a symptom called alert fatigue.[2]

One of the critical benefits of AIOps is cognitive stress reduction. AIOps systems can automatically identify and diagnose issues and can even predict potential problems before they occur. This can reduce the cognitive load on SRE teams, allowing them to focus on more business-aligned project work rather than spending their time troubleshooting issues. 

Additionally, AIOps systems can assist with the “front door problem” associated with incident triage. Monitoring systems collect millions of data points, and the quality of information attached to an alert is human-dependent. Often, this generates a single question when an SRE begins system triage:

“Where do I begin looking to understand the potential blast radius better?”

AIOps systems can assist with this initial triage by analyzing potential anomalies in system state and/or telemetry data and providing both potential areas to focus on and complementary documentation sourced from the organizational intranet.

SREs must begin thinking about how to empower the adoption of AIOps in their organizations. While this is yet another tech stack SREs need to learn, the benefits can have exponentially positive results in reducing their overall cognitive load.

Enhancing the Cross-Team Engagement Model

AIOps (Artificial Intelligence for IT Operations) can significantly improve cross-functional engagement in a business. In traditional IT operations, different teams may work in silos, resulting in communication gaps, misunderstandings, and delays in issue resolution. AIOps can help bridge these gaps and facilitate collaboration between different teams.

One way AIOps improves cross-functional engagement is through its ability to provide real-time insights and analytics into various IT processes. This enables different teams to access the same information, which can help improve communication and reduce misunderstandings. For example, the data provided by AIOps can help IT teams and business stakeholders identify potential issues and proactively take action to prevent them from occurring, leading to better outcomes and higher customer satisfaction.

Another way AIOps improves cross-functional engagement is through its ability to automate various IT processes. By automating routine tasks, AIOps can free up time for IT teams to focus on strategic initiatives, such as improving customer experiences and innovating new solutions. This can lead to improved collaboration between IT teams and business stakeholders. Both groups can work together to identify areas where automation can be implemented to improve efficiency and reduce costs.

Overall, AIOps can improve cross-functional engagement by providing real-time insights and analytics, automating routine tasks, and enabling collaboration between different teams. By breaking down silos and improving communication between IT and business stakeholders, AIOps can help businesses deliver more reliable and efficient IT services, leading to better outcomes and higher customer satisfaction.

Reducing Downtime Throughout the SDLC

Another critical benefit of AIOps is decreased downtime. Diagnosing a system degradation or failure involves evaluating the performance of computing systems within a constrained environment. The thousands of data inputs require humans-in-the-loop (HIL) to design additional systems that alert an engineer based on a given set of metrics. The process extends further when an engineer has to read and interpret the data presented to them after an alert is triggered.

Metrics such as time-to-detection and time-to-resolution are an aggregate evaluation of an engineering team’s effectiveness at receiving, interpreting, triaging, and resolving such incidents. All of this can be drastically improved upon by implementing an AIOps system. In critical environments, it may be necessary to maintain a HIL to decide what actions to take inside a company’s infrastructure. Even so, an AIOps system can intelligently and diligently analyze the streams of data points it ingests, auto-remediating less critical issues without human intervention and alerting only for the highest severity issues.

Happy Customers, Happy Life

From a customer perspective, AIOps can have a significant impact on their satisfaction with the services they receive. For example, AIOps can help businesses proactively identify and resolve issues before they impact customers. This means that customers are less likely to experience service disruptions or downtime, resulting in improved availability and reliability of services. Additionally, AIOps can help businesses improve the speed and accuracy of incident resolution, which can help minimize the impact of incidents on customers.

Another benefit of AIOps is that it can help businesses identify and resolve issues more quickly, shortening resolution times. This can be particularly important for customers who are experiencing critical issues or downtime. By resolving these issues faster, businesses can minimize the impact on customers and reduce the risk of customer churn.

Overall, AIOps has the potential to significantly improve customer satisfaction by helping businesses deliver more reliable and available IT services and resolve incidents faster. As a senior software engineer, I believe that AIOps is a powerful approach to IT operations that can help businesses stay ahead of the curve in today’s fast-paced and competitive market.

Patching the Leaky Bucket

AIOps can help automate and optimize various IT processes, including monitoring, event correlation, and incident resolution. By automating these processes, AIOps can reduce the need for manual intervention, which can help reduce labor costs. Additionally, by optimizing these processes, AIOps can help companies reduce the time and resources required to manage IT operations, leading to overall cost savings.

This can help companies reduce the number of service disruptions and outages, which can lead to significant cost savings. Downtime and service disruptions can be costly for businesses, resulting in lost productivity, revenue, and customer satisfaction. By detecting and resolving issues before they impact services, AIOps can help minimize the risk of service disruptions and downtime, leading to cost savings for the business.

Additionally, AIOps can help businesses improve their overall IT infrastructure and application performance. By providing real-time insights into application and infrastructure performance, AIOps can help companies optimize their resources and reduce inefficiencies. This can lead to cost savings by reducing the need for additional hardware and software resources.

A quick internet search will reveal the average salary of a Software Engineer in the United States is $90,000 – $110,000 USD. This roughly equates to $47 – $57 an hour. Imagine, on average, your incidents involve five engineers, and it takes you three hours to resolve an issue. That means your incidents cost you $705 – $855 per incident. Now imagine you have three incidents a month, bringing you to roughly $25,380 – $30,780 a year in costs. This doesn’t include any customer revenue loss or the intangible costs of losing customer trust. There are a few essential questions to ask yourself to get a rough estimate of how much an incident costs your company.

  1. How much are engineers paid at my company?
  2. How many incidents do we have in a year?
  3. How long does it take us to resolve those incidents?
  4. What are the intangible costs to our company because of incidents?

Once you do this back-of-envelope math, you’ll quickly understand how even a 10% decrease in incidents will save your company an impressive amount of money on the bottom line.
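The back-of-envelope math above fits in a few lines of code; the figures below simply reuse the illustrative numbers from this article, and your own inputs will differ.

```python
# Rough yearly engineering cost of incidents (ignores lost revenue and customer trust).

def annual_incident_cost(hourly_rate: float, engineers_per_incident: int,
                         hours_per_incident: float, incidents_per_year: int) -> float:
    cost_per_incident = hourly_rate * engineers_per_incident * hours_per_incident
    return cost_per_incident * incidents_per_year

# Five engineers at $57/hour for three hours, three incidents a month (36 a year):
print(annual_incident_cost(57, 5, 3, 36))  # 30780.0, the upper bound from the text
print(annual_incident_cost(47, 5, 3, 36))  # 25380.0, the lower bound
# Even a 10% reduction in incident count saves roughly $2,500 - $3,100 a year in this example.
```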

Where to Start

The truth is, adopting AIOps is a long journey for any organization. However, with persistence and focus, a company can realize the benefits discussed earlier in this article. Here are a few considerations to get started on your adoption of AIOps. 

  1. Define your goals: The first step is to determine what you want to achieve with AIOps. This can include reducing downtime, improving incident response times, or optimizing resource utilization.
  2. Assess your current IT infrastructure: Before implementing AIOps, you need to understand your existing IT infrastructure, including the tools and technologies you currently use. This will help you identify any gaps that AIOps can fill and ensure that your AIOps program integrates smoothly with your existing systems.
  3. Choose an AIOps platform: There are many AIOps platforms available in the market. Evaluate different options and choose a platform that aligns with your goals and IT infrastructure. Look for features such as automated root cause analysis, anomaly detection, and machine learning algorithms.
  4. Identify data sources: AIOps platforms require a significant amount of data to operate effectively. Identify the data sources you will need to collect, such as log files, performance metrics, and configuration data.
  5. Develop a data strategy: Determine how you will collect, store, and manage the data required for AIOps. This includes deciding on data retention policies, data security measures, and data access controls.
  6. Train your AIOps platform: Once you have set up your AIOps platform and data strategy, you will need to train the platform to recognize patterns and anomalies in your IT infrastructure. This involves feeding historical data into the platform and tweaking the algorithms to optimize performance (see the sketch after this list).
  7. Integrate with your IT operations: Finally, you will need to integrate your AIOps program with your IT operations. This includes setting up workflows for incident management, change management, and resource allocation.
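For step 6, the pattern recognition involved can start as simply as flagging metric values that deviate strongly from a historical baseline. The sketch below uses a basic z-score rule on invented latency data; production AIOps platforms use far more sophisticated models, so treat this only as an intuition builder.

```python
from statistics import mean, stdev

def z_score_anomalies(history: list, recent: list, threshold: float = 3.0) -> list:
    """Flag recent metric samples that deviate strongly from the historical baseline."""
    baseline_mean = mean(history)
    baseline_std = stdev(history) or 1e-9  # guard against flat (zero-variance) history
    return [x for x in recent if abs(x - baseline_mean) / baseline_std > threshold]

# Invented example: request latency in milliseconds.
history = [120, 118, 125, 130, 122, 119, 127, 124, 121, 126]
recent = [123, 128, 410, 125]              # one sample looks suspicious
print(z_score_anomalies(history, recent))  # -> [410]
```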

Conclusion

In conclusion, AIOps is a set of technologies and practices that use artificial intelligence, machine learning, and big data analytics to improve the reliability of software systems. AIOps enables cognitive stress reduction, increased cross-functional collaboration, decreased downtime, increased customer satisfaction, and reduced cost overhead. These benefits can be achieved by automating incident management processes, providing real-time visibility into the performance of software systems, and optimizing resource allocation.

References

  1. Google Interview
  2. “Want to Solve Over-Monitoring and Alert Fatigue? Create the Right Incentives!” [Kishore Jalleda, Yahoo, USENIX SREcon17]

How Not to Use the DORA Metrics to Measure DevOps Performance

Key Takeaways

  • With metrics teams must remember Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.”
  • Low-performing teams take a hit on stability when they try to increase their deployment frequency simply by working harder.
  • Driving improvements in the metric may lead to taking shortcuts with testing, causing buggy code, or to producing brittle software quickly.
  • A high change failure rate may reduce the effectiveness of the other metrics in terms of measuring progress toward continuous delivery of value to your customers.

Since 2014, Google’s DevOps Research and Assessment (DORA) team has been at the forefront of DevOps research. This group combines behavioural science, seven years of research, and data from over 32,000 professionals to describe the most effective and efficient ways to deliver software. They have identified technology practices and capabilities proven to drive organisational outcomes and published four key metrics that teams can use to measure their progress. These metrics are:

  1. Deployment Frequency
  2. Lead Time for Changes
  3. Mean Time to Recover
  4. Change Failure Rate
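As a rough sketch of how these four metrics could be computed from deployment records, the code below assumes a simple list of deployments with commit and deploy timestamps, a failure flag and a restore time; this data model and the numbers in it are invented for illustration and are not part of the DORA definitions.

```python
from datetime import datetime
from statistics import mean

# Invented deployment records; real data would come from CI/CD and incident tooling.
deployments = [
    {"committed": "2023-05-01T09:00", "deployed": "2023-05-01T16:00", "failed": False, "restored": None},
    {"committed": "2023-05-02T10:00", "deployed": "2023-05-03T11:00", "failed": True,  "restored": "2023-05-03T12:30"},
    {"committed": "2023-05-04T08:00", "deployed": "2023-05-04T13:00", "failed": False, "restored": None},
]

def hours(start: str, end: str) -> float:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 3600

def dora_summary(deployments: list, period_days: int = 7) -> dict:
    failures = [d for d in deployments if d["failed"]]
    return {
        "deployment_frequency_per_day": round(len(deployments) / period_days, 2),
        "lead_time_for_changes_hours": round(mean(hours(d["committed"], d["deployed"]) for d in deployments), 1),
        "time_to_recover_hours": round(mean(hours(d["deployed"], d["restored"]) for d in failures), 1) if failures else None,
        "change_failure_rate_pct": round(100 * len(failures) / len(deployments), 1),
    }

print(dora_summary(deployments))
# -> roughly 0.43 deployments/day, 12.3h lead time, 1.5h to recover, 33.3% change failure rate
```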

In today’s world of digital transformation, companies need to pivot and iterate quickly to meet changing customer requirements while delivering a reliable service to their customers. The DORA reports identify a range of important factors which companies must address if they want to achieve this agility, including cultural (autonomy, empowerment, feedback, learning), product (lean engineering, fire drills, lightweight approvals), technical (continuous delivery, cloud infrastructure, version control) and monitoring (observability, WIP limits) factors. 

While an extensive list of “capabilities” is great, for software teams to continually improve their processes to meet customer demands they need a tangible, objective yardstick to measure their progress. The DORA metrics are now the de facto measure of DevOps success for most and there’s a consensus that they represent a great way to assess performance for most software teams, thanks to books like Accelerate: The Science of Lean Software and DevOps (Forsgren et al, 2018) and Software Architecture Metrics (Ciceri et al, 2022).

But when handling metrics, teams must always be careful to remember Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.” The danger is that metrics become an end in themselves rather than a means to an end.

Let’s explore what this might look like in terms of the DORA metrics — and how you can avoid pulling the wool over your own eyes.

Deployment Frequency

For the primary application or service you work on, how often does your organisation deploy code to production or release it to end users?

At the heart of DevOps is an ambition that teams never put off a release simply because they want to avoid the process. By addressing any pain points, deployments cease to be a big deal, and your team can release more often. As a result, value is delivered sooner, more incrementally, allowing for continuous feedback from end users, who then shape the direction of travel for ongoing development work.

For teams who are currently only able to release at the end of a biweekly sprint or even less often, the deployment frequency metric hopefully tracks your progress toward deployments once a week, multiple times a week, daily, and then multiple times a day for elite performers. That progression is good, but it also matters how the improvements are achieved.

What does this metric really measure? Firstly, whether the deployment process is continuously improving, with obstacles being identified and removed. Secondly, whether your team is successfully breaking up projects into changes that can be delivered incrementally. 

As you celebrate the latest increase in deployment frequency, ask yourself: are our users seeing the benefit of more frequent deployments? Studies have shown that low-performing teams take a big hit on stability when they try to increase their deployment frequency simply by working harder (Forsgren, Humble, and Kim, 2018). Have we only managed to shift the dial on this metric by cracking the whip to increase our tempo?

Lead Time for Changes

For the primary application or service you work on, what is your lead time for changes (that is, how long does it take to go from code committed to code successfully running in production)?

While there are a few ways of measuring lead times (which may be equivalent to or distinct from “cycle times,” depending on who you ask), the DORA definition is how long it takes for a committed change to reach production and be in the hands of users.

By reducing lead times, your development team will improve business agility. End users don’t wait long to see the requested features being delivered. The wider business can be more responsive to challenges and opportunities. All this helps improve engagement and interplay between your development team, the business, and end users.

Of course, reduced lead times go hand in hand with deployment frequency. More frequent releases make it possible to accelerate project delivery. Importantly, they ensure completed work doesn’t sit around waiting to be released.

How can this metric drive the wrong behaviour? If your engineering team works towards the metric rather than the actual value the metric is supposed to measure, they may end up taking shortcuts with testing and releasing buggy code, or coding themselves into a corner with fast but brittle approaches to writing software.

These behaviours produce a short-term appearance of progress, but a long-term hit to productivity. Reductions in lead times should come from a better approach to product management and improved deployment frequency, not a more lax approach to release quality where existing checks are skipped and process improvements are avoided.

Mean Time to Recover

For the primary application or service you work on, how long does it generally take to restore service when a service incident or a defect that impacts users occurs (for example, unplanned outage, service impairment)?

Part of the beauty of DevOps is that it doesn’t pit velocity and resilience against each other but makes them mutually beneficial. For example, frequent small releases with incremental improvements can more easily be rolled back if there’s an error. Or, if a bug is easy to identify and fix, your team can roll forward and remediate it quickly. 

Yet again, we can see that the DORA metrics are complementary; success in one area typically correlates with success across others. However, driving success with this metric can be an anti-pattern – it can unhelpfully conceal other problems. For example, if your strategy to recover a service is always to roll back, then you’ll be taking value from your latest release away from your users, even those that don’t encounter your new-found issue. While your mean time to recover will be low, your lead time figure may now be skewed and not account for this rollback strategy, giving you a false sense of agility. Perhaps looking at what it would take to always be able to roll forward is the next step on your journey to refine your software delivery process. 

It’s possible to see improvements in your mean time to recovery (MTTR) that are wholly driven by increased deployment frequency and reduced lead times. Alternatively, maybe your mean time to recovery is low because of a lack of monitoring to detect those issues in the first place. Would improving your monitoring initially cause this figure to increase, but for the benefit of your fault-finding and resolution processes? Measuring the mean time to recovery can be a great proxy for how well your team monitors for issues and then prioritises solving them. 

With continuous monitoring and increasingly relevant alerting, you should be able to discover problems sooner. In addition, there’s the question of culture and process: does your team keep up-to-date runbooks? Do they rehearse fire drills? Intentional practice and sufficient documentation are key to avoiding a false sense of security when the time to recover is improving due to other DevOps improvements.

Change Failure Rate

For the primary application or service you work on, what percentage of changes to production or releases to users result in degraded service (for example, lead to service impairment or service outage) and subsequently require remediation (for example, require a hotfix, rollback, fix forward, patch)?

Change failure rate measures the percentage of releases that cause a failure, bug, or error: this metric tracks release quality and highlights where testing processes are falling short. A sophisticated release process should afford plenty of opportunities for various tests, reducing the likelihood of releasing a bug or breaking change.

Change failure rate acts as a good control on the other DORA metrics, which tend to push teams to accelerate delivery with no guarantee of concern for release quality. If your data for the other three metrics show a positive trend, but the change failure rate is soaring, you have the balance wrong. With a high change failure rate, those other metrics probably aren’t giving you an accurate assessment of progress in terms of your real goal: continuous delivery of value to your customers.

As with the mean time to recover, change failure rate can—indeed should—be positively impacted by deployment frequency. If you make the same number of errors but deploy the project across a greater number of deployments, the percentage of deployments with errors will be reduced. That’s good, but it can give a misleading sense of improvement from a partial picture: the number of errors hasn’t actually reduced. Perhaps some teams might even be tempted to reduce their change failure rate by these means artificially!

Change failure rate should assess whether your team is continuously improving regarding testing. For example, are you managing to ‘shift left’ and find errors earlier in the release cycle? Are your testing environments close replicas of production to effectively weed out edge cases? It’s always important to ask why your change failure rate is reducing and consider what further improvements can be made.

The Big Picture Benefits of DevOps

Rightly, DORA metrics are recognized as one of the DevOps industry standards for measuring maturity. However, if we think back to Goodhart’s Law and start to treat them as targets rather than metrics, you may end up with a misleading sense of project headway, an imbalance between goals and culture, and releases that fall short of your team’s true potential.

It’s difficult to talk about DORA metrics without having the notion of targets in your head; that bias can slowly creep in and before long you’re unknowingly talking about them in terms of absolute targets. To proactively avoid this slippery slope, focus on the trends in your metrics – when tweaking your team’s process or practices, relative changes in your metrics over time give you much more useful feedback than a fixed point-in-time target ever will; let them be a measure of your progress.

If you find yourself in a team where targets are holding you hostage from changing your process, driving unhelpful behaviours, or so unrealistic that they’re demoralising your team, ask yourself what context is missing that makes them unhelpful. Go back and question what problem you’re trying to solve – and are your targets driving behaviours that just treat symptoms, rather than identifying an underlying cause? Have you fallen foul of setting targets too soon? Remember to measure first, and try not to guess.

When used properly, the DORA metrics are a brilliant way to demonstrate your team’s progress, and they provide evidence you can use to explain the business value of DevOps. Together, these metrics point to the big-picture benefits of DevOps: continuous improvements in the velocity, agility, and resilience of a development and release process that brings together developers, business stakeholders, and end users. By observing and tracking trends with DORA metrics, you will have made a good decision that supports your teams and drives more value back to your customers.

Dark Side of DevOps – the Price of Shifting Left and Ways to Make it Affordable

Key Takeaways

  • DevOps adoption and Shifting Left movement empower developers, but also create additional cognitive load for them
  • A set of pre-selected best practices, packaged as a Paved Path – can alleviate some of the cognitive load without creating unnecessary barriers
  • However, as companies evolve, Paved Path(s) have to evolve to keep up with changes in technology and business needs – one Paved Path may not be enough
  • Eventually a company needs to separate responsibilities between experts (responsible for defining Paved Paths) and developers (who can use Paved Paths to remove routine and concentrate on solving business problems)
  • It is important to evaluate which stage of DevOps journey your company is at, so no effort is wasted on solving a “cool” problem instead of the problem you actually have

Topics like “you build it, you run it” and “shifting testing/security/data governance left” are popular: moving things to the earlier stages of software development, empowering engineers, and shifting control bring proven benefits.

Yet, what is the cost? What does it mean for the developers who are involved?

The benefits for developers are clear: you get more control, you can address issues earlier in the development cycle, and you shorten the feedback loop. However, your responsibilities are growing beyond your code – now they include security, infrastructure and other things that have been “shifted left”. That’s especially important since the best practices in those areas are constantly evolving – the demand for upkeep is high (and so is the cost!).

What are the solutions that can help you keep the benefits of DevOps and Shifting Left? What can we do to break the grip of the dark side? Let’s find out!

The Impact on Developers of Shifting Left Activities

When we shift some of the software development lifecycle activities left, that is, when we move them to an earlier stage of our software development process, we can empower developers. According to the State of DevOps report, companies with the best DevOps practices show more than a 50% reduction in change failure rates, despite having higher deployment frequency (multiple deployments a day for top performers vs. deploying changes once in 6+ months for bottom performers).

The reason is obvious – the developers in top-performing companies do not have to suffer from a long feedback cycle. For example, when we shift testing from the “Deploy & Release” stage to the “Develop & Build” stage, developers won’t have to wait days or even weeks for QA to verify their changes, and they can catch bugs earlier. If we shift testing further left – to the “Plan & Design” stage – developers won’t have to spend their time building code that is defective by design.

However, it is not a silver bullet – Shifting Left also means that developers have to learn things like testing methodologies and tools (e.g., TDD, JUnit, Spock, build orchestration tools like GitHub Actions).  

On top of that, Shifting Left doesn’t stop with testing – more and more things are shifted left, for example security or data governance. All this adds to the developers’ cognitive load. Developers have to learn these tools, adopt best practices, and keep their code and infrastructure up-to-date as those best practices change.

Growth of Responsibilities

On the one hand, not having a gatekeeper feels great. Developers don’t have to wait for somebody’s approval – they can iterate faster and write better code because their feedback loop is shorter, and it is easier to catch and fix bugs.

On the other hand, the added cognitive load is measurable – all the tools and techniques that developers have to learn now require time and mental effort. Some developers don’t want that – they just want to concentrate on writing their own code, on solving business problems.

For example, one developer may be happy to experiment with deployment tools, to be able to migrate a deployment pipeline from Jenkins to Spinnaker to get native, out-of-the-box canary support. However, other developers may not be so excited about having to deal with those tools, especially if they already have a lot on their plate.

Steps in a DevOps Journey – Ad-Hoc DevOps Adoption

These additional responsibilities don’t come for free. The actual cost, though, depends on the size of the company and on the complexity of its computing infrastructure. For smaller companies, where a handful of people are working together on just a few services/components, the cognitive load is minimal. After all, everybody knows everybody else and what they are working on; exchanging context doesn’t take a lot of effort. In this condition, removing artificial barriers and empowering developers by shifting some of the SDLC activities left can bring immediate benefits.

Even in large companies like Netflix there is room for manually managed solutions. For example, if a particular application has caches that have to be warmed up before it can start accepting production traffic, then a team that owns that application can create a manually managed deployment pipeline to make sure that a custom warm-up period is observed before the application can start taking traffic (so a redeployment won’t cause a performance degradation). 

Steps in a DevOps Journey – Paved Path

However, as companies grow, so does the complexity of their IT infrastructure. Maintaining dozens of interconnected services is no longer a trivial task. Even locating their respective owners is not so easy. At this point, companies face a choice: either reintroduce gatekeeping practices that negatively affect productivity, or provide a paved path – a set of predefined solutions that codifies best practices and takes away mental toil, allowing developers to concentrate on solving business problems.

Creating and maintaining a paved path requires investment – someone has to identify the pain points, organise the tooling for the paved path that developers interact with, write documentation, and invest in developers’ education. Someone also needs to monitor the outliers – if applications that are not using the paved path solution perform better, maybe it is worth borrowing whatever they are doing and incorporating it into the paved path?

The investment in the paved path is balanced by the decrease in cognitive load for developers. Developers can stop worrying about all the things that are being shifted left and concentrate on delivering value, on solving core business problems. For example, they can rely on CLI tools (such as Netflix’s newt) or an internal developer portal (like Spotify’s Backstage) to bootstrap all the required infrastructure in one click – no more manual setup of GitHub repositories, CI/CD pipelines, and cloud resources like AWS security groups and scaling policies. All this can – and should – be provided out of the box.

On top of that, if the experts take care of migrations, developers don’t have to deal with the cognitive load caused by those migrations. For example, when streaming backend services at Netflix had to be updated to pick up a new version of the AWS instance metadata API to improve security, the existing paved path was changed transparently for the developers. Encapsulating AWS settings as code allowed rolling out the change to the services that used the paved path with zero additional cognitive load and zero interruptions.

Steps in a DevOps Journey – Multiple Composable Paved Paths

Finally, as the companies keep growing, one paved path cannot encompass all the diversity of best practices for the services and components that are being developed. Multiple paved paths have to be created and maintained, and common parts between these paths have to be identified and reused, requiring additional investment.

For example, Netflix streaming services have to serve millions of users, so their latency and availability requirements are drastically different from internally facing tools that are used by Netflix employees. Both groups will have different best practices – the streaming apps will benefit from canaries and red-black deployments, which don’t make much sense for the internal tools (since they may not have enough traffic to provide a reliable signal for canaries). On the other hand, these two groups may have common practices, like continuous testing and integration, and it makes sense to codify those practices as separate building blocks to be reused across different paved paths.

Next Steps in YOUR DevOps Journey

The first step is identifying the actual problem caused by shifting left. If you have a small cross-functional team, the hardest problem may be identifying what you need to shift left. For example, your team may be suffering from deployment failures – in this case you may need to invest in testing and CI/CD pipelines.

Even if you don’t find the right solution right away, migrating from one solution to another does not require that much time and effort. The worst that can happen is choosing the wrong pain point – the wrong problem to solve. However, if you find the right leverage point, then even an imperfect solution can improve things. For example, if you realise that Jenkins-based CI pipelines are hard to manage and it would be better to migrate to GitHub Actions, this kind of migration is not prohibitive when you are dealing with just a handful of services.

The actual developer experience plays a crucial role here – it always makes sense to listen to developers, to observe their work so you can find out what the main problem is.

Without that, you may end up chasing the latest trend instead of the problems you actually have. I’ve seen people decide to copy practices that are popular in large companies, and sometimes it is the right call, but often you don’t even have the same kind of problem that other companies do. Sometimes things that are anti-patterns for other companies can be the right fit for you. For example, you may not need to invest in microservices orchestration when your problem is simple enough to be solved with a monolith (even though a monolith is very often considered a genuine anti-pattern).

Another thing to consider is the build-vs-buy problem. Many companies suffer from Not Invented Here syndrome, coming up with custom solutions instead of choosing a third-party tool. On the other hand, even third-party tools have to be integrated with each other and with existing internal tools to provide a seamless developer experience. (Personally, I think that if the problem is either unique to your company or close to your company’s core business, it may be worth investing in a custom solution.)

Finally, to make sure you’re heading in the right direction, it makes sense to keep track of metrics like Deployment Frequency (how often do you deploy to production? Once a month? Once a week? Daily?) and Change Failure Rate (how often do your deployments break?). Another set of metrics to monitor is time to bootstrap a service and time to reconfigure a service (or a fleet of services).
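
For teams that want to start simple, these two metrics can be derived from little more than a log of deployments. Below is a minimal Python sketch with a hypothetical record format; adapt the data source to whatever your CI/CD tooling actually exports.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Deployment:
    day: date
    caused_failure: bool  # e.g. the deploy triggered a rollback, hotfix or incident

def deployment_frequency(deployments: list[Deployment], period_days: int) -> float:
    """Average number of production deployments per day over the observed period."""
    return len(deployments) / period_days

def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that led to a failure in production."""
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.caused_failure)
    return failed / len(deployments)

# Illustrative 30-day window: 12 deployments, 2 of which caused incidents.
history = [Deployment(date(2023, 5, i + 1), caused_failure=(i in (3, 7)))
           for i in range(12)]
print(f"Deployment frequency: {deployment_frequency(history, 30):.2f} per day")
print(f"Change failure rate:  {change_failure_rate(history):.0%}")
```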

Together, those metrics can help you estimate your ability to deliver quality code as well as the time developers spend on non-coding activities – and hopefully, they can help you make the next step in your DevOps journey.

Assessing Organizational Culture to Drive SRE Adoption 

Key Takeaways

  • SRE adoption is greatly influenced by the organizational culture at hand. Therefore, assessing the organizational culture is an important step to be done at the beginning of an SRE transformation.
  • The Westrum model of organizational cultures can be used to assess an organization’s culture from the production operations point of view. The six aspects of the model – cooperation level, messenger training, risk sharing, bridging, failure handling and novelty implementation – relate directly to SRE.
  • Westrum’s performance-oriented generative cultural substrate turned out to be a fertile ground for driving SRE adoption and achieving high performance in SRE.
  • Subtle culture changes in the teams during SRE adoption accumulate to a bigger organizational culture change where production operations is viewed as a collective responsibility because different roles in different teams are aligned on operational concerns.
  • Both formal and informal leadership need to work together to achieve the SRE culture change providing consistency, steadiness and stability amidst the very dynamic nature of the change at hand.

Introduction

The teamplay digital health platform and applications at Siemens Healthineers are built by a large distributed organization consisting of 25 teams owning many different digital services in the healthcare domain.

The organization underwent an SRE transformation, a profound sociotechnical change that reshaped the technology, processes and culture of production operations. In this article, we focus on:

  • How the organizational culture was assessed in terms of production operations at the beginning of the SRE transformation
  • How a roadmap of small culture changes accumulating over time was created, and
  • How the leadership facilitated the necessary culture changes

The need to assess the organizational culture

When it comes to introducing SRE, it is easy to jump into the tech part of the change and start working on implementing new tools, infrastructure and dashboards.

Undoubtedly necessary, these artifacts alone are not sufficient to sway an organization’s approach to production operations. An SRE transformation is profoundly a sociotechnical change.

The “socio” part of the change needs to play an equal role from the beginning of the SRE transformation.

In this context, it is useful to assess the organization’s current culture, viewing it from the lens of production operations. This holds the following benefits:

  • a) It enables the SRE coaches driving the transformation to understand current attitudes towards production operations in the organization
  • b) It reveals subtle, sometimes hardly visible, ways the organization operates in terms of information sharing, decision-making, collaboration, learning and others that might speed up or impede the SRE transformation
  • c) It sparks ideas about how the organization might be evolved towards SRE and enables first projections of how fast the evolution might go

Given these benefits, how to assess the organizational culture from the production operations point of view? This is the subject of the next section.

How to assess the organizational culture?

A popular typology of organizational cultures is the so-called Westrum model by Ron Westrum. The model classifies cultures as pathological, bureaucratic or generative depending on how organizations process information:

  • Pathological cultures are power-oriented
  • Bureaucratic cultures are rule-oriented, and
  • Generative cultures are performance-oriented

Based on the Westrum model, Google’s DevOps Research and Assessment (DORA) program found through rigorous studies that generative cultures lead to high performance in software delivery. According to the Westrum model, the six aspects of the generative, high-performance culture are:

  1. High cooperation
  2. Messengers are trained
  3. Risks are shared
  4. Bridging is encouraged
  5. Failure leads to inquiry
  6. Novelty is implemented

These six aspects can be used to assess an organization’s operations culture. To approach this, the six aspects need to be mapped to SRE in order to understand the target state of culture. The mapping below, based on my book “Establishing SRE Foundations”, relates each aspect of Westrum’s generative culture to SRE.

1. High cooperation: SRE aligns the organization on operational concerns. This is only possible if high cooperation is established between product operations, product development and product management. Executives cooperate with the software delivery organization by supporting SRE as the primary operations methodology. This is necessary to achieve the standardization that leads to economies of scale, justifying the investment in SRE.

2. Messengers are trained: SRE quantifies reliability using SLOs. Once the corresponding error budgets are exhausted, the teams owning the services are trained on how to improve reliability. Moreover, the people on-call are trained to be effective at being on-call, which includes acting quickly to reduce error budget depletion during outages. Postmortems after outages are viewed as a learning opportunity for the organization.

3. Risks are shared: Product operations, product development and product management agree on SLIs that represent service reliability well from the user point of view, on SLOs that represent a threshold of good reliability UX, and on the on-call setup required to run the services within the defined SLOs. This leads to shared decision-making on when to invest in reliability vs. new features to maximize delivered value. Thus, the risks of the investments are shared.

4. Bridging is encouraged: SLO and SLA definitions are public in the organization, as is the SLO and SLA adherence data per service over time. This leads to data-driven conversations among teams about the reliability of dependent services. An SRE community of practice (CoP) cross-pollinates SRE best practices among the teams and organizes organization-wide lunch & learn sessions on reliability.

5. Failure leads to inquiry: Postmortems after outages are used for blameless inquiry into what happened, with a view to generating useful learnings to be spread throughout the organization.

6. Novelty is implemented: New insights from ongoing product operations, outages and postmortems lead to a timely implementation of new reliability features, prioritized against all other work according to error budget depletion rates.

With the target culture state defined in the mapping above, the SRE coaches can analyze how far away from it their organization currently is.

Accumulating small culture changes over time

Once we, the SRE coaches, understood the status quo, we began the SRE transformation activities. These included technical, process and behavior changes. To fuel the movement, the SRE coaches need to look for small behavior changes, celebrate them and stagger them in such a way that they accumulate over time.

For example, the following sequence of small changes can incrementally lead to bigger behavior changes over time, pushing the culture more and more toward the target state outlined in the previous section. For each change, the immediate culture impact and the way that impact accumulates over time are described below.

1. Putting SRE on the list of bigger initiatives the organization works on. Culture impact: awareness of SRE and its promise at all levels of the organization. Accumulated over time: acceptance of the potential usefulness of SRE and open-mindedness towards it.

2. Establishing SRE coaches. Culture impact: SRE is perceived as a serious, bigger initiative driven by dedicated people throughout the organization. Accumulated over time: the SRE go-to people are known in the organization.

3. Setting initial SLOs. Culture impact: the first reliability quantification is undertaken; reliability starts being thought of as something that can be quantified. Accumulated over time: SRE has its concepts, and the central concept of the SLO is now something we define for our services.

4. Reacting to alerts on SLO breaches. Culture impact: developers no longer only code but also spend time monitoring their services in production. Accumulated over time: breaching the defined SLOs leads to alerts that developers spend time analyzing, so the SLOs need to be designed very carefully to reflect the customer experience; lots of SLO breaches lead to lots of time being spent on their analysis!

5. Setting up alert escalation policies. Culture impact: an SLO breach alert is so significant that it must reach someone who can react to it. Accumulated over time: reaction to an SLO breach needs to happen in a timely manner, otherwise an escalation policy kicks in!

6. Implementing incident classification. Culture impact: incidents need classification to drive appropriate mobilization of people in the organization. Accumulated over time: mobilizing people to troubleshoot an incident depends on the incident classification.

7. Implementing incident postmortems. Culture impact: incidents warrant spending time on understanding what really happened, why, and how to avoid the same incident in the future. Accumulated over time: incidents do not just come and go; they are carefully analyzed after being resolved, inducing a learning cycle in the organization.

8. Setting up error budget policies. Culture impact: error budget consumption is tracked and, once it hits a certain threshold, becomes subject to a predefined policy of action. Accumulated over time: lots of SLO breaches can add up to significant error budget consumption, and a policy ensures that consumption does not exceed the defined thresholds.

9. Setting up error budget-based decision-making. Culture impact: prioritization decisions about reliability are based on production data tracking error budget consumption over time. Accumulated over time: different people at different levels of the organization use the error budget consumption data to steer reliability investments.

10. Implementing an organizational structure for SRE. Culture impact: SRE is so widely established in the organization that a formal structure with roles, responsibilities and organizational units is created. Accumulated over time: SRE is now a standard operations methodology, reflected even in organizational structure and processes.

The culture changes outlined above are driven using an interplay of formal and informal leadership. These dynamics are described in the next section.

Interplay of formal and informal leadership

In every hierarchical organization, there are leaders who possess formal authority due to their placement in the organizational chart. If these leaders are trusted by the broader organization, they enjoy a multiplication effect on their efforts thanks to a large following of people in the organization.

At the same time, lots of hierarchical organizations have informal leaders who do not possess formal authority because they do not have a prominent place in the organizational chart. They have, however, earned trust from the overall organization. This trust enables them to also enjoy a multiplication effect on their efforts because a large number of people in the organization follow them voluntarily.

Formal and informal leadership types, and the kind of following each enjoys, can be summarized as follows:

  • Formal leadership enjoying trust from the organization (supportive of the SRE transformation): a large following of people that is both voluntary and authority-based.
  • Formal leadership without trust from the organization (detrimental to the SRE transformation): a following of people based only on formal authority.
  • Informal leadership enjoying trust from the organization (supportive of the SRE transformation): a large voluntary following of people in the organization.

A good combination of the first and third leadership types provides the necessary environment to push SRE through the organization in a balanced top-down and bottom-up manner. It caters for the required consistency, steadiness and stability amidst the very dynamic nature of the SRE transformation. The teams feel that the formal leadership supports SRE, while informal leaders help drive the necessary mindset, technical and process changes throughout the organization. This maximizes the chances of success for the SRE transformation.

From the trenches

The culture assessment method described above helped the Siemens Healthineers digital health platform organization successfully evolve operations towards SRE. In this section, we present a few real learnings from the trenches of our SRE transformation.

Learning 1: Involve the product owners from the beginning

One of the most profound things we got right was to involve the product owners in the SRE transformation from the beginning. The SRE value promise for the product owners is to reduce the customer escalations they might experience due to the digital services not working as expected. Escalations are annoying, time-consuming and attract unwanted management attention. This motivates the product owners to attend SRE meetings where the SLOs are defined and the associated processes are discussed.

The product owners in SRE meetings:

  • Provided context of the most important customer journeys from the business point of view
  • Assessed the business value of higher reliability at the cost discussed in the meetings
  • Got closer to production operations by being involved in SRE discussions from the start
  • Developed an understanding of how to prioritize investments in reliability vs. features in a data-driven way

Learning 2: Get the developers’ attention onto production first

The major problem with organizations new to software as a service is that developers are not used to paying attention to production. Rather, traditionally their world starts with a feature description and ends with a feature implementation. Running the feature in production is out of scope. This was the case with our organization at the beginning of the SRE transformation.

In this context, the most important impactful milestone to achieve at the beginning of the SRE transformation was to channel the developers’ attention onto production. This was an 80/20 kind of milestone, where 20% of the effort yields 80% improvement.

It was less important to get the developers to be perfect about their SLO definitions, error budget policy specifications, etc. Rather, it was about supplying the developers with the very basic tools and the initial motivation to move their attention to production. Regularly spending time on production analysis was half the battle in acquiring the new habit of operating software.

Once there, the accuracy of applying the SRE methodology could be brought about step by step.

Learning 3: Do not fear letting the team fail fast at first

When it comes to the initial SLO definitions, our experience was that teams tended to overestimate the reliability of their services at first. They tended to set higher availability SLOs than their services achieved on average. Likewise, they tended to set stricter latency SLOs than the services could fulfill.

Convincing the teams at this initial stage to relax the initial SLOs was futile. Even the historical data sometimes did not convince the teams. We found that a fail-fast approach actually worked best.

We set the SLOs as suggested by the teams, without much debate. Unsurprisingly, the teams got flooded with alerts on SLO breaches. Inevitably, the big topic of the next SRE meeting was the sheer number of alerts the team could not process.

This made the team fully understand the consequences of their SLO decisions. In turn, the SLO redefinition process got started. And this was exactly what was needed: a powerful feedback loop from production on whether the services fulfill the SLOs or not, leading to a reevaluation of the SLOs.
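
A quick back-of-the-envelope calculation shows why overly optimistic SLOs translate so directly into alert floods. The sketch below (Python, with illustrative numbers rather than actual figures from our services) converts an availability SLO into an error budget over a 30-day window.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a given availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes

for slo in (0.999, 0.995, 0.99):
    print(f"SLO {slo:.1%}: about {error_budget_minutes(slo):.0f} minutes of downtime per 30 days")

# A service that actually runs at roughly 99.5% availability exceeds the
# ~43-minute budget of a 99.9% SLO about five times over, which is exactly
# the kind of alert flood the teams ran into with their initial SLOs.
```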

Learning 4: Build a coalition of formal and informal leaders

We found it very useful to have a coalition of formal and informal leaders championing SRE in the organization. The informal leaders were self-taught about SRE and bursting with energy to introduce it in the organization. To do so, they required support from the formal leadership to commit capacity in the teams for SRE work.

The informal leaders needed to sell SRE to the formal leaders on the promise of reducing customer escalations due to service outages. These conversations happened with the head of R&D and head of operations. In turn, these leaders needed to sell SRE to the entire leadership team so that the topic gets put onto a portfolio list of big initiatives undertaken by the organization.

Once that happened, there was a powerful combination of enough formal leaders supporting SRE, SRE being on the list of big initiatives undertaken by the organization and an energized group of informal leaders ready to drive SRE throughout the organization.

This organizational state was conducive to achieving successful production operations using SRE!

Summary

An SRE transformation is a large sociotechnical change for a software delivery organization that is new to or just getting started with digital services operations. The speed of the change is largely determined by the organizational culture at hand. It is people’s attitudes to and views about production operations that are the highest mountains to move, not the tools and dashboards used by the people on a daily basis.

Therefore, assessing the organizational culture before embarking on the SRE transformation is a useful exercise. It enables the SRE coaches driving the transformation to understand where the organization currently is in terms of operations culture. It further ignites a valuable thinking process of how it might be possible to evolve the culture towards SRE.

When DevOps Meets Security to Protect Software

Key Takeaways

  • Since security professionals are scarce and outnumbered by developers, automation and DevSecOps practices are key to building secure software.
  • Security is not an afterthought: DevSecOps emphasizes the importance of integrating security into every stage of the development process.
  • Collaboration is key: DevSecOps requires collaboration between development, operations, and security teams to ensure security is considered throughout the development process.
  • Software supply chain compromise is an emerging security issue that must be addressed.
  • State-sponsored actors have added complexity to the ever-evolving threat landscape; now more than ever, organizations need to ensure security at every stage of the SDLC, and DevSecOps practices go a long way towards meeting this requirement.

I have now been working in cybersecurity for about two years and the lessons have been immense. During this time, my morning routine has largely remained unchanged: wake up, get some coffee, open my Spotify app, and get my daily dose of CyberWire Daily news. Credit where due, Dave Bittner and his team have done an amazing job with the show. One thing has remained constant over the years: the cyber attacks, the data breaches, and the massive theft and sale of the personally identifiable information of many people on the dark web.

We continue to hear the outcry from the cybersecurity world about the acute shortage of cyber talent. Many organizations are putting in place initiatives to try and fill this gap and even train and retain cybersecurity talent. The truth though is that the developers outnumber security professionals tremendously. So how did we get here? And more importantly, what is the industry doing to address these cybersecurity concerns of their platforms and products?

Historically, in many organizations security has been treated as an afterthought. It has always been one of those checklist items at the end of development, given the least priority and effort. The development team works on their software together with the operations team for deployment and maintenance. This was mostly successful and became widely known as the DevOps process: a collaboration between the developers and the operations teams, working side by side to build, ship, and maintain software. However, security was often not a priority with DevOps. In addition, software development, deployment, and maintenance have continually increased in complexity at scale.

Nowadays, security can no longer be an afterthought, and this has become the general consensus among most professionals in the technology space. In fact, HackerOne has noted that fixing security defects in production is much more expensive than fixing them in development. It is becoming a standard practice in the SDLC to ensure security is a consideration during development.

Security and privacy are, more than ever, necessary components of all software. This is a challenge as it is not easy to change age-old processes overnight. But the change is needed as security incidents grow in volume and data breaches continue to be extremely expensive for organizations. A lot of work is being put in place to ensure security is ingrained in all products right from ideation. This is commonly referred to as shifting security left or DevSecOps.

Most companies are committed to churning out secure software for their customers. Currently, the speed at which new software is produced is lightning-fast. In many settings, developers are pushing new updates to products at hourly intervals. There needs to be a delicate balance between ensuring that the speed of innovation is not slowed down and that the security of products is not compromised. In recent years, there has been a huge push, even by governments, for better security in SDLC processes. This has given rise to new practices in application and software supply chain security.

The Challenges

What are some of the challenges currently with how everything is set up? If organizations are committed to security and privacy, why are they still developing vulnerable software? This is not a question with a single answer. There are a lot of possibilities, but for the sake of this article, I will focus on a few points that most people tend to agree with.

The “Cybersecurity Tech Talent Crisis”

There have been reports and never-ending discussions about the shortage of experienced cybersecurity professionals. This is a contributing factor to the security challenges experienced by most organizations.

Advanced Persistent Threats

Nation-state-sponsored threat actors have become commonplace in recent years. These sophisticated actors continue to stealthily wreak havoc, breaching defenses and causing chaos along the way.

Attribution largely remains a complex topic, but most security researchers have pointed fingers at certain Asian and Eastern European state actors. Andy Greenberg’s book Sandworm provides great detail and insight into this, with first-hand accounts from Robert M. Lee, a renowned industrial control systems security engineer.

Ransomware Attacks

In the past couple of years, there has been an increase in ransomware attacks on various institutions and infrastructure. The criminals break into networks, encrypt data, and demand colossal amounts of money in exchange for the decryption keys. The Conti ransomware gang is perhaps the most prolific actor out there, affiliated with some high-profile and particularly damaging attacks.

Cloud Security Risks

Most of us probably still remember the Capital One hack. Cloud misconfigurations continue to present organizations with a huge security challenge. The cloud has been described by some as the wild west when proper guardrails are not put in place.

Software Supply Chain Attacks

This topic has attracted a lot of attention lately. Some of the common ways bad actors carry out these attacks include: compromising the software build tools and/or the infrastructure used for updates, as in the case of SolarWinds, where malicious code was slipped into the vendor’s build and update process; baking compromised code into hardware or firmware; and stealing code-signing certificates, which are then used to sneak malicious applications into download stores and package repositories.

DevSecOps and Software Supply Chain Security to the Rescue?

Is DevSecOps the answer to all the security challenges? Sadly, the answer is no; there is no single magic bullet that addresses all the security challenges faced by different organizations. However, it is generally agreed that DevSecOps practices and securing the software supply chain play a pivotal role in greatly reducing software and application security risks.

DevSecOps advocates for knitting security into the software development process, from inception all the way to release. Security is considered at each stage of the development cycle. A lot of initiatives have been put in place to help drive this adoption. My two favorites, which are also big within the organization that I work for, are:

  • Automation of Security Testing: Building and automating security testing within CI/CD pipelines greatly reduces the human effort needed to review each code change. DAST (dynamic application security testing), SAST (static application security testing), IAST (interactive application security testing), and even SCA (software composition analysis) scans are pivotal in modern-day security scanning within pipelines. In most cases, if any security defect is detected, that particular build fails, and the deployment of potentially vulnerable software is subsequently stopped (a minimal sketch of such a gate follows this list).
  • Security Champions: This is by far my favorite practice. Perhaps it has different names in different organizations. However, training developers to have a security mindset has turned out to be highly beneficial, especially in helping to address the cybersecurity talent shortage problem.
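
To make the first practice more concrete, here is a minimal sketch of what such a gate can look like in a pipeline step. It assumes the scanner can emit a SARIF report (a common output format for SAST tools); the file name and severity threshold are placeholders to adapt to your own setup.

```python
import json
import sys

BLOCKING_LEVELS = {"error"}  # add "warning" here for a stricter gate

def count_blocking_findings(sarif_path: str) -> int:
    """Count scanner results at a severity level that should block the build."""
    with open(sarif_path) as f:
        report = json.load(f)
    count = 0
    for run in report.get("runs", []):
        for result in run.get("results", []):
            if result.get("level", "warning") in BLOCKING_LEVELS:
                count += 1
    return count

if __name__ == "__main__":
    findings = count_blocking_findings("scan-results.sarif")  # assumed file name
    if findings:
        print(f"{findings} blocking security finding(s): failing the build")
        sys.exit(1)  # the non-zero exit code stops the pipeline stage
    print("No blocking findings, proceeding with the deployment")
```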

In general, to implement DevSecOps, organizations often adopt a set of best practices that promote security throughout the software development lifecycle. Some key practices are summarized below:

Security as Code

Security should be integrated into the codebase and treated as code, with security policies and controls written as code and stored in version control. Code-driven configuration management tools like Puppet, Chef, and Ansible make it easy to set up standardized configurations across hundreds of servers. This is made possible by using common templates, minimizing the risk that bad actors can exploit one unpatched server. Further, this minimizes any differences between production, test, and development environments.

All of the configuration information for the managed environments is visible in a central repository and under version control. This means that when a vulnerability is reported in a software component, it is easy to identify which systems need to be patched. And it is easy to push patches out, too.
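
As a small illustration of that last point, once configuration and package facts live in a repository, finding the systems that need a patch becomes a simple query. The sketch below assumes a hypothetical JSON inventory export; substitute the facts your Puppet, Chef, or Ansible setup actually produces.

```python
import json

# Hypothetical example: a component and the affected version we need to find.
VULNERABLE_COMPONENT = ("openssl", "1.1.1a")

def hosts_to_patch(inventory_path: str) -> list[str]:
    """Return hosts whose inventory lists the affected component version."""
    with open(inventory_path) as f:
        # Assumed layout: {"host-name": {"packages": {"name": "version", ...}}, ...}
        inventory = json.load(f)
    name, bad_version = VULNERABLE_COMPONENT
    return [host for host, facts in inventory.items()
            if facts.get("packages", {}).get(name) == bad_version]

if __name__ == "__main__":
    for host in hosts_to_patch("inventory.json"):
        print(f"{host} needs a patch for {VULNERABLE_COMPONENT[0]} {VULNERABLE_COMPONENT[1]}")
```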

I was once part of a team that was tasked with ensuring cloud operational excellence. At the time, there was a problem with how virtual machines were spun up in the cloud environment. We needed a way to control which type and version of AMIs could be launched in the environment. Our solution was to create golden images according to the CIS Benchmark standards. We then implemented policies to restrict the creation of virtual machines to only our specific, golden Amazon Machine Images.

In the end, we had EC2 instances that adhered to best practices. To further lock things down, we created Amazon CloudFormation templates, complete with the policies to be attached to the VMs. To achieve our goal, we leveraged these CloudFormation templates along with Service Control Policies (SCPs) to implement a standard and secure way of creating VMs. The point that I am trying to drive home with this story is that code-driven configuration can be used to ensure automated, standardized, and secure infrastructure deployment within environments.
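
For teams that want detection on top of prevention, roughly the same rule can also be expressed as an audit script. The sketch below uses boto3 to flag instances that were not launched from an approved image; the AMI ID is a placeholder, and in our case the preventive controls (SCPs and CloudFormation) remained the primary mechanism.

```python
import boto3

APPROVED_AMIS = {"ami-0123456789abcdef0"}  # placeholder golden image IDs

def non_compliant_instances() -> list[str]:
    """Return IDs of EC2 instances not launched from an approved golden AMI."""
    ec2 = boto3.client("ec2")
    offenders = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                if instance["ImageId"] not in APPROVED_AMIS:
                    offenders.append(instance["InstanceId"])
    return offenders

if __name__ == "__main__":
    for instance_id in non_compliant_instances():
        print(f"{instance_id} was not launched from a golden AMI")
```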

Automated Security Testing

Automated security testing should be performed as part of the continuous integration and deployment (CI/CD) pipeline, with tests for vulnerabilities, code quality, and compliance. With the right tooling, it is often possible to set up and run automated pentests against a system as part of the automated test cycle.

During my stint as a cybersecurity engineer, I have been involved in setting up these tools within a pipeline whose sole purpose was to conduct automated fuzzing and scans on applications to search for any known OWASP Top 10 vulnerabilities. A popular open-source tool that can be used to achieve similar results is OWASP ZAP.

At the time, this had two immediate results: the developers became more aware of the OWASP Top 10 vulnerabilities and actively tried to address them during development. On top of that, some of the routine and mundane tasks were taken off the plate of the already stretched application security team. This is a simple example of just how far automated testing can go in addressing the software security challenges we currently face.
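
As an illustration, a ZAP baseline scan can be wired into a pipeline step roughly as follows. This is only a sketch: the target URL and image tag are placeholders, and the exact options should be checked against the ZAP documentation for your version.

```python
import subprocess
import sys

TARGET = "https://staging.example.com"  # hypothetical test environment URL

# Run the packaged baseline scan from the official ZAP container image.
result = subprocess.run(
    ["docker", "run", "--rm", "-t", "ghcr.io/zaproxy/zaproxy:stable",
     "zap-baseline.py", "-t", TARGET],
    check=False,
)

# The baseline script exits with a non-zero code when warnings or failures
# are found, which is enough to fail this pipeline step.
sys.exit(result.returncode)
```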

Continuous Monitoring

Continuous monitoring of applications and infrastructure is essential to detect and respond to security threats in real-time. A popular design pattern is the implementation of centralized logging within an environment. Logs from networks, infrastructure, and applications are collected into a single location for storage and analysis. This can provide teams with a consolidated view of all activity across the network, making it easier to detect, identify and respond to events proactively.
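
A small building block of this pattern is emitting logs as structured events in the first place, so that the central pipeline (for example ELK, Loki, or a cloud logging service) can index and query them. The sketch below uses Python's standard logging module; the field names and the service name are assumptions to be aligned with your own log schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": "payments-api",   # hypothetical service name
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()  # in production, ship to the central collector instead
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("login failed for user id=42")  # becomes a queryable, structured event
```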

This article goes into great detail on some of the freely available solutions that are often used to implement centralized logging. Logging forms the foundation upon which metrics for monitoring can be set within an environment. The Target Breach from 2013 has often been used as a case study as to why proper investment in logging and monitoring is crucial.

Collaboration

Collaboration between development, operations, and security teams is critical to ensure security is considered throughout the development process. A lot of companies have embraced the agile way of working in a bid to give teams flexibility during software development; this further fosters collaboration within the teams.

DevSecOps is primarily intended to avoid slowing down the delivery pipeline. The DevSecOps methodology, being an evolution of DevOps, advocates for application developers, software release, and security teams to work together more efficiently, with more cooperation, in order to deliver applications with higher velocity while adhering to the security requirements. The end goal is to achieve delivery at scale with speed while ensuring collaboration and the free flow of ideas among the different involved teams.

Training and Awareness

Security training and awareness programs should be provided to all team members to ensure everyone understands their role in ensuring security. In my first job in cybersecurity, one of my key tasks was to implement a culture of secure coding within the organization. As a result, we decided to embark on massive training and awareness for the developers within the company. Our reasoning behind this decision was that if the simulated phishing tests carried out within most organizations usually work, then the same concept could be applied to developers: train their eyes to spot the common problems.

This was quite the task, but in the end, we had two avenues to implement this. The first was to use commercial software that integrates with IDEs and scans code as the developers write, providing them with suggestions to address any security defects in the code. This was a huge success.

The second thing we did was implement regular training for the developers. Every fortnight, I would choose a topic, prepare some slides, and perform a demo on how to leverage different security misconfigurations to compromise infrastructure. The two tools that were vital in achieving this were the PortSwigger Web Security Academy and Kontra, which provided a lot of practice on API and web security misconfigurations.

At the time of my departure from the organization, this was a routine event and developers were more aware of certain common security misconfigurations. We also leveraged Capture the Flag events and provided some incentives to the winning teams to keep everyone motivated. This was one of the most successful initiatives I have undertaken in my career, and it was a win for both parties: the developers gained crucial knowledge, and the security team had some work taken off their plates.

Preventing Software Supply Chain Attacks

Any time I come across these words, my mind automatically goes back to the 2020 SolarWinds attack and the 2021 Log4j vulnerability. Most people in the security world are quite familiar with these two, not only because of the damage they caused but also due to the fact that they struck right around the Christmas holidays! The conversations around software supply chain security gained a lot of traction with these two incidents.

There was already much talk about the software supply chain, but there was little traction in terms of actually putting in some measures to address this problem. The whole frenzy that was caused by the Log4j vulnerability seems to have been the push that was needed for organizations to act. The US National Security Agency has been giving continuous advisories to developers and organizations on how to better address this problem. The problem with software supply chain compromise is not one that will be addressed overnight. To date, we still see reports of malicious Python or JavaScript packages, and even malicious applications finding their way to the Google Play Store.

I came across one of the simplest analogies describing software supply chains as I was reading blogs and attempting to gain more understanding of this topic, back in 2020. The author of the blog compared software supply chain attacks to attacks on ancient kingdoms, where enemy soldiers would poison the common water well, rendering every person in that village weak and unable to fight.

This is quite an accurate picture of supply chain attacks. Instead of compromising many different targets, an attacker only has to taint the one common thing that many unknowing victims use. This makes the attacker’s work quite easy thereafter, and this is exactly what happened with SolarWinds, and later with the Log4j vulnerability. This is obviously extremely dangerous given how software is currently made: there is a huge dependence on open-source libraries and packages.

This post by John P. Mello Jr. provides ten high-profile software supply chain attacks we can learn from, including the one staged on Okta in 2022. According to the post, the compromises of npm and the Python Package Index (PyPI) alone affected over 700,000 customers. In this case, the attacker did not have to compromise each of the seven hundred thousand individual victims; they simply found a way to tamper with third-party software building components and enjoyed their loot. As in the poisoned well analogy, the consequences are felt downstream by anyone who uses the compromised packages. Third-party risk assessment is now standard practice in most organizations; however, it is still no defense against the poisoned well.

The big question is, how secure are these third-party libraries and packages? This is a topic that deserves its own article; there is simply so much to cover. With the problem highlighted, what are governments and private organizations doing to prevent these kinds of attacks from happening?

  • Zero Trust Architecture (ZTA): This concept encourages us to always assume a breach and act as if the attackers are already in our environment. Cisco Systems and Microsoft have some amazing products for implementing ZTA through passwordless authentication and continuous monitoring of authenticated users.
  • More organizations are opting to conduct regular third-party risk assessments. There have been cases in which organizations have been compromised by first compromising a less secure vendor or contractor, as was the case in the famous Target breach.
  • Enhanced security in build and update infrastructure should definitely be on top of the list. If properly implemented, attackers would technically not be able to tamper with and deliver vulnerable software or software updates to the downstream clients.
  • Proper asset inventory, together with a comprehensive software bill of materials (SBOM) for each application. When Log4j hit back in 2021, many of us were in much bigger trouble because we had no idea where to look in our environments, as we didn’t have an accurate SBOM for the various applications we were running. It is crucial to maintain this inventory because it is impossible to defend or investigate what you do not know about. A minimal SBOM check is sketched after this list.
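
To show how an SBOM pays off in such a moment, here is a minimal sketch that answers "are we shipping the affected component anywhere?" across a directory of CycloneDX SBOM files. The directory layout and the component and version used here are illustrative only.

```python
import json
import pathlib

# Illustrative: a Log4Shell-style search for one component and version prefix.
COMPONENT, AFFECTED_PREFIX = "log4j-core", "2.14"

def affected_applications(sbom_dir: str) -> list[str]:
    """Return the names of SBOM files whose component list matches the search."""
    hits = []
    for sbom_path in pathlib.Path(sbom_dir).glob("*.json"):
        sbom = json.loads(sbom_path.read_text())
        for component in sbom.get("components", []):
            if (component.get("name") == COMPONENT
                    and component.get("version", "").startswith(AFFECTED_PREFIX)):
                hits.append(sbom_path.stem)
                break
    return hits

if __name__ == "__main__":
    for app in affected_applications("./sboms"):
        print(f"{app} ships {COMPONENT} {AFFECTED_PREFIX}.x and needs attention")
```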

DevSecOps plays a huge role in ensuring secure software development, and this is becoming more apparent every day. Ensuring that software build materials remain safe is also key to preventing most software supply chain attacks. It remains to be seen how the threat landscape evolves; for now, however, we must focus on getting the basics right.

As cyber threats continue to evolve, it is essential to integrate security into the software development process. DevSecOps is a culture shift that promotes collaboration, shared responsibility, and continuous improvement, with security integrated into every stage of the development process. By adopting DevSecOps best practices, organizations can build more secure software faster and reduce the risk of security breaches, while also improving collaboration and reducing costs.