If you work anywhere near the fintech industry, you will know that the end-of-year holidays are just around the corner. Surges in network traffic are expected during the holiday season, putting strain on your services everywhere, from database connections to inter-service communication to transaction atomicity. Of course, you’re a diligent engineer, so you’re already preparing your business to withstand those heavy hits. Maybe you have a complex cluster of different applications running independently behind an API gateway, all intertwined with intricate critical paths.
So you gather your senior team and start sketching out measures you can take to get you safely through the holidays. Say you’ve concluded that the API gateway is a significant point of failure and have decided to scale this piece of software both vertically and horizontally, so it’ll run on a set of replicated instances operating on some beefed-up computing units. Not only that, but because it will be a financially critical period, you’ve decided to implement a feature freeze for the entire two-month season, even though you are an agile team with a robust CI/CD pipeline.
Winter is coming
So you have prepared yourself for battle. You have several resilience measures in place, extensive monitoring tools, and a dedicated squad on-call, ready to respond to any incident that may arise. You and your team are confident that your services will breeze through the traffic surge and, although you’re sure there will be some bumps along the way, everything should work out just fine. I mean, you’ve thoroughly analyzed every component involved, so what could go wrong, right?
When the time finally comes, all eyes are on your metrics, watching for spikes and inconsistencies. The first two weeks go by, along with an initial wave of heavy traffic. You notice some requests taking longer than usual, but there is nothing to worry about. Everything looks good so far. Until it doesn’t! Just around Christmas, another surge of heavy traffic comes in. But this time, several alarms go off, and the on-call team gets paged.
You put down your eggnog and rush to your monitoring dashboards just to see your super-sized API gateway cluster falling apart. One by one, the instances start going down, unable to take in any requests. You have rolling reboots configured, so your cluster manager starts cycling your instances, but that takes some time, and a good twenty minutes go by until everything is stable again. You run through the numbers and see you’ve lost thousands of requests, most of them financial transactions. You have to write up a postmortem, so you’d better forget about that eggnog.
Upon close inspection of your logs and metrics, you find a memory leak, most likely caused by a third-party dependency in your API gateway. This never happened before! Did someone manage to deploy a new release during the feature freeze period? Hmm… Not likely. No builds were run in the last month, so you scour through the commit history. To everyone’s surprise, this particular dependency was introduced nine months ago! How could this problem show up just now?
Complexity is a complex thing
While preparing your services for the shopping season, you’ve inspected each component separately, looking out for failure points where they connect directly to other parts of the system. These kinds of interactions can be easily modelled with unit and integration tests, so their expected behaviour can be safely guaranteed against errors. Hard hits to the gateway were dealt with by scaling this application up. You even went a step further and enforced a feature freeze so that no extra bugs could sneak in through your builds. All sensible measures were taken, along with deep discussions with an experienced team. So who is to blame?
You see, according to Lehman’s laws of software evolution, each feature implemented in your applications takes your codebase one step further down the complexity hole, as more layers and components are added. Slowly, your system outgrows its boundaries and starts interacting with other systems, each with its own stack and its own set of complexities, essentially turning it into an open system: a system that interacts with other systems. One hallmark of open systems is the emergence of uncharted behaviour, which can show up in pretty obvious ways, but also in very subtle ones.
In the example above, your scaled-up API gateway could have been hiding the memory leak in a continuous delivery context, where the applications are constantly being rebooted, effectively resetting their memory footprint. As soon as you feature-froze your pipeline, you created the ideal conditions for this bug to emerge. As counterintuitive as it may seem, the very measures taken to improve the application’s resilience were ultimately responsible for its failure.
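To make the mechanism concrete, here is a tiny back-of-the-envelope sketch in Python. Every number in it is made up for illustration (the leak rate, the traffic volume, the memory limit); the point is only that a slow leak never surfaces while restarts happen faster than the time to exhaust memory, and a two-month freeze breaks that condition.

```python
# Illustrative only: a slow per-request leak that never matters while
# deploys restart the process every few days, but exhausts memory
# during a two-month feature freeze. All figures are hypothetical.

LEAK_PER_REQUEST_KB = 2                 # imagined leak from a third-party dependency
REQUESTS_PER_DAY = 500_000              # imagined gateway traffic
MEMORY_LIMIT_KB = 16 * 1024 * 1024      # a 16 GiB instance

def days_until_oom() -> float:
    """Days of continuous uptime before the leak alone exhausts memory."""
    return MEMORY_LIMIT_KB / (LEAK_PER_REQUEST_KB * REQUESTS_PER_DAY)

def leaks_before_restart(restart_every_days: float) -> bool:
    """Does the process hit the memory limit between restarts?"""
    return restart_every_days >= days_until_oom()

print(round(days_until_oom(), 1))       # ~16.8 days of uptime until OOM
print(leaks_before_restart(7))          # False: weekly deploys mask the leak
print(leaks_before_restart(60))         # True: the feature freeze exposes it
```

With these invented figures, roughly seventeen days of uptime are enough to exhaust memory, so any deploy cadence shorter than that hides the bug completely.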
Therein lie the subtleties of complex systems: their stress points are systemic and not easily detected. They’re the result of complex interactions that are heavily dependent on initial conditions and have a high degree of state branching, so their behaviour is perceived as random. This chaotic interplay underlies every information exchange in complex structures, and it gets even more pronounced in the cloud-based microservice architectures so commonplace these days. So what can we do to gain more knowledge about those systemic points of tension? Embrace chaos as a disciplined practice!
Let there be chaos!
As you can see, there’s a direct relationship between software complexity and chaos, which, in turn, can obscure systemic failures. Local optimizations and resiliency measures have little to no effect in preventing them. Even proper unit and integration tests can handle little more than single points of failure. To some extent, functional tests can catch systemic failures, but you need to know they exist in the first place. And these can manifest in so many ways, with so many moving parts involved, that it’s improbable anyone on your team can know and understand all of them in advance. So instead of trying to figure out all the different ways your system can fail, you should be failing it yourself, purposefully creating unstable conditions for it to crash. This may sound a bit silly at first, but you can only understand your system’s weaknesses by actually seeing them happen.
It turns out Chaos Engineering is a safe and structured way to inject instabilities into your system while extracting helpful knowledge about its inner complexity and deep behaviour, and to come up with action plans to improve its overall resiliency. After reading and discussing several definitions for it, we at Pismo came up with our own formal understanding of what constitutes Chaos Engineering as a practice as well as an organizational culture:
“Disciplined experimentation to uncover systemic weaknesses and build confidence in the system’s ability to withstand unstable conditions in production.”
The wording is important here, so I’ll go through the terms we used, their meanings in a systemic resilience context, and how each of them contributes to forming a set of principles and tenets that should guide our chaos practice and objectives.
“Disciplined experimentation…” — As you start experimenting with Chaos Engineering, it becomes easier to see it as a playground where you shut things down just for the fun of it. And that part is indeed essential in getting the hang of it, in a sandboxed environment, of course. But the real value comes when you perform these experiments in a structured manner, writing down your motivations and objectives and tracing a clear path as to how you’ll extract knowledge from them. What metrics should you use? What is the nature of those metrics? Are they infrastructure-related or business-oriented? How fast can your team respond to alarms? Are there any useful alarms at all? These are the kinds of questions that you should ask while planning out your chaos sessions – that will show you the way to the relevant actions you can take to improve your system’s safety and stability. Also, it is worth noting that Chaos Engineering is not a form of testing for critical use cases. What sets those two concepts apart are their objectives and means. Chaos experiments aim for failures instead of stability. Unlike formal tests, there are no expected outcomes since its primary purpose is to reveal new ways your system can fail, as measured from its borders.
“… to uncover systemic weaknesses…” — As said before, single points of failure and expected behaviours of direct dependencies are easily covered with automated unit and integration tests. But the focus here should be systemic failures – the ones that can’t be known in advance. By causing instabilities, like network latency, loss of communication with other services, or even server time shifts, we can make the inherent chaos surface, revealing subtle but potentially disruptive edge case scenarios.
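As a toy illustration of the kind of instability mentioned above (not any particular tool’s API), here is a Python sketch that wraps an arbitrary service call with injected latency and occasional simulated communication failures. The `get_balance` function is a hypothetical stand-in for a real downstream dependency.

```python
import random
import time

def with_chaos(call, max_latency_s=0.3, failure_rate=0.1, rng=None):
    """Wrap a service call so it suffers injected latency and,
    occasionally, a simulated loss of communication."""
    rng = rng or random.Random()

    def chaotic(*args, **kwargs):
        time.sleep(rng.uniform(0, max_latency_s))   # injected network latency
        if rng.random() < failure_rate:             # injected communication failure
            raise ConnectionError("chaos: injected communication failure")
        return call(*args, **kwargs)

    return chaotic

# Hypothetical downstream call standing in for a real dependency.
def get_balance(account_id):
    return {"account": account_id, "balance": 100}

# A deliberately unreliable version of the same call, seeded for repeatability.
flaky_get_balance = with_chaos(get_balance, max_latency_s=0.05,
                               failure_rate=0.5, rng=random.Random(42))
```

Code that consumes `flaky_get_balance` now has to survive both slow responses and outright failures, which is exactly the edge-case behaviour we want to surface before production does it for us.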
“… and build confidence in…” — Ultimately, trust is what drives any business operations. So it’s crucial that we can rely on our system’s resilience, but also that we’re able to extend that trust to our customers as well. One of the goals of chaos experiments is to identify failures before they become outages. Each experiment iteration that verifies your assumption about your system’s behaviour increases your confidence in that hypothesis.
“… the system’s ability to withstand unstable conditions…” — To the disappointment of some, Chaos Engineering is not about creating chaos. As fun as it seems to pull the wires randomly and see the rising of the Chaos Kingdom from the depths of your systems, this approach is counterproductive in learning new things about its reliability. It turns out that chaos lies at the other end of the experimentation process. The overall idea is to forcefully create unstable conditions that make the inherent chaos visible, shedding light on systemic points of failure that would otherwise only be detected as incidents.
“… in production.” — Chaos Engineering lets you gain insight into the system you’re experimenting with. Your different environments might look the same at the infrastructure level. But there are undoubtedly important systemic distinctions between them – like network access patterns, payload content, database load, monitoring tools, incident readiness, and many others. These distinctions play decisive roles individually and ultimately affect the entire system. You can always run those experiments in a sandboxed environment, with no actual workload involved, and there is value in those. But ultimately, the outages we’re trying to prevent will happen in production. And unless you start running those experiments in a production environment, you won’t be building enough confidence in this context.
Is it for me?
Like any engineering discipline, Chaos Engineering relies on the scientific method to design and build structures and processes. So there is one piece that is fundamental to the whole puzzle: you must have metrics that express the current state of your system with a reasonable degree of detail.
“Without observability, you don’t have chaos engineering. You just have chaos.”
Charity Majors, CEO and co-founder of Honeycomb
If you’re serious about assessing the reliability of your system, you need numbers and evidence to back your assumptions and identify outlying behaviour. As a bonus, you can also use those metrics to set OKRs for safety and resilience. That is the first requirement you should meet while considering Chaos Engineering as an organizational practice.
Then you need to evaluate the complexity of your systems and teams. Can you afford dedicated squads to maintain application infrastructure and operate intricate monitoring tools? Do you have a solid development and deployment pipeline with a steady stream of tasks and builds? Are your services complex enough to justify the investment? Since chaos is an inherent part of every open system, you may benefit from Chaos Engineering practice if you have a reasonably complex organization.
At Pismo, we’re just beginning our journey toward those benefits. We’ve come up with an implementation path where each step incrementally adds more services and team interactions to (hopefully) learn more about their complexities and slowly build confidence in the process itself along the way. So we’re starting with single-team chaos sessions in a sandboxed environment. Still, we plan to evolve them into game days and automated sessions and ultimately take the whole process into production. Gremlin was chosen as our reliability-as-a-service provider since it is pretty much in line with the principles mentioned above and our vision of chaos practice as structured experimentation that allows for both methodology and freedom.
“If the result confirms the hypothesis, then you’ve made a measurement. If the result is contrary to the hypothesis, then you’ve made a discovery.”
Enrico Fermi, Italian physicist
The whole Chaos Engineering framework is built around organizations’ need to navigate their software complexities, so the process must support and foster the discovery of systemic responses and deep behaviour. Its canonical method suggests four steps, regardless of our particularities as a team.
1. Set a steady state for the service while planning the experiments. This should be based on data collected from real-world metrics and KPIs and reflect the service’s regular operation.
2. Build a hypothesis around this steady state, describing the expected behaviour under stressful conditions. This creates a baseline against which we can check the experiment’s results.
3. Inject instabilities into the system. Ideally, they should reflect scenarios one can expect from real-world usage, but assume nothing. For instance, we tend to think that cloud services are always available, but this is not always the case. Remember that managed services and component connections all boil down to inter-process communication and are susceptible to noise and failures.
4. Minimize the blast radius, that is, the reach of the injected failures across our services. Not only does this minimize the impact the experiments have on our users, but it also gives us a better signal-to-noise ratio, allowing unexpected behaviour to stand out from the steady state.
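Putting those four steps together, and with every metric name and threshold purely illustrative, a single experiment iteration might be orchestrated roughly like this:

```python
# A minimal, illustrative skeleton of one chaos experiment iteration.
# The metric, thresholds, and injection hooks are hypothetical stand-ins.

def run_experiment(read_error_rate, inject_fault, remove_fault,
                   steady_state_max=0.01, abort_threshold=0.05):
    """Returns 'confirmed', 'refuted', or 'aborted'."""
    # 1. Steady state: verify the baseline before touching anything.
    if read_error_rate() > steady_state_max:
        raise RuntimeError("system is not in a steady state; do not start")

    # 2. Hypothesis: the error rate stays within steady-state bounds
    #    even while the fault is active.
    inject_fault()                       # 3. Inject the instability.
    try:
        observed = read_error_rate()
        # Emergency plan: abort if the blast radius grows too large.
        if observed > abort_threshold:
            return "aborted"
        # 4. A small blast radius keeps the signal readable against baseline.
        return "confirmed" if observed <= steady_state_max else "refuted"
    finally:
        remove_fault()                   # Always roll the instability back.

# Toy harness: a simulated fault that bumps the observed error rate.
state = {"faulty": False}
outcome = run_experiment(
    read_error_rate=lambda: 0.03 if state["faulty"] else 0.002,
    inject_fault=lambda: state.update(faulty=True),
    remove_fault=lambda: state.update(faulty=False),
)
print(outcome)   # "refuted": the fault pushed errors past the steady state
```

A refuted hypothesis here is not a failed experiment; in Fermi’s terms, it is the discovery, and it is what feeds the action plan.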
With these objectives in mind, we’ve developed a Chaos Session Planning document to be filled in by every squad willing to assess the resiliency of their services. Currently, it should contain a description, along with a diagram, of the critical path being verified, from which we can evaluate possible stress points and dependencies to start the experiments with. The corresponding metrics and dashboards should be listed, together with an Emergency Plan defining metric thresholds beyond which the experiment is considered too disruptive and should be aborted.
Each experiment needs to be described in detail, and a hypothesis clearly stated. The resulting measurements are recorded against this hypothesis, asserting whether it was confirmed and whether the failures were detected by the proposed monitoring strategy. This document has proven to be an invaluable tool for helping squads come up with actionable plans to improve the reliability of their services, revealing unknown dependencies between different levels of the system, from code to external components.
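To make the shape of such a document concrete, here is a sketch of the fields described above as a plain Python structure. Every field name and value is illustrative, not Pismo’s actual template.

```python
# Illustrative Chaos Session Planning record; all names and values are made up.
chaos_session_plan = {
    "critical_path": "API gateway -> auth service -> ledger",
    "diagram": "link-to-architecture-diagram",
    "metrics": ["gateway_5xx_rate", "p99_latency_ms", "declined_transactions"],
    "dashboards": ["link-to-gateway-dashboard"],
    "emergency_plan": {
        # Abort the session if any of these thresholds is crossed.
        "abort_thresholds": {"gateway_5xx_rate": 0.05, "p99_latency_ms": 1500},
        "rollback": "remove all injected faults and page the on-call squad",
    },
    "experiments": [
        {
            "description": "add 300 ms latency between gateway and auth service",
            "hypothesis": "p99 stays under 800 ms and no requests are dropped",
            # Filled in after the session:
            "result": None,
            "hypothesis_confirmed": None,
            "detected_by_monitoring": None,
        }
    ],
}
```

Keeping the hypothesis and the post-session result side by side in the same record is what turns each session into the measurement-or-discovery loop described above.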
When designing experiments, we don’t aim for known behaviour. Instead, we try to develop novel ways to stress the system so hidden failures can emerge. We found out that modelling past incidents was a good way to inspect systemic weaknesses deeply without the pressures of incident responses. This helps build enough peace of mind for us to enjoy a good night’s sleep. And a Merry Christmas, when the time comes!