When it comes to chaos engineering, experimentation is the watchword for Mauricio Galdieri, Software Architect at Pismo.
Mauricio was recently featured on Break Things on Purpose, a podcast hosted by Gremlin – the software company that provides the eponymous chaos engineering tool – to talk about how this methodology brings reliability and resiliency to the Pismo financial services platform.
Mauricio has been working at Pismo for almost five years. He is one of the people in charge of developing new technologies and applying them to the Pismo platform. “We do lots of documentation and proofs of concept. We try new things and talk to people from other companies to learn how they’re solving problems that we also have,” he explains.
Innovation in financial services
Banks, some of Pismo’s most important clients, are usually conservative when adopting new technologies. According to Mauricio, security concerns are at the centre of this mindset.
“We’re bringing these new technologies to the table. We want to prove to them that it is possible to run a banking system a hundred per cent in the cloud while maintaining high security and compliance with the industry standards.”
To help Pismo offer the most reliable and resilient banking and payments solution, Mauricio’s team uses chaos engineering experiments to learn how the platform behaves when infrastructure fails. During these experiments, the engineering team makes some components fail on purpose to verify that the system can keep operating despite the failures.
Testing versus experimenting
Mauricio explains that the most crucial point about chaos engineering is that it is not about testing something but about experimenting with it. The distinction is subtle but significant.
“When you test, you’re testing for something you knew would happen. You have an idea of how it should behave. Chaos engineering is more about experimenting. It’s designed for the unknowns. It’s like a lab. You’re trying different stuff to see what happens. You have an idea of what should happen – we call this a hypothesis. But you’re not sure if that is how it will behave,” he says.
According to Mauricio, chaos experiments always bring valuable results. It doesn’t matter whether these results confirm the team’s expectations or not. Even if things don’t behave as expected, the engineers learn about the system and how to improve its resiliency.
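The difference between a test and an experiment can be illustrated with a minimal Python sketch. Everything in it is hypothetical and invented for illustration (the `Dependency` class, the `get_balance` toy service, the `run_experiment` helper); it is not Pismo's code and does not use Gremlin's tooling. It only shows the shape of the idea: state a hypothesis, make a component fail on purpose, observe what happens, and record the outcome whether or not it matches expectations.

```python
from dataclasses import dataclass
from typing import Callable


class DependencyDown(Exception):
    """Raised by a dependency that has been made to fail on purpose."""


@dataclass
class Dependency:
    """Hypothetical stand-in for a piece of infrastructure."""
    name: str
    healthy: bool = True

    def call(self) -> str:
        if not self.healthy:
            raise DependencyDown(self.name)
        return f"balance from {self.name}"


def get_balance(primary: Dependency, fallback: Dependency) -> str:
    """Toy service: read from the primary store, fall back if it is down."""
    try:
        return primary.call()
    except DependencyDown:
        return fallback.call()


@dataclass
class ExperimentResult:
    hypothesis: str
    confirmed: bool
    observation: str


def run_experiment(
    hypothesis: str,
    inject_fault: Callable[[], None],
    probe: Callable[[], str],
) -> ExperimentResult:
    """Inject a fault, probe the system, and record what was observed."""
    inject_fault()
    try:
        observation = probe()
        confirmed = True
    except Exception as exc:  # a refuted hypothesis is still a useful result
        observation = f"{type(exc).__name__}: {exc}"
        confirmed = False
    return ExperimentResult(hypothesis, confirmed, observation)


primary = Dependency("primary-db")
fallback = Dependency("replica-db")


def take_primary_down() -> None:
    primary.healthy = False


def take_fallback_down() -> None:
    fallback.healthy = False


# Experiment 1: only the primary store fails.
first = run_experiment(
    "Balance lookups keep working when the primary store is down",
    inject_fault=take_primary_down,
    probe=lambda: get_balance(primary, fallback),
)

# Experiment 2: the fallback fails as well (the primary is still down).
second = run_experiment(
    "Balance lookups keep working when both stores are down",
    inject_fault=take_fallback_down,
    probe=lambda: get_balance(primary, fallback),
)

for result in (first, second):
    print(result)
```

In this sketch the first hypothesis is confirmed and the second is refuted, and both outcomes say something about the system: the second one points at a gap that a real team might decide to close, or to accept.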
Chaos engineering at Pismo
About a year and a half ago, one of Pismo’s major cloud providers suffered an outage. The failure affected digital banks in Brazil and services like Slack, Datadog, and other cloud-based systems.
“We were caught off guard because we didn’t have many options to ensure the system wouldn’t go down if something bad happened. So, we started thinking about ways to experiment with those major outages and see how we could still operate at least partially,” he explains.
Since then, Mauricio and his team have been working on chaos engineering, researching tools, and looking for partnerships and ways to experiment with the Pismo platform. “We started working with chaos engineering trying to do one more thing to maximise our chances of success. We partnered up with Gremlin and came up with a way to bring structure to this process,” he says.
Running a chaos session
Mauricio says the team set up a couple of steps to make chaos experiments successful at Pismo. One is to produce a document describing what they are going to experiment with and why.
“What are the technologies used? Why did it fail in the first place? Are the failures I expect to see really failures, or just unknown behaviours? This is an opportunity to consider whether some business rules are correct, if we should make them more flexible or change that business logic.”
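One way to give such a document a consistent structure is to model it as a small data type. The sketch below is only an assumption about what a plan like this could contain, based on the questions Mauricio lists; the `ExperimentPlan` class, its field names, and the example values are hypothetical and do not describe Pismo's actual template.

```python
from dataclasses import dataclass, field

# Sections that must be filled in before a session is run (an assumption).
REQUIRED_SECTIONS = ("motivation", "technologies", "hypothesis")


@dataclass
class ExperimentPlan:
    """Hypothetical structure for a chaos experiment document."""
    title: str
    motivation: str                       # why the experiment is being run
    technologies: list[str]               # what is involved
    hypothesis: str                       # what the team expects to happen
    expected_failures: list[str] = field(default_factory=list)
    unknown_behaviours: list[str] = field(default_factory=list)
    business_rules_to_review: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of required sections that are still empty."""
        return [name for name in REQUIRED_SECTIONS if not getattr(self, name)]

    def render(self) -> str:
        """Render the plan as plain text for review before the session."""
        lines = [
            self.title,
            f"Why: {self.motivation}",
            f"Technologies: {', '.join(self.technologies)}",
            f"Hypothesis: {self.hypothesis}",
        ]
        for label, items in (
            ("Expected failures", self.expected_failures),
            ("Unknown behaviours", self.unknown_behaviours),
            ("Business rules to review", self.business_rules_to_review),
        ):
            for item in items:
                lines.append(f"{label}: {item}")
        return "\n".join(lines)


# Example values are invented for illustration only.
plan = ExperimentPlan(
    title="Cloud provider zone outage",
    motivation="A past provider outage caught the team off guard",
    technologies=["message queue", "database replica"],
    hypothesis="The platform keeps operating at least partially",
    unknown_behaviours=["How long do queued requests take to drain?"],
)

draft = ExperimentPlan(
    title="Unfinished plan",
    motivation="",
    technologies=[],
    hypothesis="",
)

print(plan.render())
print("Draft is missing:", draft.missing_sections())
```

Checking for empty sections before a session is a design choice in this sketch: it forces the "why" and the hypothesis to be written down before anything is made to fail.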
The way to resilience
Chaos engineering is one of the techniques we use to build a highly resilient computing platform. According to Mauricio, systems should ideally never work in unexpected ways: when talking about machines, every failure should be foreseeable. In practice, though, computer systems sometimes behave in ways we did not intend.
“Success should be measured as a performance variability. Success is the flip side of failure.
To turn this scale towards success, you should create an environment that enables it. This could mean, for example, improving tooling or changing the organisational culture – all you do in your company to maximise your chances of success.”