Elevating a system’s reliability involves several steps. Daniela Binatti, Co-Founder and CTO at Pismo, explored them in detail at the AWS Financial Services Cloud Symposium in São Paulo, Brazil.
In a talk titled “How Pismo elevated its platform resiliency on AWS and its importance to the finance industry,” she presented the topic with Paulo Aragão, Field CTO and Principal Solutions Architect at AWS.
The Pismo platform is a feature-rich set of building blocks allowing banks and fintech companies to develop various financial products and services quickly. It is currently used for core banking, card issuing, payment processing, lending, digital wallets, and asset management applications.
Since Pismo is a global company, its platform runs on a distributed infrastructure spread across several countries. Its microservices-based architecture makes it very flexible but requires intricate communication between APIs.
Moreover, operations are performed with ephemeral resources: “In the public cloud, resources are born and die all the time. The application has to be ready for this. And when a transaction fails, it is retried. The system needs to recognise retries, so it doesn’t repeat a transaction it has already processed,” says Daniela.
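One common way to recognise retries is to attach an idempotency key to each transaction and remember the result for that key. The sketch below is a minimal illustration under that assumption, not Pismo's implementation; the `Ledger` class, the `apply` method and the key format are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy in-memory ledger that applies each transaction at most once."""

    balance: int = 0
    # Results already produced, keyed by the caller-supplied idempotency key.
    _seen: dict = field(default_factory=dict)

    def apply(self, idempotency_key: str, amount: int) -> int:
        # A retry carries the same key, so return the stored result
        # instead of applying the amount a second time.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.balance += amount
        self._seen[idempotency_key] = self.balance
        return self.balance


ledger = Ledger()
ledger.apply("txn-001", 100)  # first attempt
ledger.apply("txn-001", 100)  # retry of the same transaction, not reapplied
ledger.apply("txn-002", 50)   # a different transaction
print(ledger.balance)
```

In a real system the record of seen keys would live in durable, shared storage rather than in process memory, since the process handling the retry may not be the one that handled the first attempt.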
All of this makes it challenging to guarantee the platform’s resiliency. Daniela told the audience that since its inception, Pismo has worked on five main aspects to improve reliability:
- Build a resilient and fault-tolerant cloud infrastructure
- Implement appropriate recovery strategies for failures
- Adopt several testing approaches
- Use chaos engineering to uncover weak spots and fix them
- Put in place a whole basket of security measures to keep the platform protected
1. Infrastructure as code
Pismo’s infrastructure, built on AWS, is 100% configured as code using the Terraform tool. “When we close a contract with a customer in a country where we do not yet operate, if AWS is present there, just a few clicks are enough to activate the necessary infrastructure and launch a new platform operation,” says Daniela.
Our infrastructure spans several regions and includes automatically replicated databases (Amazon Aurora and Amazon DynamoDB) with an active/active architecture. Application clusters are also spread across several regions.
Moreover, Pismo employs CI/CD (Continuous Integration/Continuous Delivery) and canary release techniques to ensure new software versions roll out smoothly across regions.
The result is that even if an entire region goes down, which is unlikely, the platform can continue to operate with little disturbance.
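As a rough illustration of the canary release idea mentioned above, the sketch below sends a small, stable share of requests to a new version and the rest to the current one. It is a generic example, not a description of Pismo's pipeline; `pick_version`, the request id format and the 10% share are assumptions made for the example.

```python
import hashlib


def pick_version(request_id: str, canary_percent: int) -> str:
    """Route a stable slice of requests to the canary version."""
    # Hash the request id so the same id always lands in the same bucket (0-99).
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "canary" if bucket < canary_percent else "stable"


requests = [f"req-{i}" for i in range(1000)]
canary_share = sum(pick_version(r, 10) == "canary" for r in requests) / len(requests)
print(f"canary share: {canary_share:.1%}")
```

Raising `canary_percent` step by step while watching error rates is the usual way to widen such a release; setting it back to 0 returns all traffic to the stable version.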
2. Recovery strategies
“This industry is heavily reliant on legacy systems. We use the ISO 8583 protocol and Visa and Mastercard standards. And many integrations still happen by transferring files,” says Daniela.
Batch processes are a familiar point of failure in traditional banking systems, and they make it difficult to monitor ongoing operations. A failure in these processes usually impacts many accounts. Pismo eliminates this weak spot by avoiding batch processing. “We created a batchless platform. We turn all incoming files into granular events.”
With an event-driven architecture, the Pismo platform performs granular real-time processing. Every operation is monitored for integrity and can be interrupted and rolled back if needed. In addition, the applications can perform a graceful shutdown in case of failure.
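The idea of turning an incoming file into granular events can be sketched as follows. This is a simplified illustration under assumed conventions: the comma-separated record layout, the `Event` type and the "no negative balance" rule are hypothetical and stand in for whatever the real files and business rules look like.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    line_no: int
    account_id: str
    amount: int


def file_to_events(lines):
    """Yield one event per record instead of handling the file as one batch."""
    for line_no, line in enumerate(lines, start=1):
        account_id, amount = line.strip().split(",")
        yield Event(line_no, account_id, int(amount))


def process(events, balances):
    """Apply events one by one; a bad event is rolled back and reported alone."""
    failed = []
    for event in events:
        previous = balances.get(event.account_id, 0)
        try:
            balances[event.account_id] = previous + event.amount
            if balances[event.account_id] < 0:
                raise ValueError("insufficient funds")
        except ValueError:
            balances[event.account_id] = previous  # roll back only this event
            failed.append(event.line_no)
    return failed


incoming_file = ["acc-1,100", "acc-2,-30", "acc-1,-40"]
balances = {}
failed = process(file_to_events(incoming_file), balances)
print(balances, failed)
```

Because each record is handled on its own, the failing second record affects only its own account, while the other two records in the same file are still applied.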
3. The time to test is every time
“We do 300 to 400 production deployments per month, but customers don’t even notice. And we have tests built into the entire CI/CD pipeline,” Daniela says.
These tests are automatic. Some of them are unit tests, which validate the logic of each unit and the integrity of its internal interfaces. Others are integration tests, which check functionality across all service use cases and guard against regressions.
Pismo uses the SonarQube and Checkmarx tools to manage the software development and testing pipelines. “These tools help us ensure that a new software version won’t bring new weaknesses to the platform.”
4. Let there be chaos
Chaos engineering is essential to maintaining high reliability in a complex system like the Pismo platform. We regularly perform chaos experiments to gain confidence in the system’s capacity to withstand turbulent conditions. We use the Gremlin tool to manage these experiments and automate them.
During the experiments, our engineers purposefully introduce failures in the infrastructure to assess and improve the system’s ability to cope with them. “We call these activities experiments, not tests. In a test, we have expectations about the results. In a chaos experiment, we don’t know what will happen.”
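A chaos experiment can be pictured, at a very small scale, as deliberately injecting failures into a dependency and measuring how the caller copes. The sketch below is a self-contained toy and does not use Gremlin or its API; `FlakyDependency`, `call_with_retries`, `run_experiment`, the retry count and the failure rates are all assumptions made for the example.

```python
import random


class FlakyDependency:
    """Wraps a call and fails it at a configurable rate to simulate an outage."""

    def __init__(self, failure_rate: float, seed: int = 42):
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so a run can be repeated

    def call(self) -> str:
        if self._rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return "ok"


def call_with_retries(dependency: FlakyDependency, attempts: int = 3) -> bool:
    """Return True if any attempt succeeds, False if all of them fail."""
    for _ in range(attempts):
        try:
            dependency.call()
            return True
        except ConnectionError:
            continue
    return False


def run_experiment(failure_rate: float, requests: int = 1000) -> float:
    """Measure the share of requests that succeed while failures are injected."""
    dependency = FlakyDependency(failure_rate)
    successes = sum(call_with_retries(dependency) for _ in range(requests))
    return successes / requests


print(f"success rate with 30% injected failures: {run_experiment(0.3):.1%}")
```

Real experiments target infrastructure (instances, networks, availability zones) rather than an in-process object, but the shape is the same: inject a fault, observe the outcome, and fix whatever weak spot it reveals.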
5. Security is vital
“Security is a topic that keeps me awake at night. An incident affecting availability is a terrible thing. But a security incident like a data leak can be catastrophic,” says Daniela.
“So we created a mindset that prioritises safety, with training that begins during the integration of a new employee. Everyone at Pismo knows that security vulnerabilities are the top priority. Fixing them is more important than any other task.”
Pismo has a “blue” security team, which continuously improves defences, while a “red” team attacks the platform to uncover vulnerabilities. It also embeds security in all phases of the software development life cycle. “We have an architecture security team working closely with the product management area. They ensure that a new product or feature will be secure from inception.”