Chaos engineering is the intentional, controlled injection of failures into a production or pre-production environment to understand their impact and to plan a better defence posture and incident management strategy. Every day brings a new opportunity for an organization’s critical applications or infrastructure to fail, potentially threatening its ability to deliver services to customers.
Netflix learned the value of chaos engineering first-hand as it moved from on-premises infrastructure to the cloud. In 2008, it experienced an outage that interrupted service delivery for three days. The outage predates Netflix’s transformation into a video streaming operation; had it happened afterwards, it would have been far more costly. As a result, Netflix decided to do everything possible to minimize disruptions and began to introduce chaos engineering into its workflows.
Failures have many possible causes, including security breaches, misconfigurations and service disruptions. As more applications and data are hosted in the cloud, the likelihood of errors and disruptions can rise, and the number of security issues can rise with it.
Netflix created Chaos Monkey, an open-source tool that triggers random incidents in IT services and infrastructure in order to identify weaknesses that can be fixed or handled through automatic recovery procedures. Netflix implemented Chaos Monkey when it moved from a private data centre to Amazon Web Services (AWS), in response to unreliability in the cloud.
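To make the idea of random incidents and automatic recovery concrete, here is a minimal simulation in Python. It is a sketch only and does not use or reproduce Chaos Monkey itself: the names InstancePool, auto_recover and unleash_monkey are hypothetical, and the "instances" are plain strings rather than real cloud resources.

```python
import random


class InstancePool:
    """Toy model of a service backed by several interchangeable instances."""

    def __init__(self, names, min_healthy):
        self.healthy = set(names)
        self.min_healthy = min_healthy
        self.replacements = 0

    def terminate(self, name):
        # Simulate an instance disappearing without warning.
        self.healthy.discard(name)

    def auto_recover(self):
        # Stand-in for an automatic recovery procedure (assumption):
        # launch replacements until the pool is back at its minimum size.
        while len(self.healthy) < self.min_healthy:
            self.replacements += 1
            self.healthy.add(f"replacement-{self.replacements}")


def unleash_monkey(pool, rounds, rng):
    """Terminate one randomly chosen healthy instance per round, then recover."""
    log = []
    for _ in range(rounds):
        victim = rng.choice(sorted(pool.healthy))
        pool.terminate(victim)
        # Did the service keep at least one instance while degraded?
        survived = len(pool.healthy) > 0
        pool.auto_recover()
        log.append((victim, survived))
    return log


pool = InstancePool(["web-1", "web-2", "web-3"], min_healthy=3)
log = unleash_monkey(pool, rounds=5, rng=random.Random(42))
for victim, survived in log:
    print(f"terminated {victim}; service still up: {survived}")
```

With three instances and one termination per round, the simulated service never drops to zero instances, so every round reports that it stayed up. Lowering the pool to a single instance makes the weakness visible immediately, which is the kind of finding such a tool is meant to surface.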
Many organizations now use Chaos Monkey and Gremlin to run their chaos engineering experiments. Chaos engineering is an important defence against infrastructure failures, outages, or missing components in an organization’s production environment.
Chaos engineering experiments follow a structured three-step process.
So why would any company break things on purpose? Exposing a system’s flaws is necessary to make it more robust. By identifying potential failure points and correcting them before they cause problems, you can proactively prevent outages and other disruptions. In addition, chaos engineering provides several customer, business, and technical benefits. The main benefit is that it allows companies to build stronger products that meet customer expectations and improve the bottom line.
Even a small issue in code can have a catastrophic effect on the overall production environment because of the dependencies between programs. For instance, an error in the transaction software system of a financial services firm can result in the loss of millions of dollars.
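The way a small fault spreads through dependencies can be illustrated with a short Python sketch that computes which services are affected when one component fails. The service map below is entirely hypothetical (the names checkout, payments, ledger and so on are made up for illustration), and a real system would derive this graph from its own service inventory.

```python
from collections import deque

# Hypothetical service map: each key depends on the services in its list.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["ledger"],
    "cart": ["catalog"],
    "reports": ["ledger"],
    "catalog": [],
    "ledger": [],
}


def blast_radius(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a service to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first search outward from the failed service.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


print(sorted(blast_radius("ledger", DEPENDS_ON)))
# prints ['checkout', 'payments', 'reports']
```

In this toy map a fault in the low-level ledger component reaches three other services, including checkout, which does not depend on it directly.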
Organizations might be unable to avoid all IT incidents, but they can minimize the damage by using chaos engineering to understand likely failure scenarios and the best possible responses to them.
Chaos engineering helps site reliability engineers (SREs) and other members of the DevOps team provide continuous delivery of services by avoiding significant disruptions. It helps them understand their vulnerabilities better and shows them how to minimize the impact if a disruption occurs. Chaos engineering is not a random process in which engineers terminate instances or services, or otherwise cause systems to fail, without any purpose. It is a process that identifies potential future issues, allowing engineering teams to solve problems proactively and avoid them in the live environment further down the road.
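One way to picture the difference between purposeful experiments and random breakage is the Python sketch below. It assumes a common framing that is not spelled out in this article: state a hypothesis about acceptable behaviour, inject one specific fault, and compare the result with a baseline. The simulated service, the 5% error threshold and every function name here are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    baseline_error_rate: float
    fault_error_rate: float
    hypothesis_held: bool


def call_service(rng, replica_down, retries):
    """Simulated request against two replicas.

    When one replica is down, each attempt has a 50% chance of
    reaching the dead replica and failing (an assumption of this toy model).
    """
    for _ in range(retries + 1):
        if not replica_down or rng.random() >= 0.5:
            return True
    return False


def error_rate(rng, replica_down, retries, requests=2000):
    failures = sum(
        not call_service(rng, replica_down, retries) for _ in range(requests)
    )
    return failures / requests


def run_experiment(retries, max_error_rate=0.05, seed=7):
    """Hypothesis: losing one replica keeps the error rate under max_error_rate."""
    rng = random.Random(seed)
    # Measure normal behaviour first.
    baseline = error_rate(rng, replica_down=False, retries=retries)
    # Inject the fault: take one replica down and measure again.
    under_fault = error_rate(rng, replica_down=True, retries=retries)
    # Compare the observation with the hypothesis.
    return ExperimentResult(baseline, under_fault, under_fault <= max_error_rate)


for retries in (0, 5):
    result = run_experiment(retries)
    print(f"retries={retries}: hypothesis held = {result.hypothesis_held}")
```

In this model a client with no retries fails roughly half of its requests when a replica is lost, so the hypothesis is rejected, while a client with five retries stays well under the threshold. The experiment has a stated purpose and a clear pass or fail outcome, which is what separates it from terminating things at random.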