3 key steps for running chaos engineering experiments


by Frank

Chaos engineering is the practice of running thoughtful, planned experiments that teach us how our systems behave in the face of failure. With dynamic cloud environments and the rise of microservices, the systems behind the web keep growing more complex, and so does our dependence on them. Mitigating failures and proactively preventing them is more important now than ever.

Even brief issues can hurt customer experience and impact a company’s bottom line. The cost of downtime is becoming a major KPI for engineering teams, and when there’s a major outage the cost can be devastating. In 2017, 98 percent of organizations surveyed by ITIC said a single hour of downtime would cost their business over $100,000. One major outage could cost a single company millions of dollars. The CEO of British Airways recently revealed that a technological failure that stranded tens of thousands of the airline’s passengers in May 2017 cost the company 80 million pounds ($102.19 million USD).

This is why companies that proactively prepare for these scenarios will be much better off than those that wait for the next incident. Below are three key steps for running effective chaos engineering experiments within your organization. Start with a single host, container, or microservice in your test environment. Then try to crash several of them. Once you’ve hit 100 percent of your test environment, reset to the smallest possible target in production and take it from there.
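The ramp-up described above (one target, then several, then all of test, then back to one in production) can be written down as a simple schedule. The following is a minimal sketch, not a prescribed method: the function names `blast_radius_schedule` and `rollout_plan`, the intermediate fractions of 25 and 50 percent, and the idea of reusing the same ramp in production are all assumptions made for illustration.

```python
import math


def blast_radius_schedule(total_targets, fractions=(0.25, 0.5, 1.0)):
    """Return how many targets to affect at each stage, starting with one.

    The fractions are illustrative assumptions; pick steps that suit
    your own environment.
    """
    if total_targets < 1:
        raise ValueError("need at least one target")
    sizes = [1]  # always begin with a single host, container, or service
    for fraction in fractions:
        count = max(1, math.ceil(total_targets * fraction))
        if count > sizes[-1]:  # skip stages that would not grow the radius
            sizes.append(count)
    return sizes


def rollout_plan(test_targets, prod_targets):
    """Ramp to 100 percent in test, then reset to one target in production."""
    plan = [("test", n) for n in blast_radius_schedule(test_targets)]
    plan += [("production", n) for n in blast_radius_schedule(prod_targets)]
    return plan


if __name__ == "__main__":
    for environment, count in rollout_plan(test_targets=8, prod_targets=20):
        print(f"{environment}: affect {count} target(s)")
```

Each stage would only begin once the previous one has completed without surprises; the schedule itself says nothing about how to judge that, which is left to the team.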

Chaos engineering step #1: Plan an experiment

One of the most powerful questions in chaos engineering is “What could go wrong?” Start by forming a hypothesis about how a system should behave when it comes under stress. By thinking about your services and environments upfront, you can better prioritize which scenarios are more likely (or more frightening) and should be tested first. By sitting down as a team and whiteboarding your services, dependencies, and data stores, you can also formulate some worst-case scenarios.
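One way to capture the outcome of such a whiteboarding session is a small backlog of scenarios, each with a hypothesis and a rough score for how likely and how frightening it is. The sketch below is an assumption-laden illustration: the `Scenario` class, the 1 to 5 scales, the likelihood-times-impact ranking, and the example scenarios are all hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    hypothesis: str   # how the system should behave under this stress
    likelihood: int   # team estimate, 1 (rare) to 5 (frequent)
    impact: int       # team estimate, 1 (minor) to 5 (frightening)

    @property
    def priority(self):
        # Simple assumed heuristic: likely and frightening scenarios rank first.
        return self.likelihood * self.impact


def prioritize(scenarios):
    """Return scenarios ordered from highest to lowest priority."""
    return sorted(scenarios, key=lambda s: s.priority, reverse=True)


if __name__ == "__main__":
    # Hypothetical examples of what a team might write on the whiteboard.
    backlog = [
        Scenario("cache node lost", "requests fall back to the database", 4, 2),
        Scenario("primary database down", "replica is promoted, no data loss", 2, 5),
        Scenario("dependency latency spike", "calls time out and degrade gracefully", 4, 4),
    ]
    for scenario in prioritize(backlog):
        print(f"{scenario.priority:>2}  {scenario.name}: {scenario.hypothesis}")
```

Writing the hypothesis next to each scenario keeps the experiment honest: the result either confirms the expected behavior or shows where the system falls short.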