Modern software-based services are implemented as large-scale, highly distributed systems running in the cloud or in data centers. Disruptive real-world events such as hardware failures or software bugs can create turbulent conditions in the environments where these systems run and can lead to unpredictable outcomes. Chaos Engineering is the study of a system’s ability to withstand such turbulent conditions. It works by purposefully injecting failures that mirror actual failure modes into the production environment and monitoring how the system recovers.
Chaos Engineering uses experimentation to study the effects of such disruptions. An experiment typically starts by defining the “steady state” of the system and identifying metrics that can be used to measure it. Events that mirror the failure modes possible in the production environment (the “Chaos”, e.g. a server crash) are then injected systematically into the system in a controlled environment.
The effect of the injected “Chaos” is observed by collecting and analyzing the metrics identified above. If the system recovers successfully, this builds confidence in its ability to handle an actual unplanned outage. If it fails to recover, that failure becomes a target for improvement before the behavior manifests in the system at large. Automating these chaos experiments makes it possible to identify such vulnerabilities on a continual basis.
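The experiment loop described above can be sketched in a few lines of Python. This is only an illustrative toy, not a real chaos tool: ToyService, its failover flag, the success_rate metric, and the 0.99 threshold are all hypothetical names and values chosen for the example. The "system" is a pool of replicas, the steady-state metric is the fraction of requests served successfully, and the injected failure is the crash of one replica.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyService:
    """Hypothetical stand-in for a real system: replicas behind a round-robin balancer."""
    replica_count: int = 3
    failover: bool = True  # assumption: requests may be retried on a healthy replica
    healthy: List[bool] = field(init=False)

    def __post_init__(self) -> None:
        self.healthy = [True] * self.replica_count

    def crash(self, index: int) -> None:
        """Inject the failure: mark one replica as crashed."""
        self.healthy[index] = False

    def success_rate(self, requests: int = 100) -> float:
        """Steady-state metric: fraction of requests that are served successfully."""
        ok = 0
        for i in range(requests):
            target = i % self.replica_count
            if self.healthy[target]:
                ok += 1
            elif self.failover and any(self.healthy):
                ok += 1  # request is rerouted to a surviving replica
        return ok / requests


def run_experiment(service: ToyService, threshold: float = 0.99) -> Dict[str, object]:
    """Measure steady state, inject a failure, then measure again."""
    baseline = service.success_rate()
    if baseline < threshold:
        raise RuntimeError("system is not in steady state; aborting experiment")
    service.crash(0)
    observed = service.success_rate()
    return {
        "baseline": baseline,
        "observed": observed,
        "recovered": observed >= threshold,
    }


if __name__ == "__main__":
    print(run_experiment(ToyService(failover=True)))
    print(run_experiment(ToyService(failover=False)))
```

In this sketch the service with failover stays at its steady state after the crash, while the one without failover drops below the threshold and would be flagged as a target for improvement. A real experiment would read the metric from the monitoring system rather than compute it in-process.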
This webinar goes into the details of what Chaos Engineering is, why it is important, and how to use it to build immunity in production systems.
It also emphasizes that extensive monitoring and logging are essential for Chaos Engineering to succeed in its goal of improving the resiliency of the system.