Navigating safely through these stormy waters will ensure greater confidence in and resilience of the whole system. Here are a few pointers.
Although it is not new in industrial and manufacturing settings, chaos engineering is a relatively new discipline in digital engineering. It involves experimenting with software in production to better understand faults and build confidence in the system’s overall capability to withstand turbulence.
While chaos engineering principles have gained traction in recent years, clients and engineers are often (understandably) apprehensive because of the misconception that chaos engineering is all about deliberately breaking things. The terminology doesn’t help either: words like “blast radius” and “random terminations”, and references to “chaos” or “storms” (Facebook’s name for it), don’t exactly soothe their concerns.
However, most engineers who have spent a significant amount of time unravelling problems that weren’t caught earlier appreciate the “Shift Left” approach: they value being able to test and fix bugs as early as possible in the digital lifecycle.
So, when an organization uncovers these issues earlier in the lifecycle, that must mean higher-quality software and fewer late nights fixing unforeseen problems, right? If only that were true.
With the rise of more complex software, IoT, cloud, distributed systems, and microservices, a new approach to quality and resilience is required to account for the many permutations and interdependencies between all the constituent parts. This is where