Evolution of Chaos Engineering: From Shift-Right to Run-Everywhere
When chaos engineering started gaining ground a few years back as an essential practice for ensuring reliability, it changed the mindset of traditional software practitioners by forcing them to think about “shifting right”, i.e., testing in production. This was never intended to replace traditional “shift-left” testing; rather, it grew from the need to learn how applications behave amidst the vagaries that only a production environment can offer. Chaos engineering, with its scientific, hypothesis-based approach and blast-radius principles to control “the extent of chaos”, was (and is) seen as the way to go about it. After all, testing in production is not only about fault injection. It helps unearth issues well beyond application misbehavior - around deployment practices, observability mechanisms, incident response, recovery procedures, and so on. As the SRE role has grown, chaos engineering has grown along with it as an integral function, accelerated in no small part by organizations like Netflix & Amazon publicly sharing their stories and technologies.
However, the emergence of the cloud-native paradigm and the ever-increasing adoption of Kubernetes have brought with them the challenge of re-architecting applications to be distributed, loosely coupled (read: microservices), and containerized. They have also brought new ways of dealing with “operational” aspects such as deployments and upgrades, storage (in the case of stateful components), maintaining HA, scaling, recovery, and so on. Kubernetes, as the overarching orchestration system, provides multiple choices and approaches for implementing each of these. All of this contributes to significant complexity - and, with it, a lot of apprehension.
Organizations that are consciously adopting cloud-native design principles and migrating to Kubernetes as their deployment platform of choice typically need to test different kinds of failure scenarios (including failures of the Kubernetes ecosystem components themselves) and learn deeply about their application behavior and deployment practices through hypothesizing and repeated experimentation. They often need to build this confidence before they hit production. In recent times, this has contributed to a “shift-left” approach in chaos engineering, with more and more organizations planning for chaos as part of their software delivery process. In other words, reliability verification as a practice and end goal - and, by extension, chaos experimentation - is no longer an SRE- or Ops-only responsibility. Developers are getting involved too. (Note that this doesn’t in any way undermine or replace testing in prod; that is still the ultimate goal of chaos engineering.)
How is Chaos as part of Continuous Delivery different from traditional Failure Testing?
One question that arises here is: how is this different from the failure testing one would expect QA teams (or developers wearing a QA hat) to perform? The differences are subtle, and how the chaos tooling is employed depends on the persona. Some prominent differences are:
- With chaos engineering, a lot of emphasis is placed on the “what”, i.e., service level objectives (SLOs), over the “how”, i.e., application functionality.
- Chaos is expected to be run against systems that mimic production (typically called “staging” environments). Kubernetes helps here: today it is the de facto development platform as much as it is the deployment platform of choice, which makes it easier to achieve some degree of similarity with prod (via the right mix of scale, simulated traffic, and, where permissible, datasets cloned from prod).
- The focus is more on observation and inference than on a pre-defined/strict “validation”. Having said this, the boundaries are a bit blurred, and “chaos efforts” typically end up being a mix of both, with a more nuanced practice developing as the organization matures in its journey.
One of the immediate impacts of this culture is the practice of integrating chaos into continuous delivery (CD) pipelines, with a dedicated stage for running experiments against a suitable pre-prod environment, and with the results/findings from the experiments deciding whether the change (i.e., the build artifact - often container images or a deployment/resource specification) is promoted to production. Gauging these findings typically involves consuming data from different service level indicators (often metrics from sources like Prometheus) and evaluating it against predefined SLOs.
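To make this concrete, here is a rough sketch of what such SLI definitions could look like when the indicators are derived from Prometheus. The service name, metric names, and thresholds are purely illustrative, not taken from any particular tool.

```yaml
# Hypothetical SLI definitions: each indicator maps a name to a Prometheus query.
# The "hello-service" label and the metric names are assumptions for illustration.
indicators:
  error_rate: |
    sum(rate(http_requests_total{service="hello-service", status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{service="hello-service"}[5m]))
  response_time_p95: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket{service="hello-service"}[5m])) by (le))
# A promotion gate could then assert, say, error_rate below 1% and
# response_time_p95 below 500ms for the duration of the chaos run.
```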
In this blog, we introduce a way of implementing the above, i.e., executing chaos in CD pipelines with the ability to “gate” your deployments to production based on checks against SLOs. We achieve this using Keptn, a cloud-native control plane for CD, together with LitmusChaos.
(You can read more about Litmus here)
Keptn: Cloud-Native Application Life-Cycle Orchestration
Keptn is an open-source, enterprise-grade, cloud-native application life-cycle orchestration tool. Keptn orchestrates continuous delivery and operations of your applications and was built to help create an “Autonomous Cloud”, which essentially means enabling your organization to become more autonomous when deploying and operating your apps & services on the new multi- and hybrid-cloud platforms - something the broader community terms “No-Ops”. It integrates with monitoring and observability platforms thanks to its event-driven architecture, with the communication achieved using CloudEvents (itself an incubating CNCF project).
A major use case of Keptn within Kubernetes is defining a pipeline with one or more stages for deployment, testing, and remediation strategies. A sequence in Keptn typically starts with the deployment of an application into one of the pre-prod namespaces (the images or spec for this deployment can be the artifact of a CI process, with Keptn integrating with popular CI frameworks like Jenkins or Travis CI). This is followed by stages that trigger tests and evaluate the operational characteristics of the application before subjecting it to a quality-gate evaluation, as part of which predefined SLOs (essentially rules created against SLIs derived from Prometheus, Dynatrace, or other sources) are validated. The success of this evaluation results in “promoting” the application to the next phase, say, deployment into production. Keptn also supports remediation based on observations in production, and offers the flexibility of being installed/used for a specific use case (CD, quality gating, etc.).
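As a rough illustration, such a sequence of stages is declared in Keptn’s shipyard file. The sketch below loosely follows the shipyard format current at the time of writing; the stage names and strategies are illustrative, so consult the Keptn docs for the exact schema of your version.

```yaml
# Illustrative shipyard: three stages, each with its own deployment/test strategy.
stages:
  - name: "dev"
    deployment_strategy: "direct"
    test_strategy: "functional"
  - name: "staging"
    deployment_strategy: "blue_green_service"
    test_strategy: "performance"      # the stage where load, chaos, and quality gates run
  - name: "production"
    deployment_strategy: "blue_green_service"
    remediation_strategy: "automated"
```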
At the core of Keptn’s philosophy are GitOps and code/config generation: it stores all of its configuration in an internal Git repository that can be connected with GitHub or other Git-based SCMs, and applies changes from that repository to your environment. All of Keptn’s stages and the artifacts powering the pipeline (from applications and core services to SLOs) are Git-controlled, and a lot of the complexity is abstracted away from users via simple CLI operations (APIs are provided too) that generate the config specs.
Litmus Service in Keptn
The Litmus world has been witnessing a steady increase in the number of “chaos left-shift” use cases. This also led us to share ways of introducing Litmus chaos experiments into GitLab & Spinnaker pipelines. As members of a community working on a framework that is (a) built from the ground up to be GitOps-friendly and (b) aligned with the “principles of chaos”, we were committed to finding a means of bringing the “hypothesis” & “SLO” elements into “pipeline-led” chaos. The Litmus Probes feature was one of the first results of this introspection; today it provides a way to define “expectations” of your infrastructure or application under chaos, with definitions ranging from service availability to metrics & Kubernetes resource states.
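For a flavor of what a probe looks like, below is a minimal sketch of an httpProbe declared within a ChaosEngine experiment spec, expecting the application endpoint to keep returning 200 throughout the chaos. The URL, timings, and names are assumptions for illustration, and field names may differ slightly across Litmus versions.

```yaml
# Illustrative probe stanza inside a ChaosEngine experiment spec (experiments[].spec.probe).
probe:
  - name: check-hello-service-url       # hypothetical probe name
    type: httpProbe
    mode: Continuous                    # evaluate repeatedly for the whole chaos duration
    httpProbe/inputs:
      url: http://hello-service.default.svc.cluster.local:8080/health   # assumed endpoint
      method:
        get:
          criteria: ==
          responseCode: "200"
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
```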
It was during this time that we learned about the Keptn project. We were immediately impressed by its capabilities, and especially by its well-defined outlook on “quality gating”. While probes are limited in scope to individual experiments, there was a need to evaluate SLOs at a broader level, across “periods” of chaotic activity - by which we mean the period in which applications are subjected to real-world load and real-world failures (one or more, run in sequence or in parallel). Keptn’s event-driven infrastructure provides just that. It evaluates SLOs (described by a simple, easy-to-understand configuration file) for the period during which the “test” - in this case, the real-world load - is run, while also allowing chaos injections to happen in the background to simulate the real-world failures. With all the control data maintained as CloudEvents, it is convenient to visualize & consume: the Keptn Bridge (dashboard) offers a useful view of how each event has played out & displays the evaluation results.
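As a sketch, a Keptn SLO file for such an evaluation could look like the following; the indicator names, thresholds, and scores are placeholders chosen for illustration.

```yaml
# Illustrative Keptn SLO file; thresholds and scores are placeholders.
spec_version: "1.0"
comparison:
  compare_with: "single_result"
  aggregate_function: avg
objectives:
  - sli: response_time_p95
    pass:
      - criteria:
          - "<=+10%"   # no more than 10% slower than the previous evaluation
          - "<600"     # and below an absolute threshold
    warning:
      - criteria:
          - "<=800"
  - sli: error_rate
    pass:
      - criteria:
          - "<1"
total_score:
  pass: "90%"
  warning: "75%"
```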
Another feature worth mentioning is that the Keptn control plane runs a single dedicated service for each integration, which acts on all pipelines being executed in the system. This brings a low-touch approach to managing pipelines & abstracts away the complexities of “integrating” new chaos tests.
Considering all this, and with excellent support from the Keptn team, we wrote a litmus-service that can now help inject “chaos” into any stage of a Keptn pipeline. Functionally, it acts on the “deployment finished” CloudEvent and triggers chaos against the app deployed in a pre-prod namespace, while, at the same time, out-of-the-box stress tools generate a user-defined load profile against it - leading to the insights and benefits discussed earlier.
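To give a feel for what gets triggered, here is a minimal sketch of a pod-delete ChaosEngine of the kind the litmus-service can apply once the deployment-finished event arrives. The application labels, namespace, service account, and durations are assumptions for illustration; the exact resource you supply depends on your setup and the litmus-service version.

```yaml
# Illustrative ChaosEngine for a pod-delete experiment against the freshly deployed app.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hello-service-chaos          # hypothetical name
spec:
  appinfo:
    appns: "hello-dev"               # namespace the pipeline deploys into (assumed)
    applabel: "app=hello-service"    # label of the app under test (assumed)
    appkind: "deployment"
  engineState: "active"
  chaosServiceAccount: litmus-admin  # assumed service account with chaos permissions
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"            # roughly match the load-test window (seconds)
            - name: CHAOS_INTERVAL
              value: "10"
```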
Trying out the Litmus Service with a Demo Pipeline on Keptn
Once we had the integration working, we showcased it to the community via a webinar, to share our learnings and encourage the community to try it out! During this presentation, we demonstrated a simple Keptn pipeline that attempts to verify the resiliency of a “hello-service” using Litmus.
You can find the presentation here: https://docs.google.com/presentation/d/1ZFEwXqFIkpicM5-aRkLinWh8AWjombkg1d0Y-Z1DC68/edit#slide=id.p1
In part 2 of this blog series, we will discuss how you can reproduce the demo steps, and along the way revisit & appreciate the concepts discussed in this article.
Stay tuned!!
Are you an SRE, a developer, or a Kubernetes enthusiast? Does chaos engineering excite you? Join our community on Slack for detailed discussions & regular updates on chaos engineering for Kubernetes.
Check out the LitmusChaos GitHub repo and do share your feedback. Submit a pull request if you identify any necessary changes.