LitmusChaos in 2021: The Year In Review

Karthik Satchitanand - Dec 29 '21 - Dev Community

Introduction

Year-end retrospectives are an interesting exercise. As much as they help us dwell on what was accomplished, they also generate excitement (and nervous energy) about what is to come. In this short post (there is much more to write on each of the topics referenced), I attempt to summarize the progress that LitmusChaos as a project, and the Litmus community as a whole, made over the past year. Before we get there, my heartfelt thanks to all our users and adopters, contributors, and the CNCF for helping us along on our journey. Your feedback, involvement, and mentorship are what keep us going.

I have tried to group some of the important milestones we achieved by their nature and, in the process, paint an overall picture of the project’s growth. These wouldn’t have been possible without valid criticism from, and healthy debates with, the community.

Features & Releases

LitmusChaos 2.0

At the end of last year, we had begun to play with the idea of a “Litmus Portal”, essentially a dashboard that would help orchestrate chaos and simplify the learning curve in preparing chaos custom resources (CRs) for a given scenario. Over time, the scope of the portal went beyond this initial requirement to become a full-blown control plane that can:

  • Help create complex scenarios using workflows
  • Manage chaos across target environments (other clusters, namespaces)
  • Delegate chaos operations to teams
  • Leverage dedicated chaos artifact sources (you could use git to load/commit workflows from/into the chaos center)

This culminated in a major release (2.0) in August, after nearly six months in beta, a phase during which we learned more about user expectations around such capabilities. We decided to bump up the version after observing the community's collective acceptance of, and alignment with, the newer way of approaching chaos compared to what was in use until then. That said, 2.0 is built on the same core, and users remain free to consume the chaos operator directly, as before.

Support for Containerd, CRI-O

As the Kubernetes community moved away from Docker as the container runtime, there was a growing need to support network, resource stress, and other experiments on other popular runtimes (the growing usage of Litmus in OpenShift environments was another trigger). Today, Litmus natively supports choosing the desired runtime as an experiment tunable.

Better Blast Radius Control

As chaos adoption grew last year, so did the use cases and requirements around making fault injection more granular, especially on Kubernetes. While randomized failures are at the core of traditional chaos engineering practice, that randomness now operates against a heavily filtered set of targets. Litmus added support for percentage-based selection of pods (among those satisfying the given namespace, workload label, and node label constraints), nodes, and cloud resources, in addition to the ability to pick out a chaos target by name.
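To make the idea of "filter first, then randomize" concrete, here is a minimal Python sketch. It is not Litmus code: the pod dictionaries, the `filter_pods` and `select_targets` helpers, and the rounding-up rule are all assumptions made purely for illustration.

```python
import math
import random

def filter_pods(pods, namespace, labels):
    """Keep only pods in the namespace that carry every requested label.

    `pods` is a hypothetical list of dicts, not a Kubernetes API object.
    """
    return [
        pod for pod in pods
        if pod["namespace"] == namespace
        and all(pod["labels"].get(key) == value for key, value in labels.items())
    ]

def select_targets(candidates, percentage, seed=None):
    """Randomly pick `percentage` percent of the candidates.

    Assumption for this sketch: the count is rounded up, so any
    non-zero percentage selects at least one candidate.
    """
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    count = math.ceil(len(candidates) * percentage / 100)
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(candidates, count)

# Hypothetical inventory: ten "cart" pods and five "search" pods.
pods = [
    {"name": f"cart-{i}", "namespace": "shop", "labels": {"app": "cart"}}
    for i in range(10)
] + [
    {"name": f"search-{i}", "namespace": "shop", "labels": {"app": "search"}}
    for i in range(5)
]

candidates = filter_pods(pods, namespace="shop", labels={"app": "cart"})
targets = select_targets(candidates, percentage=30, seed=7)
```

The randomness is confined to the filtered candidates, so the blast radius is bounded by both the constraints and the percentage.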

Improved Hypothesis Validation via richer Probe Schema

As automated chaos experiments became more mainstream, we received several enhancement requests around the basic probe functionality Litmus provided in earlier versions. Newer capabilities, and with them a richer schema, have been added to the HTTP, command, K8s, and Prometheus probes, with more in the works. Probes are used to perform custom validations within experiments, making the experiment verdicts more meaningful.
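As a rough illustration of how a probe can turn a hypothesis into a verdict, consider the Python sketch below. It does not reproduce the Litmus probe schema: the `Probe` class, the `CRITERIA` table, and the `verdict` function are hypothetical names, and the measurements are hard-coded stand-ins for real HTTP, command, or Prometheus queries.

```python
import operator
from dataclasses import dataclass
from typing import Callable

# Comparison criteria a probe may use (an illustrative set, not a Litmus list).
CRITERIA = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}

@dataclass
class Probe:
    name: str
    fetch: Callable[[], float]  # returns the observed value
    criteria: str               # key into CRITERIA
    expected: float

    def passed(self) -> bool:
        """Compare the observed value against the expectation."""
        return CRITERIA[self.criteria](self.fetch(), self.expected)

def verdict(probes) -> str:
    """The experiment passes only if every probe passes."""
    return "Pass" if all(probe.passed() for probe in probes) else "Fail"

# Stand-in measurements; a real probe would query a service or metrics store.
probes = [
    Probe("status-code", fetch=lambda: 200, criteria="==", expected=200),
    Probe("p99-latency-ms", fetch=lambda: 180.0, criteria="<=", expected=250.0),
]
result = verdict(probes)
```

Encoding the steady-state hypothesis as explicit comparisons is what lets the verdict reflect more than "the fault was injected without errors".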

Non-Kubernetes Experiments

Multiple organizations that adopted chaos engineering had mixed or hybrid environments, with services residing on different substrates: Kubernetes (self-hosted or managed/cloud), vanilla VMware VMs or cloud instances, and, in some cases, bare-metal servers. When their teams (which are increasingly Kubernetes-native/aware) picked Litmus, they asked for the same platform and a similar UX for running chaos against non-Kubernetes targets, so as to have a centralized view of chaos and resilience across different services. This resulted in an increased focus on experiments targeting VMs and disks on different infrastructure providers (VMware, Azure, GCP, AWS). This area is a work in progress and should see more features and experiments in the new year.

Community

At the heart of any open-source project is the community, and Litmus witnessed great community growth in 2021. We added 18 new adopters across cloud-native end users, vendors, solution providers, and other open-source projects. The end-of-month Saturday meetups saw more interest and attendance, and the Slack channel grew to nearly five times its size at this time last year. Contributions of all types (code, docs, tests, Helm charts) started coming in bigger numbers, leading to four new maintainers (including folks from the adopting organizations). We partnered with members of the user community to deliver talks featuring Litmus at KubeCon EU '21 and KubeCon NA '21. We were pleasantly surprised to see presentations about it from other chaos advocates and to see it featured in a keynote that discussed reliability!

The year also saw the birth of a CNCF initiative to further the field of chaos engineering in cloud-native environments: the Chaos Engineering Working Group, of which Litmus maintainers and community members are an important part.

Chaos Carnival

As the community grew, we realized it was also an indicator of the increasing adoption of chaos as a practice. Reliability is no longer an afterthought, and while Litmus plays an important role in this area, it isn’t the only one. We also felt the need to tap into the community to bring out more facets of chaos engineering and resilience, especially around culture, processes, and allied technical topics such as observability and security. We decided to bring people together in a common forum to discuss these topics, and thereby spread knowledge to equip newer entrants to chaos engineering and give practitioners more data points. Thus was born the Chaos Carnival. Organized largely by members of the Litmus community, with generous help from sponsors, the event brought together some of the best minds in the SRE, DevOps, and chaos engineering space to share their expertise.

Looking Ahead: What 2022 Brings

For starters, we are putting together the second edition of the Chaos Carnival with an awesome list of topics and speakers. We hope you will enjoy it!

Coming back to the project: increased support for diverse non-Kubernetes targets and simplified integrations with observability and CI/CD platforms remain a priority, not to mention more chaos types within Kubernetes. However, we also aim to place emphasis on security-related features, which are an important enterprise need for adoption (we got started last year by defining PSPs and Kyverno policies you could use with Litmus, apart from moving to more secure container images). Stay tuned to this space for more information on what these are!

Finally, we are working towards moving further up the ladder (beyond sandbox) within the CNCF ecosystem and making the community more vibrant, so that Litmus users can experience a richer chaoslib, better chaos orchestration and integrations, higher quality, faster patches and enhancements, and better support.

Thanks for reading this article!
