The previous post introduced LitmusChaos as a cloud-native chaos engineering framework that provides both native off-the-shelf chaos experiments and the ability to orchestrate experiments written in BYOC (bring-your-own-chaos) mode. You may also have tried your hand at this quick Litmus demo. Exciting as that already is, we have seen one more usage pattern evolve in the Litmus community: Chaos Workflows. Does this sound like word-play between two popular dev(git)ops practices? Let me explain in detail.
Is it sufficient to just inject a failure?
One of the common reasons for injecting chaos (or, as it is commonly known, running a chaos experiment) in a microservices environment is to validate one's hypothesis about system behavior during an unexpected failure. Today, this is a well-established practice with a multitude of chaos injection tools built for the container (read: Kubernetes) ecosystem, enabling SREs to verify resilience in pre-production and production environments.
However, when simulating real-world failures via chaos injection on development/staging environments as part of a left-shifted, continuous validation strategy, it is preferable to construct a potential failure sequence, or chaos workflow, rather than execute standalone chaos injection actions. Often, this translates into failures during a certain workload condition (such as, say, a percentage load), multiple (parallel) failures of dependent and independent services, failures on already-degraded infrastructure, etc. The observations and inferences from these exercises are invaluable in determining the overall resilience of the applications/microservices in question.
LitmusChaos + Argo = Chaos Workflows
While developers and SREs already practice this in some form, manually, via gamedays and similar methodologies, there is a need to automate it, thereby enabling repetition of these complex workflows with different variables (perhaps a product fix, a change to the deployment environment, etc.). One of the early adopters of the Litmus project, Intuit, used the container-native workflow engine Argo to execute their chaos experiments (in BYOC mode via chaostoolkit), orchestrated by LitmusChaos, to achieve precisely this. The community recognized this as an extremely useful pattern, thereby giving rise to Chaos Workflows.
Using Chaos Workflows as an aid for benchmark tests
In this blog, let's look at one use case of chaos workflows. We shall examine how chaos impacts an Nginx server's performance characteristics using a workflow that executes a standard benchmark job with a pod-kill chaos operation in parallel.
Prepare the Chaos Environment
In the next few sections, we shall lay the base for executing this workflow by setting up the infrastructure components.
Install Argo Workflow Infrastructure
The Argo workflow infrastructure consists of the Argo workflow CRDs, the Workflow Controller, the associated RBAC, and the Argo CLI. The steps below install Argo in the standard cluster-wide mode, where the workflow controller operates on all namespaces. Ensure that you have the permissions needed to create these resources.
Create argo namespace
root@demo:~/chaos-workflows# kubectl create ns argo
namespace/argo created
Create the CRDs, workflow controller deployment with associated RBAC
root@demo:~/chaos-workflows# kubectl apply -f https://raw.githubusercontent.com/argoproj/argo/stable/manifests/install.yaml -n argo
customresourcedefinition.apiextensions.k8s.io/clusterworkflowtemplates.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/cronworkflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtemplates.argoproj.io created
serviceaccount/argo created
serviceaccount/argo-server created
role.rbac.authorization.k8s.io/argo-role created
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-admin configured
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-edit configured
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-view configured
clusterrole.rbac.authorization.k8s.io/argo-cluster-role configured
clusterrole.rbac.authorization.k8s.io/argo-server-cluster-role configured
rolebinding.rbac.authorization.k8s.io/argo-binding created
clusterrolebinding.rbac.authorization.k8s.io/argo-binding unchanged
clusterrolebinding.rbac.authorization.k8s.io/argo-server-binding unchanged
configmap/workflow-controller-configmap created
service/argo-server created
service/workflow-controller-metrics created
deployment.apps/argo-server created
deployment.apps/workflow-controller created
Install the argo CLI on the test harness machine (where the kubeconfig is available)
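If the argo CLI is not already present on the harness, it can be fetched from the Argo releases page. The version below is an assumption; pick the release that matches your workflow controller:

```shell
# Download the argo CLI binary (v2.8.1 is an assumed version; choose the
# release matching your workflow-controller from the Argo releases page)
curl -sLO https://github.com/argoproj/argo/releases/download/v2.8.1/argo-linux-amd64
chmod +x argo-linux-amd64
mv argo-linux-amd64 /usr/local/bin/argo

# Verify that the CLI can reach the cluster
argo version
```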
Create the service account and associated RBAC, which the Argo workflow controller uses to execute the actions specified in the workflow. In our case, this corresponds to launching the Nginx benchmark job and creating the ChaosEngine that triggers the pod-delete chaos action. In our example, we place it in the namespace where the Litmus chaos resources reside, i.e., litmus.
root@demo:~# kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-workflows/master/Argo/argo-access.yaml -n litmus
serviceaccount/argo-chaos created
clusterrole.rbac.authorization.k8s.io/chaos-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/chaos-cluster-role-binding created
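For reference, the manifest applied above roughly resembles the sketch below. The exact rules in the upstream manifest may differ; the ones shown are illustrative of what the workflow needs (creating the benchmark Job and the ChaosEngine, and reading pod status/logs):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: argo-chaos
  namespace: litmus
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: chaos-cluster-role
rules:
  # Illustrative rules: create the benchmark Job and the ChaosEngine CR,
  # and observe pods/logs; the real manifest may scope these differently
  - apiGroups: ["", "batch", "litmuschaos.io"]
    resources: ["pods", "pods/log", "jobs", "chaosengines"]
    verbs: ["create", "get", "list", "watch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: chaos-cluster-role-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: chaos-cluster-role
subjects:
  - kind: ServiceAccount
    name: argo-chaos
    namespace: litmus
```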
Nginx traffic characteristics during a non-chaotic benchmark run
Before proceeding with the chaos workflow, let us first look at how the benchmark run performs under normal circumstances and note the properties of interest.
To achieve this, run a simple Kubernetes job that internally executes an apache-bench test against the Nginx service, with a standard input of 10,000,000 requests over a 300s period.
root@demo:~# kubectl create -f https://raw.githubusercontent.com/litmuschaos/chaos-workflows/master/App/nginx-bench.yaml
job.batch/nginx-bench-c9m42 created
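For reference, the benchmark job is roughly equivalent to a Job like the following sketch. The image name is an assumption (any image bundling apache-bench works); the ab parameters mirror those seen in the job logs:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: nginx-bench-
spec:
  template:
    spec:
      containers:
        - name: ab
          # Assumed image; httpd ships the ab binary
          image: httpd:2.4
          command: ["ab"]
          # 10 concurrent clients, capped at 300s, up to 10M requests
          args: ["-r", "-c", "10", "-t", "300", "-n", "10000000",
                 "http://nginx.default.svc.cluster.local:80/"]
      restartPolicy: Never
```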
Observe the output after the 5-minute duration and note the failed request count. Usually, it is 0, i.e., there was no disruption in Nginx traffic.
root@demo:~# kubectl logs -f nginx-bench-zq689-6mnrm
2020/06/23 01:42:29 Running: ab -r -c10 -t300 -n 10000000 http://nginx.default.svc.cluster.local:80/
2020/06/23 01:47:35 This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking nginx.default.svc.cluster.local (be patient)
Finished 808584 requests
Server Software: nginx/1.19.0
Server Hostname: nginx.default.svc.cluster.local
Server Port: 80
Document Path: /
Document Length: 612 bytes
Concurrency Level: 10
Time taken for tests: 300.001 seconds
Complete requests: 808584
Failed requests: 0
Total transferred: 683259395 bytes
HTML transferred: 494857692 bytes
Requests per second: 2695.27 [#/sec] (mean)
Time per request: 3.710 [ms] (mean)
Time per request: 0.371 [ms] (mean, across all concurrent requests)
Transfer rate: 2224.14 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 1 0.7 0 25
Processing: 0 3 2.0 3 28
Waiting: 0 3 1.9 2 28
Total: 0 4 2.2 3 33
WARNING: The median and mean for the initial connection time are not within a normal deviation
These results are probably not that reliable.
Percentage of the requests served within a certain time (ms)
50% 3
66% 4
75% 5
80% 5
90% 7
95% 8
98% 9
99% 11
100% 33 (longest request)
Formulating a Hypothesis
Typically, in most production deployments, the Nginx service is set up to guarantee specific SLAs in terms of tolerated errors. While the server may perform as expected under normal circumstances, it is also necessary to gauge how much degradation occurs at different levels of failure and what the cascading impact may be on other applications. The results obtained by inducing chaos can suggest how best to manage the deployment (improved high-availability configuration, allocated resources, replica counts, etc.) to continue meeting the SLA despite a certain degree of failure. While that is an interesting topic for another day, we shall restrict the scope of this blog to demonstrating how workflows can be used!
In the next step, we shall execute a chaos workflow that runs the same benchmark job while a random pod-delete (Nginx replica failure) occurs, and observe the degradation in the attribute we noted: failed requests.
Create the Chaos Workflow
Applying the workflow manifest performs the following actions in parallel:
Starts an Nginx benchmark job for the specified duration (300s)
Triggers a random pod-kill of an Nginx replica by creating the ChaosEngine CR, and cleans up after chaos.
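A minimal sketch of such a workflow is shown below, assuming the argo-chaos service account created earlier. The inline manifests are elided here; the actual workflow manifest in the litmuschaos/chaos-workflows repository carries the complete Job and ChaosEngine definitions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: nginx-chaos-
  namespace: litmus
spec:
  serviceAccountName: argo-chaos
  entrypoint: chaos-benchmark
  templates:
    - name: chaos-benchmark
      steps:
        # A single step group => both templates execute in parallel
        - - name: run-benchmark
            template: nginx-bench
          - name: run-chaos
            template: pod-delete-chaos
    - name: nginx-bench
      resource:
        action: create
        # Inline Job manifest launching the ab benchmark (elided)
        manifest: |
          apiVersion: batch/v1
          kind: Job
          ...
    - name: pod-delete-chaos
      resource:
        action: create
        # Inline ChaosEngine triggering pod-delete on Nginx (elided)
        manifest: |
          apiVersion: litmuschaos.io/v1alpha1
          kind: ChaosEngine
          ...
```

The workflow manifest is submitted with the argo CLI, e.g. `argo submit <workflow-manifest>.yaml -n litmus`.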
You can visualize the progress of the chaos workflow via the Argo UI. Convert the argo-server service to type NodePort and view the dashboard at https://&lt;node-ip&gt;:&lt;nodeport&gt;.
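One way to do this, assuming you are comfortable patching the service in place:

```shell
# Expose the Argo UI by switching the argo-server service to NodePort
kubectl patch svc argo-server -n argo -p '{"spec": {"type": "NodePort"}}'

# Note the node port assigned to the service
kubectl get svc argo-server -n argo
```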
Observing the Nginx benchmark results over 300s with a single random pod kill shows an increased count of failed requests.
root@demo:~# kubectl logs -f nginx-bench-7pnvv
2020/06/23 07:00:34 Running: ab -r -c10 -t300 -n 10000000 http://nginx.default.svc.cluster.local:80/
2020/06/23 07:05:37 This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking nginx.default.svc.cluster.local (be patient)
Finished 802719 requests
Server Software: nginx/1.19.0
Server Hostname: nginx.default.svc.cluster.local
Server Port: 80
Document Path: /
Document Length: 612 bytes
Concurrency Level: 10
Time taken for tests: 300.000 seconds
Complete requests: 802719
Failed requests: 866
(Connect: 0, Receive: 289, Length: 289, Exceptions: 288)
Total transferred: 678053350 bytes
HTML transferred: 491087160 bytes
Requests per second: 2675.73 [#/sec] (mean)
Time per request: 3.737 [ms] (mean)
Time per request: 0.374 [ms] (mean, across all concurrent requests)
Transfer rate: 2207.20 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 11.3 0 3044
Processing: 0 3 57.2 3 16198
Waiting: 0 3 54.2 2 16198
Total: 0 4 58.3 3 16199
Percentage of the requests served within a certain time (ms)
50% 3
66% 4
75% 4
80% 5
90% 6
95% 7
98% 9
99% 11
100% 16199 (longest request)
Further iterations of these tests with increased pod-kill instances over the benchmark period or an increased kill count (i.e., the number of replicas killed at a time) can give more insights into the behavior of the service, in turn leading us to mitigation procedures.
Note: To test with different variables, edit the ChaosEngine spec in the workflow manifest before re-submission.
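A sketch of the tunables in a pod-delete ChaosEngine spec, assuming the Nginx deployment carries the label app=nginx (adjust appinfo and the service account to match your environment):

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: litmus
spec:
  appinfo:
    appns: default
    applabel: app=nginx   # assumed label; match your Nginx deployment
    appkind: deployment
  chaosServiceAccount: litmus
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            # Run chaos for 60s, killing a pod every 15s
            - name: TOTAL_CHAOS_DURATION
              value: "60"
            - name: CHAOS_INTERVAL
              value: "15"
            # Graceful (false) vs forced (true) pod deletion
            - name: FORCE
              value: "false"
            # Number of replicas killed at a time
            - name: KILL_COUNT
              value: "1"
```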
Conclusion
You can use Argo with LitmusChaos to construct complex chaos workflows with built-in pre-conditioning and dependencies. The parallel nature of execution can help you simulate multi-service/component failures to verify application behavior under worst-case scenarios. You can even weave in recovery procedures based on error conditions.
Do try this out and let us know what kind of workflows you would like to see built within Litmus!
Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you?
Join Our Community On Slack For Detailed Discussion, Feedback & Regular Updates On Chaos Engineering For Kubernetes: https://kubernetes.slack.com/messages/CNXNB0ZTN
(#litmus channel on the Kubernetes workspace)
Check out the Litmus Chaos GitHub repo and do share your feedback: https://github.com/litmuschaos/litmus
Submit a pull request if you identify any necessary changes.
Litmus helps SREs and developers practice chaos engineering in a cloud-native way. Chaos experiments are published at the ChaosHub (https://hub.litmuschaos.io). Community notes are at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
LitmusChaos is an open source Chaos Engineering platform that enables teams to identify weaknesses and potential outages in infrastructures by inducing chaos tests in a controlled way. Developers and SREs can practice Chaos Engineering with LitmusChaos as it is easy to use, based on modern Chaos Engineering principles, and community-collaborated. It is 100% open source and a CNCF project.
LitmusChaos takes a cloud-native approach to create, manage, and monitor chaos. The platform itself runs as a set of microservices and uses Kubernetes custom resources (CRs) to define the chaos intent, as well as the steady-state hypothesis.
At a high level, Litmus comprises:
Chaos Control Plane: A centralized chaos management tool called chaos-center, which helps construct, schedule and visualize Litmus chaos workflows