LitmusChaos: Node Memory Hog Experiment

Udit Gaurav - Apr 9 '21 - - Dev Community

In this blog, I will be discussing the LitmusChaos experiment used to exhaust the available memory on a Kubernetes node, called Node Memory Hog. The main focus will be on the memory consumption and the monitoring techniques used for the experiment. Before moving forward, if you are new to Litmus and want to explore more about Litmus features or chaos engineering in general, I would recommend checking out the Litmus blog first. Litmus also provides a chaos operator, a large set of chaos experiments on its hub, detailed documentation, and a friendly community, so do check them out.

What you’ll need

For running this experiment you should have:

What it’ll break

The Node Memory Hog experiment causes memory resource exhaustion on a Kubernetes node, which may lead to eviction or unusual behavior of the applications running on the cluster, depending upon the extent to which we exhaust the memory of the application node. The experiment uses the stress-ng tool to inject memory chaos.
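Under the hood, the experiment launches stress-ng virtual-memory stressors on the target node. As a rough illustration only (the experiment builds the actual flags for you, so this is not the exact command Litmus runs), an equivalent invocation would look something like this:

# illustrative only: hog ~50% of the node's memory with one VM stressor for 120 seconds
stress-ng --vm 1 --vm-bytes 50% --vm-keep --timeout 120s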

The extent to which we want to consume the node's memory resource can be provided in two modes (from Litmus 1.12.2):

  • As a percentage of the total memory capacity of the node.
  • In mebibytes (MiB) of available memory to consume.

I'll demonstrate the experiment in both modes in upcoming sections. In both cases, the experiment consumes a set amount of memory, or as much as is available (whichever is lower), and holds onto it for the duration of the chaos.

Get Ready To Induce Memory Chaos On Kubernetes Node

Infrastructure

  • I'll be using a three-node GKE cluster of type e2-standard-2 (having 2 vCPU and 8 GB Memory).
root@cloudshell:$ kubectl get nodes
NAME                                       STATUS   ROLES    AGE    VERSION
gke-cluster-1-default-pool-3020340d-3vns   Ready    <none>   111s   v1.17.14-gke.400
gke-cluster-1-default-pool-3020340d-jfg6   Ready    <none>   111s   v1.17.14-gke.400
gke-cluster-1-default-pool-3020340d-k0lv   Ready    <none>   111s   v1.17.14-gke.400

Monitor the Node Memory resource

We can monitor the node memory resource consumption:

  • By SSHing into the target node and running htop (if it is not available, install it first and run it), OR

  • By creating a pod on the target node with a container that has htop available for monitoring.

htop is an interactive system-monitor and process-viewer tool for Unix.
We will use the second way to monitor the host memory resource. With this approach we won't see which host processes are eating the memory, but we can keep a watch on the current (consumed/total) memory. Follow these steps to set up the monitoring pod:

Create the monitoring file:

vim myhtop.yaml

Add the following to myhtop.yaml:

---
apiVersion: v1
kind: Pod
metadata:
  name: myhtop
spec:
  containers:
  - name: myhtop
    image: litmuschaos/go-runner:ci
    imagePullPolicy: Always
    command: ['sh', '-c', 'sleep 3600']
  ## Replace this with the target node name...
  nodeName: gke-cluster-1-default-pool-3020340d-3vns

Create the myhtop pod:

root@cloudshell:$ kubectl apply -f myhtop.yaml
pod/myhtop created


NOTE: To monitor more than one node you can use a DaemonSet or a Deployment; a minimal DaemonSet sketch is shown below.
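This sketch simply runs the same go-runner container on every node (the name and labels are placeholders of my own choosing):

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: myhtop-ds
spec:
  selector:
    matchLabels:
      app: myhtop
  template:
    metadata:
      labels:
        app: myhtop
    spec:
      containers:
      - name: myhtop
        image: litmuschaos/go-runner:ci
        imagePullPolicy: Always
        command: ['sh', '-c', 'sleep 3600']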

Now exec into the pod and run the htop command:

kubectl exec -it myhtop -- /bin/sh
/litmus $ htop

The output will look like this:

Screenshot from 2021-01-29 20-10-43

The memory row in the output shows that the node has 7.77G of memory in total, out of which 502M is in use.

Deploy an Nginx application as AUT

We'll now deploy an Nginx application as the Application Under Test (AUT).

kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-ci-lib/master/app/nginx.yml

This will deploy an Nginx application with three replicas. Deploying an application is optional for running a node memory experiment; we do it to check the application's behavior (its availability) under chaos. A general observation after deploying the application is a slight increase in memory usage from 502M (to 518M), as one replica of the application pod was scheduled on the target node. For a heavier application, the increase will definitely be more.

Screenshot from 2021-01-29 20-56-37

Setup Litmus and run chaos

Install Litmus to run the experiment, if it is not already installed:

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-latest.yaml

For running chaos, follow the instructions from the experiment docs and create the CRs using the hub. This includes the creation of the RBAC, the experiment, and the engine (with the appinfo and memory consumption values). When all the resources are created and the experiment starts executing, in the traditional Litmus way a runner pod is created, which in turn creates the experiment pod for chaos execution.
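For reference, a ChaosEngine for this experiment looks roughly like the one below. This is a minimal sketch based on the Litmus 1.x docs; adjust the namespace, service account name, and env values to your setup:

---
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  engineState: 'active'
  annotationCheck: 'false'
  appinfo:
    appns: 'default'
    applabel: 'app=nginx'
    appkind: 'deployment'
  chaosServiceAccount: node-memory-hog-sa
  experiments:
  - name: node-memory-hog
    spec:
      components:
        env:
        # total duration of the chaos in seconds
        - name: TOTAL_CHAOS_DURATION
          value: '120'
        # percentage mode: consume 50% of the node's total memory
        - name: MEMORY_CONSUMPTION_PERCENTAGE
          value: '50'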

root@cloudshell:$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
myhtop                         1/1     Running   1          61m
nginx-756f79d98d-47d5f         1/1     Running   0          47m
nginx-756f79d98d-q48dv         1/1     Running   0          47m
nginx-756f79d98d-sqrfz         1/1     Running   0          47m
nginx-chaos-runner             1/1     Running   0          31s
node-memory-hog-47m4k0-pz9tx   1/1     Running   0          27s
node-memory-hog-bdvnde         1/1     Running   0          12s

Run and Monitor Node Memory Hog Chaos

As we discussed in the previous sections, we can run the node memory hog experiment in two different modes.

In Percentage Mode

  • By using the percentage of the total memory resource capacity of the node. In this mode, we use the MEMORY_CONSUMPTION_PERCENTAGE env in the chaosengine to provide the input percentage, for example 50 (without the percent symbol) for 50% of the total node capacity, which is 7.77G. After running the chaos we can monitor the memory consumption from the htop terminal. The relevant env fragment is shown below.
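In the ChaosEngine, this mode corresponds to the following env fragment (the same values used in the sketch above):

        env:
        # consume 50% of the node's total memory capacity
        - name: MEMORY_CONSUMPTION_PERCENTAGE
          value: '50'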

Expectation: 50% of the total memory (7.77G), that is 3.885G, should be consumed.

Screenshot from 2021-01-29 21-11-30

Observation: We observe that the memory consumption fluctuates between 3.56G and 3.93G. It fluctuates more when compared with the mebibytes mode of execution.

In Mebibytes Mode

  • By using the mebibytes unit to consume available memory. In this mode, we use the MEMORY_CONSUMPTION_MEBIBYTES env in the chaosengine to provide the input in mebibytes, for example 3500 (without the unit) to consume nearly 3.5G of available node memory. After running the chaos we can monitor the memory consumption from the htop terminal. The relevant env fragment is shown below.
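In the ChaosEngine, this simply means swapping the env shown earlier for something like this (the value is an example):

        env:
        # consume roughly 3.5G (3500 MiB) of the node's available memory
        - name: MEMORY_CONSUMPTION_MEBIBYTES
          value: '3500'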

Expectation: 3.5G of available memory should be consumed, on top of the 518M that was already in use. So the total amount of memory in use should be nearly 4.0G.

Screenshot from 2021-01-29 21-47-10

Observation: From the above output we can observe that the memory consumed is 3.94G, which is close to 4.0G. This mode of execution fluctuates less and is more accurate than the percentage mode.

Analysing the Memory Spike on the Node

Now let us try to analyze the memory spike on the target node. The experiment grabs memory on the node, simulating a massive memory burst, and creates extreme memory demand so that we can see how the platform identifies and rectifies such an issue. For analyzing the experiment's behavior, let's plot a memory hog graph that shows different parameters such as active memory bytes, free and available memory bytes, CPU usage, and disk read time.

Screenshot from 2021-02-22 03-41-55
In the above graph, the highlighted area marks the period under chaos. We can note that the moment we start the experiment, the free and available memory bytes spike, which represents that a larger amount of the available or free bytes is consumed during the chaos and a very limited amount of memory is left to use. Similarly, the active memory and average free memory, along with the cache memory, drop, which indicates the limited memory left on the node.

Now let us try to max out the memory consumption on the Kubernetes node and analyze the real memory consumption using the htop, free -m, and kubectl top (metrics-server) commands.

We'll re-run the experiment to check the upper limit of memory consumption by changing the MEMORY_CONSUMPTION_PERCENTAGE value in the chaosengine to consume 100% of memory (that is, 7.77G). To re-run the experiment we just need to create the chaosengine again, this time with the new inputs, and check the output using kubectl top (metrics-server), htop, and free -m.
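The checks themselves are just the following commands (myhtop being the monitoring pod created earlier):

# node-level usage as reported by the metrics-server
kubectl top nodes

# interactive process/memory view from inside the monitoring pod
kubectl exec -it myhtop -- htop

# memory report in megabytes from inside the monitoring pod
kubectl exec -it myhtop -- free -m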

kubectl top Vs htop Vs free -m

kubectl top (metrics-server):

The kubectl top command returns the current CPU and memory usage for a cluster's pods or nodes. kubectl top pod uses the memory working set, so its output will be similar to the value of the metric "container_memory_working_set_bytes" in Prometheus.
If we run this query in Prometheus:

container_memory_working_set_bytes{pod_name=~"<pod-name>", container_name=~"<container-name>", container_name!="POD"}

you will get a value in bytes that almost matches the output of kubectl top pod.

From the latest experiment run with 100% memory consumption, we observe that the memory usage reported by kubectl top reached a maximum of about 92% and fluctuated between 70% and 90%.

Screenshot from 2021-04-09 16-26-38
Screenshot from 2021-04-09 16-27-02

htop:

htop is an alternative to the top command that is much easier to use for everyday tasks. It is a cross-platform interactive process viewer.

Screenshot from 2021-04-09 16-26-20
Screenshot from 2021-04-09 16-26-42

The memory consumption rises to about 6G.

free -m:

free is another useful command in Linux to get a detailed report on the system's memory usage. When used without any options, the free command displays information about memory and swap in kibibytes (1 kibibyte (KiB) is 1024 bytes). To get the values in megabytes we need to use free -m.

Screenshot from 2021-04-09 16-26-35

Observation:

  • The experiment will not consume/hog more memory than the total memory available on the node; in other words, there will always be an upper limit on the amount of memory to be consumed, equal to the total available memory.

  • There is more fluctuation in memory consumption when we provide the amount of memory to be consumed as a percentage. Providing the memory in mebibytes fluctuates less, but we need to make sure we do not exceed the available memory, otherwise the experiment helper pod will get evicted.

What happens when both modes are provided

There are cases when both ENVs are provided, or neither of them is provided. Priority is then given according to the following table:

MEMORY_CONSUMPTION_PERCENTAGE | MEMORY_CONSUMPTION_MEBIBYTES | Priority
--- | --- | ---
Defined | Not Defined | Percentage
Defined | Defined | Percentage
Not Defined | Defined | Memory in Mebibytes
Not Defined | Not Defined | Default of 30%

WARNING: When a large amount of the available memory is consumed, such as 95% or 100%, some services may stop working for a certain duration and the node may crash and go into a NotReady state.

Conclusion

You have installed Litmus on your Kubernetes cluster and learned to execute the LitmusChaos node memory hog experiment in different modes and with different values. You have also configured a monitoring pod on a node using the htop Unix command, and you have seen the different cases of node memory hog execution and their validation. If you haven't tried the experiment yet, this is the best time to start, and don't forget to give your opinion and share your feedback/experience with Litmus in the comments below.


Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you?
Join Our Community On Slack For Detailed Discussion, Feedback & Regular Updates On Chaos Engineering For Kubernetes: https://kubernetes.slack.com/messages/CNXNB0ZTN
(#litmus channel on the Kubernetes workspace)
Check out the Litmus Chaos GitHub repo and do share your feedback: https://github.com/litmuschaos/litmus
Submit a pull request if you identify any necessary changes.
