In the old days, things were simpler. Computer systems ran in the server room of the office building, the server was (hopefully) carefully sized, Unix operating systems ran processes, and when a server was overloaded, processes waited in the runqueue.
In today's world, this is different. Computer systems run in the cloud, sizing mostly means you can scale in the cloud, and the Linux operating system runs processes and (lots of) threads.
Mind the absence of 'runqueue'. Linux does not expose a separate state for waiting in a runqueue. Instead, when a process or a thread needs to run, it is set to running, after which it is the job of the task scheduler (I say 'task scheduler' because there is also an IO scheduler) to assign the process or thread a CPU time slice (sometimes called a 'quantum'). So both waiting to run and actually running are covered by the single process state 'running'.
This leads to a lot of confusion and misunderstanding, because it is not clear whether a system is just busy or really overloaded.
CPU usage?
You might think: but how about just looking at the amount of CPU used? The CPU time from /proc, used by lots of tools, shows how time is spent from the point of view of what the operating system considers to be a CPU. It says nothing about anything other than what is running on CPU, such as processes or threads waiting to run.
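To make this concrete, here is a minimal sketch of how a tool could turn two samples of the aggregate 'cpu' line of /proc/stat into percentages. The sample values are made up, and the function names are mine, not taken from any existing tool. Note that nothing in this calculation knows how many tasks were waiting for a CPU: the percentages always add up to 100.

```python
# Field order of the aggregate "cpu" line in /proc/stat (values are clock ticks).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")

def parse_cpu_line(stat_text):
    """Return {mode: ticks} for the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return dict(zip(FIELDS, map(int, parts[1:])))
    raise ValueError("no aggregate cpu line found")

def cpu_percentages(before, after):
    """Percentage of elapsed CPU time per mode between two samples."""
    delta = {mode: after[mode] - before[mode] for mode in before}
    # guest and guest_nice are already included in user and nice,
    # so leave them out of the total to avoid counting them twice.
    total = sum(v for mode, v in delta.items() if not mode.startswith("guest"))
    return {mode: 100.0 * v / total for mode, v in delta.items()}

# Two made-up samples of a 2-CPU machine, 5 seconds apart (100 ticks/second):
# one CPU fully busy in user mode, the other one idle.
sample_1 = "cpu  1000 0 200 50000 10 0 5 0 0 0\n"
sample_2 = "cpu  1500 0 200 50500 10 0 5 0 0 0\n"
pct = cpu_percentages(parse_cpu_line(sample_1), parse_cpu_line(sample_2))
print(pct["user"], pct["idle"])
```

On a real Linux machine the two samples would come from reading /proc/stat twice with a pause in between.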
load
Another thought might be: this is old news, we have the load figures for 1, 5 and 15 minutes. These do indeed tell more than CPU usage, which cannot tell anything beyond what the CPU spent its time on, but the load figure has serious problems.
First of all, it is an exponentially-damped sum of a five-second average. In layman's terms: if activity changes, the load figure starts moving towards the new level, so it takes time to reflect the actual change in activity, and it is never steady: it is always moving towards it.
Second, Linux load does NOT indicate CPU load only, but has become a more general, non-CPU-specific indicator of thread activity on a system, because it also counts tasks in uninterruptible sleep. (see: https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html)
So at best, the load figure can be used as a starting point, but it cannot be used to understand absolute CPU usage.
If you still have doubts about my statement about the load figure, take a look at the kernel source file kernel/sched/loadavg.c and see what the kernel maintainers' opinion of it is.
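The slowness caused by the damping can be illustrated with a small simulation. This is a floating-point sketch of an exponentially damped average that is updated every 5 seconds; the kernel itself uses fixed-point arithmetic, so treat this as an approximation of the idea and not as the kernel's code. The function names are my own.

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between load samples

def next_load(load, active_tasks, window_seconds):
    """One update step of an exponentially damped average."""
    decay = math.exp(-SAMPLE_INTERVAL / window_seconds)
    return load * decay + active_tasks * (1.0 - decay)

def simulate(active_tasks, seconds, window_seconds=60.0, start=0.0):
    """Load figure after a constant number of active tasks ran for `seconds`."""
    load = start
    for _ in range(int(seconds / SAMPLE_INTERVAL)):
        load = next_load(load, active_tasks, window_seconds)
    return load

# One task that is constantly active on a previously idle machine.
for elapsed in (5, 30, 60, 120, 300):
    print(f"{elapsed:>4}s  load1 = {simulate(1, elapsed):.2f}")
```

With one constantly active task, this 1-minute figure has only reached about 0.63 after a full minute, and it keeps creeping towards 1 for minutes after that.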
so are we lost?
At this point you might think: so you just told me that my common tools cannot tell me anything specific about how busy the CPU is? Yes: on Linux, with CPU time and the load figures as the common metrics, essential details are missing to understand absolute CPU load.
But luckily, there is a solution! Since Linux kernel version 2.6.9 (!), Linux exposes scheduler statistics. These statistics describe the activity of the scheduler, and some of them are really concrete: the total time running on CPU and the total time waiting to run (willing to run, but not yet on CPU)! However, there is a problem: none of the current standard tools use or show these statistics.
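As an illustration of what these statistics look like, here is a minimal sketch that parses the per-CPU lines of /proc/schedstat. It assumes the layout in which the last three counters of each 'cpuN' line are the time spent running (nanoseconds), the time spent waiting to run (nanoseconds) and the number of timeslices; the sample text is made up and the function name is mine.

```python
def parse_schedstat(text):
    """Return {cpu: (run_seconds, wait_seconds, timeslices)} from /proc/schedstat text."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-CPU lines look like "cpu0 <9 counters>"; this skips the version,
        # timestamp and domain lines. Assumption: counters 7, 8 and 9 are
        # run time (ns), wait time (ns) and timeslices.
        if len(parts) >= 10 and parts[0].startswith("cpu"):
            run_ns, wait_ns, slices = (int(v) for v in parts[7:10])
            result[parts[0]] = (run_ns / 1e9, wait_ns / 1e9, slices)
    return result

# Made-up sample in the shape of /proc/schedstat on a 2-CPU machine.
sample = """version 15
timestamp 4295000000
cpu0 0 0 0 0 0 0 2500000000 500000000 1200
domain0 3 0 0 0
cpu1 0 0 0 0 0 0 1500000000 250000000 800
"""
stats = parse_schedstat(sample)
total_run = sum(run for run, _, _ in stats.values())
total_wait = sum(wait for _, wait, _ in stats.values())
print(total_run, total_wait)
```

These are counters that only ever increase, so to see current activity you need to take two samples and look at the difference.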
solution
This begged for a solution. The company I work for, YugabyteDB, uses node_exporter to gather operating system statistics. Node_exporter is a very common utility for exposing operating system statistics.
I decided to create a tool that takes the scheduler statistics for CPU runtime and for time waiting to run from the node_exporter endpoint, and shows them along with the number of running and blocked processes, the CPU usage percentages and the load figures: nodetop.
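The principle can be sketched in a few lines of Python. This is not nodetop's actual code (nodetop is written in Rust), just an illustration of the idea: take two scrapes of the node_exporter text output, sum a counter over all CPUs, and divide the difference by the interval. I am assuming the metric names node_schedstat_running_seconds_total and node_schedstat_waiting_seconds_total, and lines without a trailing timestamp; the scrape contents are made up.

```python
def sum_metric(metrics_text, name):
    """Sum all samples of one metric (over all label sets) in Prometheus text format."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        # Assumption: each line is "<name>{labels} <value>" without a timestamp.
        series, value = line.rsplit(None, 1)
        if series.split("{", 1)[0] == name:
            total += float(value)
    return total

def per_second(before_text, after_text, name, interval_seconds):
    """Average increase of a counter per second between two scrapes."""
    return (sum_metric(after_text, name) - sum_metric(before_text, name)) / interval_seconds

# Two made-up scrapes of a 2-CPU machine, taken 5 seconds apart.
scrape_1 = """# TYPE node_schedstat_running_seconds_total counter
node_schedstat_running_seconds_total{cpu="0"} 100.0
node_schedstat_running_seconds_total{cpu="1"} 200.0
node_schedstat_waiting_seconds_total{cpu="0"} 10.0
node_schedstat_waiting_seconds_total{cpu="1"} 20.0
"""
scrape_2 = """# TYPE node_schedstat_running_seconds_total counter
node_schedstat_running_seconds_total{cpu="0"} 105.0
node_schedstat_running_seconds_total{cpu="1"} 205.0
node_schedstat_waiting_seconds_total{cpu="0"} 11.5
node_schedstat_waiting_seconds_total{cpu="1"} 21.0
"""
run_rate = per_second(scrape_1, scrape_2, "node_schedstat_running_seconds_total", 5)
wait_rate = per_second(scrape_1, scrape_2, "node_schedstat_waiting_seconds_total", 5)
print(run_rate, wait_rate)
```

In real use the two texts would come from an HTTP GET of the /metrics path of the node_exporter endpoint.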
If you want to try it:
- Install rust: https://www.rust-lang.org/tools/install
- You might need to install OS dependencies (gcc, openssl-devel).
- Clone the project:
git clone https://github.com/fritshoogland-yugabyte/nodetop.git
- Build the utility:
cd nodetop
cargo build --release
- Use the utility:
$ ./target/release/nodetop -h localhost
hostname         r  b |  id%  us%  sy%  io%  ni%  ir%  si%  st% |  gu%  gn% | scd rt scd wt |    l 1    l 5   l 15
localhost:9300   5  0 |  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN |  NaN  NaN |  0.000  0.000 |  0.040  0.060  0.060
localhost:9300   3  0 |   99    0    0    0    0    0    0    0 |    0    0 |  0.016  0.008 |  0.030  0.060  0.060
localhost:9300   5  0 |  100    0    0    0    0    0    0    0 |    0    0 |  0.008  0.001 |  0.030  0.060  0.060
localhost:9300   4  0 |   99    0    0    0    0    0    0    0 |    0    0 |  0.015  0.008 |  0.030  0.060  0.060
This shows nodetop measuring the machine I am running it on. The columns are:
- r/b: running and blocked figures from /proc/stat (the same figures vmstat reports as r and b): the total number of processes in the state running, and the total number of processes in the state uninterruptible sleep.
- id%/us%/sy%/io%/ni%/ir%/si%/st%/gu%/gn%: percentage of CPU time in the modes: idle, user, system, iowait, nice, irq, softirq, steal, guest user, guest nice.
- scd rt/wt: scheduler statistics: time running on CPU and time waiting to run, both in seconds per second.
- l 1/5/15: load figure for 1/5/15 minutes.
- Please mind you can monitor multiple machines at the same time by passing a comma-separated list to '-h'/'--hosts'.
- Please mind it currently uses port 9300 by default, which is not the node_exporter default port (9100). You can set another port using '-p'/'--ports', or multiple ports if you like or need.
If you look at the figures, you notice:
- there is a constant number of running processes (r): this is normal, and doesn't reflect how much work (on CPU) these processes or threads do.
- there is some time in 'scd rt': some work is done in the measurement window (5 seconds).
- there is some time in 'scd wt': yes, there will always be some latency between becoming runnable and being scheduled to run, because these are two different events, so there is always some waiting time.
Let's add some (simplistic) load to the machine. This can easily be done without any specialised tools using the 'yes' command:
$ yes > /dev/null &
Now look at the output:
localhost:9300   4  0 |  100    0    0    0    0    0    0    0 |    0    0 |  0.008  0.002 |  0.010  0.020  0.050
localhost:9300   5  0 |   62   38    1    0    0    0    0    0 |    0    0 |  0.770  0.024 |  0.090  0.040  0.050
localhost:9300   5  0 |   49   50    0    0    0    0    0    0 |    0    0 |  1.010  0.022 |  0.170  0.050  0.060
localhost:9300   7  0 |   49   50    0    0    0    0    0    0 |    0    0 |  1.021  0.051 |  0.230  0.070  0.060
localhost:9300   4  0 |   50   50    0    0    0    0    0    0 |    0    0 |  1.006  0.021 |  0.290  0.080  0.070
localhost:9300   6  0 |   45   52    3    0    0    0    0    0 |    0    0 |  1.093  0.290 |  0.350  0.100  0.070
localhost:9300   5  0 |   49   50    0    0    0    0    0    0 |    0    0 |  0.986  0.023 |  0.400  0.110  0.080
localhost:9300   5  0 |   49   50    0    0    0    0    0    0 |    0    0 |  1.040  0.044 |  0.450  0.130  0.080
The first line is without the 'yes' load generator. It was started during the measurement window of the second line, and is fully in effect from the third line onwards.
The CPU percentages now reflect the change in machine activity: idleness decreased to 50%, and user time increased to 50%. This is a two CPU system. The 'scd rt' runtime becomes roughly 1, which is 1 second of runtime per second, and the 'scd wt' wait time increases slightly because of the added activity. With more activity, the scheduler has less freedom to run a task the moment it becomes runnable.
This illustrates very well how slowly the load figures react: the quickest figure (load 1, the one minute load) moves by approximately 0.05 to 0.08 per 5-second interval here.
Now let's add a second process generating load (simply execute 'yes > /dev/null &' again):
localhost:9300   5  0 |   50   50    0    0    0    0    0    0 |    0    0 |  0.835  0.018 |  0.340  0.090  0.130
localhost:9300   5  0 |   49   50    0    0    0    0    0    0 |    0    0 |  1.015  0.042 |  0.400  0.110  0.140
localhost:9300   7  0 |    1   98    0    0    0    0    0    0 |    0    0 |  2.142  0.340 |  0.520  0.140  0.150
localhost:9300   8  0 |    0  100    0    0    0    0    0    0 |    0    0 |  2.009  0.415 |  0.640  0.170  0.160
localhost:9300   8  0 |    0  100    0    0    0    0    0    0 |    0    0 |  1.999  0.723 |  0.750  0.200  0.170
localhost:9300   9  0 |    0   98    2    0    0    0    0    0 |    0    0 |  1.996  1.980 |  1.170  0.300  0.200
localhost:9300   6  0 |    0  100    0    0    0    0    0    0 |    0    0 |  1.987  0.632 |  1.240  0.330  0.210
Adding a second CPU load generator reduces the idle percentage to 0% and increases the user percentage to 100%. Because this load can still be executed by the CPUs, the CPU figures in this case show the full picture.
The scheduler runtime also moves to roughly 2 seconds per second, which matches the number of CPUs in this system. The scheduler waiting time now gets way higher, to roughly 0.5 seconds per second, and is 'jumpy'. The reason is that besides the two processes running on the two CPUs, the kernel and all the other Linux processes also need some CPU time, and the scheduler can no longer schedule a task at will: it has to choose a task based on priority and push another one off CPU.
The load figure, which did not yet have the time to get to 1, now starts moving quicker with 2 load generators. But again: the Linux load figure cannot, by definition, show the current activity accurately, even if it were not polluted by measuring things other than CPU activity.
Please mind that with 2 load generators fully loading the CPUs, the scheduler waiting for runtime figure shows that this host is in fact already overloaded: against an available runtime of 2 seconds per second, the waiting time is over half a second per second. However, you cannot simply take the scheduler waiting for runtime figure and declare a value approaching 1 to mean overloaded: with a higher number of CPUs, there will be (slightly) more waiting to get on CPU.
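One possible way to put the two scheduler figures in perspective is to relate them to the number of CPUs and to each other. The sketch below is only an illustration using numbers copied from the nodetop output above. The two helper functions are my own naming, and the ratio is not a threshold for 'overloaded', for the reason just given.

```python
CPU_COUNT = 2  # the test machine in this article has 2 CPUs

def capacity_used(run_per_second, cpu_count=CPU_COUNT):
    """Fraction of the available CPU seconds per second actually spent running."""
    return run_per_second / cpu_count

def wait_per_run(run_per_second, wait_per_second):
    """Seconds spent waiting for CPU for every second spent running on CPU."""
    return wait_per_second / run_per_second

# (scd rt, scd wt) pairs copied from the nodetop output above.
samples = {
    "one load generator": (1.010, 0.022),
    "two load generators": (2.009, 0.415),
}
for label, (run, wait) in samples.items():
    print(f"{label}: {capacity_used(run):.0%} of capacity, "
          f"{wait_per_run(run, wait):.2f}s waiting per second of running")
```

With one load generator, tasks wait about 0.02 seconds for every second they run. With two, that grows to about 0.21 seconds, while the CPU percentages only tell you the machine went from half busy to fully busy.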
Now let's add a third load generator. The third load generator is beyond this system's capacity, because it has 2 CPUs:
localhost:9300   5  0 |    0  100    0    0    0    0    0    0 |    0    0 |  2.003  0.740 |  3.870  2.330  1.170
localhost:9300   7  0 |    0  100    0    0    0    0    0    0 |    0    0 |  2.007  0.393 |  3.720  2.320  1.180
localhost:9300   5  0 |    0  100    0    0    0    0    0    0 |    0    0 |  2.003  0.847 |  3.750  2.350  1.190
localhost:9300   7  0 |    0  100    0    0    0    0    0    0 |    0    0 |  1.983  1.319 |  3.770  2.380  1.210
localhost:9300   7  0 |    0  100    0    0    0    0    0    0 |    0    0 |  2.013  1.693 |  3.940  2.440  1.230
localhost:9300  10  0 |    0  100    0    0    0    0    0    0 |    0    0 |  1.994  2.140 |  3.870  2.440  1.240
localhost:9300  10  0 |    0  100    0    0    0    0    0    0 |    0    0 |  1.999  2.170 |  3.960  2.490  1.260
At the fourth line, the third load generator is added. The CPU figures already showed 100% on CPU before we added the third load generator, so they cannot be used to see the change.
The addition of the third load generator can be seen very well in the scheduler waiting for runtime figure: it jumps from ~0.7 seconds per second to over 2 seconds per second. This is another property of queueing with random, unordered arrivals: it gets inefficient, so the waiting time does not grow linearly, but slightly more than that.
The 1-minute load average is reasonably indicative here, because the generated loads are (very) steady, and there are no other types of load influencing the system (network, disk IO). The problem is that on a real system, the exact type of the load often is not easy to determine.
conclusion
The traditional tools for understanding CPU activity, meaning any tool that exposes CPU times (such as sar) and the load figure, are weak indicators of actual load and of whether a system is overloaded on CPU.
Using the scheduler waiting for runtime statistic, a very precise determination can be made of whether there are too many processes or threads willing to run, and thus getting queued.
ps. While researching the scheduler statistics, I found that Tanel Poder has already written about them. He takes a different approach, and measures the scheduler statistics of a single process. Every Linux task has its own individual scheduler statistics too.
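For the per-task statistics, a minimal sketch: /proc/<pid>/schedstat holds three numbers, the time spent on CPU (nanoseconds), the time spent waiting to run (nanoseconds) and the number of timeslices. The sample content below is made up and the function name is mine; this is not Tanel Poder's tooling.

```python
def parse_task_schedstat(text):
    """Return (run_seconds, wait_seconds, timeslices) from /proc/<pid>/schedstat text."""
    run_ns, wait_ns, slices = (int(v) for v in text.split())
    return run_ns / 1e9, wait_ns / 1e9, slices

# Made-up content of /proc/<pid>/schedstat for a task that ran 3s and waited 1.5s.
run_s, wait_s, slices = parse_task_schedstat("3000000000 1500000000 420\n")
print(run_s, wait_s, slices)
```

On a Linux machine you could feed it the contents of /proc/self/schedstat to look at the current process.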
You can kill all backgrounded (&) tasks for a session using:
kill -9 $(jobs -p)