02 Fundamentals: How Should We Understand “Average Load”?

Hello, I am Ni Pengfei.

Whenever the system feels slow, the first thing we usually do is run the top or uptime command to check the system’s load. For example, I ran uptime below, and the system returned the result immediately.

$ uptime
02:34:03 up 2 days, 20:14,  1 user,  load average: 0.63, 0.83, 0.88

But let me ask: do you really understand what each column in the output means?

I believe you are familiar with the preceding columns, which are the current time, system running time, and the number of logged-in users, respectively.

02:34:03              // Current time
up 2 days, 20:14      // System running time
1 user                // Number of logged-in users

As for the last three numbers, they represent the average load in the past 1 minute, 5 minutes, and 15 minutes, respectively.
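
If you want to pull these three numbers out of the uptime output in a script, a minimal Python sketch could look like the following. The function name parse_load_averages is only an illustrative choice, and the sketch assumes the output format shown above.

```python
import re

def parse_load_averages(uptime_line):
    """Return the (1-min, 5-min, 15-min) load averages from one line of uptime output."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in: " + repr(uptime_line))
    return tuple(float(value) for value in match.groups())

line = "02:34:03 up 2 days, 20:14,  1 user,  load average: 0.63, 0.83, 0.88"
print(parse_load_averages(line))  # (0.63, 0.83, 0.88)
```

On a live Linux system, Python’s os.getloadavg() returns the same three values without any parsing.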

“Average load”? Many people find this term both familiar and strange. We mention it every day at work, but do you truly understand what it means? If an intern on your team pressed you for details, could you explain what average load is?

In fact, six years ago I was in exactly this situation. An intern at my company kept asking me what average load was. I stumbled for a while and in the end could not explain it clearly. I saw and used it all the time, so why couldn’t I explain it? Thinking it over calmly later, I realized that I simply did not understand it well enough.

So, over the years, whenever I encounter a problem, especially fundamental ones, I ask myself a few “whys” in order to fully understand the underlying principles of the phenomenon and be more flexible and confident in using it.

Today, I will teach you how to observe and understand this most common and important system indicator.

Some people might say: isn’t average load just the CPU usage per unit of time? By that reading, the 0.63 in the example would mean that CPU usage is 63%. That is not the case. If you have a terminal handy, run man uptime to read the detailed explanation of average load.

Simply put, average load is the average number of processes in the runnable or uninterruptible state per unit of time, that is, the average number of active processes. It is not directly related to CPU usage. Let me explain the two terms, runnable state and uninterruptible state.

Processes in the runnable state are processes that are using the CPU or waiting for the CPU. These are the processes that the ps command shows in the R state (Running or Runnable).

Processes in the uninterruptible state are processes that are in a critical kernel code path that must not be interrupted. The most common case is waiting for an I/O response from a hardware device, which the ps command shows as the D state (Uninterruptible Sleep, also known as Disk Sleep).

For example, when a process reads or writes data on a disk, it cannot be interrupted by other processes or interrupts until the disk responds, so that the data stays consistent. During this time the process is in the uninterruptible state. If it were interrupted at this point, the data on disk and the data in the process could become inconsistent.

Therefore, the uninterruptible state is actually a protection mechanism of the system for processes and hardware devices.
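
To make the two states concrete, here is a small Python sketch that counts R and D processes in a snapshot. It assumes the input is the text printed by `ps -eo stat` (a STAT header followed by one state string per process); the function name and the sample text are made up for illustration.

```python
from collections import Counter

def count_active_processes(ps_stat_output):
    """Count processes in the R (runnable) and D (uninterruptible) states.

    ps_stat_output is assumed to be the text printed by `ps -eo stat`:
    a STAT header followed by one state string per process, e.g. "Ss" or "R+".
    """
    states = Counter()
    for line in ps_stat_output.splitlines()[1:]:  # skip the header line
        line = line.strip()
        if line:
            states[line[0]] += 1  # the first character is the main state
    return {"R": states["R"], "D": states["D"]}

sample = "STAT\nSs\nR+\nD\nS\nR\nI<\n"  # hypothetical snapshot
print(count_active_processes(sample))  # {'R': 2, 'D': 1}
```

Keep in mind that this is a single snapshot at one moment, whereas the load average is a value smoothed over time.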

So you can simply think of average load as the average number of active processes. Intuitively, that is the number of active processes per unit of time, but what the system actually reports is an exponentially decaying average of the active process count. You do not need to worry about the details of “exponentially decaying average”; it is just a faster way for the system to compute the value. You can simply treat it as the average number of active processes.
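
If you are curious anyway, the idea of an exponentially decaying average can be sketched in a few lines of Python. This is a simplified floating-point illustration, not the kernel’s code: as far as I know, the Linux kernel updates the value roughly every 5 seconds using fixed-point arithmetic. The function names and the 5-second default are assumptions made for this sketch.

```python
import math

def update_load(previous_load, active_count, window_seconds, sample_seconds=5.0):
    """Apply one step of an exponentially decaying average."""
    decay = math.exp(-sample_seconds / window_seconds)
    # Old history fades by `decay`; the new sample fills in the rest.
    return previous_load * decay + active_count * (1.0 - decay)

def simulate(active_count, duration_seconds, window_seconds,
             start=0.0, sample_seconds=5.0):
    """Feed a constant number of active processes for duration_seconds."""
    load = start
    for _ in range(int(duration_seconds / sample_seconds)):
        load = update_load(load, active_count, window_seconds, sample_seconds)
    return load

# One busy process on a previously idle system, 1-minute window:
print(round(simulate(1, 60, 60), 2))   # 0.63 after one minute
print(round(simulate(1, 600, 60), 2))  # 1.0 after ten minutes
```

The sketch shows why the 1-minute value does not jump to 1 the moment a busy process starts: it climbs toward 1 gradually, and old samples keep fading out.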

Since average load is the average number of active processes, the ideal situation is that each CPU runs exactly one process, so that every CPU is fully utilized. For example, what does an average load of 2 mean?

  • On a system with only 2 CPUs, it means that all CPUs are fully occupied.

  • On a system with 4 CPUs, it means that the CPUs are 50% idle.

  • And on a system with only 1 CPU, it means that half of the processes are competing for CPU time.
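
The three cases above come down to dividing the load by the number of CPUs. A minimal Python sketch of that arithmetic follows; load_per_cpu is an illustrative name, not a standard function.

```python
def load_per_cpu(load_average, cpu_count):
    """Return the load average relative to the number of CPUs (1.0 = fully used)."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count

for cpus in (2, 4, 1):
    ratio = load_per_cpu(2.0, cpus)
    print(f"{cpus} CPU(s): {ratio:.0%} of capacity")
# 2 CPU(s): 100% of capacity
# 4 CPU(s): 50% of capacity
# 1 CPU(s): 200% of capacity
```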

What is a reasonable average load? #

Now that we know what average load is, let’s go back to the original example. Looking at the uptime output, can you tell whether the average load over those three periods is high or low?

We know that the ideal average load is equal to the number of CPUs. So, when judging the average load, first you need to know how many CPUs the system has, which can be obtained using the top command or by reading the file /proc/cpuinfo, for example:

# Please consult the manuals or search the internet for the usage of grep and wc
$ grep 'model name' /proc/cpuinfo | wc -l
2
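
The same count can be obtained in Python. The sketch below mirrors the grep command by counting 'model name' lines in a piece of /proc/cpuinfo text; the sample text is a shortened, made-up excerpt, and count_model_name_lines is an illustrative name.

```python
import os

def count_model_name_lines(cpuinfo_text):
    """Count lines containing 'model name', like grep 'model name' | wc -l."""
    return sum(1 for line in cpuinfo_text.splitlines() if "model name" in line)

# Hypothetical, shortened excerpt of /proc/cpuinfo for a 2-CPU machine.
sample = (
    "processor\t: 0\nmodel name\t: Example CPU\n\n"
    "processor\t: 1\nmodel name\t: Example CPU\n"
)
print(count_model_name_lines(sample))  # 2

# The standard library can also report the number of logical CPUs directly.
print(os.cpu_count())
```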

Once we know the number of CPUs, we can tell that the system is overloaded when the average load is larger than the number of CPUs.

However, a new question arises. In the example, we can see that there are three values for the average load. Which one should we refer to?

Actually, you should look at all three. The three averages over different time intervals give us the data to analyze the trend of the system load, so we can understand the current load situation more completely.

To give an analogy, it’s like the weather in Beijing in early autumn. If you only look at the temperature at noon, you might think it’s still mid-summer in July. But if you combine the temperatures at morning, noon, and evening, you can basically have a comprehensive understanding of the weather conditions for that day.

The three load averages work on the same principle.

  • If the values for 1 minute, 5 minutes, and 15 minutes are roughly the same or not significantly different, it indicates that the system load is stable.

  • But if the value for 1 minute is much smaller than the value for 15 minutes, it means that the load in the recent 1 minute is decreasing, while there has been a high load in the past 15 minutes.

  • On the contrary, if the value for 1 minute is much larger than the value for 15 minutes, it indicates that the load in the recent 1 minute is increasing. This increase could be temporary or it could continue to increase, so continuous observation is needed. Once the average load for 1 minute approaches or exceeds the number of CPUs, it means that the system is experiencing an overload issue, and the cause of the problem needs to be analyzed and optimized.

Here’s another example: suppose we see load averages of 1.73, 0.60, 7.98 on a single-CPU system. This means the system was overloaded by 73% over the past 1 minute and by 698% over the past 15 minutes. Looking at the overall trend, the system load is decreasing.
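
The trend rules and the overload arithmetic in this example can be sketched in Python as follows. The function names are illustrative, and the 20% tolerance used to decide what counts as “roughly the same” is an assumption, not a rule from any tool.

```python
def describe_trend(load1, load15, tolerance=0.2):
    """Compare the 1-minute and 15-minute load averages.

    tolerance is the relative difference still treated as "roughly the same"
    (an assumed value chosen for illustration).
    """
    reference = max(load1, load15)
    if reference == 0 or abs(load1 - load15) <= tolerance * reference:
        return "stable"
    return "rising" if load1 > load15 else "falling"

def overload_percent(load_average, cpu_count):
    """Return how far the load exceeds the CPU count, in percent (0 if it does not)."""
    return max(0.0, (load_average - cpu_count) / cpu_count * 100)

load1, load5, load15 = 1.73, 0.60, 7.98  # the single-CPU example above
print(describe_trend(load1, load15))  # falling
print(round(overload_percent(load1, 1)), round(overload_percent(load15, 1)))  # 73 698
```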

So, in a real production environment, at what level should we pay special attention to the average load?

In my opinion, when the average load exceeds 70% of the number of CPUs, you should analyze and investigate the high load issue. Once the load becomes too high, it may cause processes to respond slowly and affect the normal functioning of services.

However, the 70% figure is not absolute. The approach I recommend most is to monitor the system’s average load and judge the trend against more historical data. When you notice a significant increase, such as the load doubling, analyze and investigate further.
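
As a rough sketch of how these two checks might be combined in a monitoring script, consider the Python below. The function name, the baseline_load parameter (a value you would take from your own historical data), and the returned messages are all assumptions made for illustration.

```python
def reasons_to_investigate(load1, cpu_count, baseline_load=None,
                           threshold=0.7, growth_factor=2.0):
    """Return a list of reasons why the current load deserves a closer look."""
    reasons = []
    if load1 > threshold * cpu_count:
        reasons.append("load exceeds the threshold share of the CPU count")
    if (baseline_load is not None and baseline_load > 0
            and load1 >= growth_factor * baseline_load):
        reasons.append("load has at least doubled compared with the baseline")
    return reasons

print(reasons_to_investigate(0.63, 2))                      # []
print(len(reasons_to_investigate(1.5, 2)))                  # 1
print(len(reasons_to_investigate(1.5, 2, baseline_load=0.5)))  # 2
```

An empty list means neither rule fired. Tune threshold and growth_factor to whatever your own history suggests.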

Average Load and CPU Usage #

In practical work, it’s easy to confuse average load and CPU usage, so here I will make a distinction.

You may wonder, since average load represents the number of active processes, does a high average load mean high CPU usage?

We need to go back to the meaning of average load. Average load is the average number of processes in the runnable and uninterruptible states per unit of time. Therefore, it includes not only the processes currently using the CPU, but also the processes waiting for the CPU and the processes waiting for I/O.

On the other hand, CPU usage is a statistical measurement of the CPU’s busy time within a unit of time, and it doesn’t necessarily correspond exactly to the average load. For example:

  • CPU-intensive processes, which use a lot of CPU, will cause the average load to increase, and in this case, the two are consistent;

  • I/O-intensive processes, which are waiting for I/O operations, can also cause the average load to increase, but CPU usage may not be very high;

  • A large number of processes waiting for the CPU to be scheduled can also cause the average load to increase, and in this case, CPU usage will be relatively high.

Average Load Analysis #

Next, we will look at three scenarios using three examples, and use tools such as iostat, mpstat, and pidstat to identify the root cause of the high average load.

Because the case analysis is based on machine operations, it’s better to follow along and actually perform the operations instead of just listening and watching.

Your Preparation #

The following examples are all based on Ubuntu 18.04, and they also apply to other Linux systems. The environment for the examples I’m using is as follows:

  • Machine configuration: 2 CPUs, 8GB memory.

  • Pre-install the stress and sysstat packages, for example, apt install stress sysstat.

Here, I will briefly introduce stress and sysstat.

stress is a Linux system stress testing tool that we use to simulate scenarios where the average load increases due to abnormal processes.

sysstat contains commonly used Linux performance tools used to monitor and analyze system performance. We will use two commands from this package, mpstat and pidstat, in our examples.

  • mpstat is a commonly used multi-core CPU performance analysis tool that is used to view real-time performance metrics for each CPU and the average metrics for all CPUs.

  • pidstat is a commonly used process performance analysis tool that is used to view real-time performance metrics for processes such as CPU, memory, I/O, and context switches.

In addition, each scenario requires you to open three terminals and log in to the same Linux machine.

Before the experiment, make sure you have completed the preparation mentioned above. If there are any issues with package installation, you can try to solve them by searching on Google first. If you still can’t solve them, you can leave a message in the comment section; this should not be difficult to resolve.

Also note that all the commands below are run as the root user by default. So if you are logged in to the system as a regular user, be sure to run the sudo su root command to switch to the root user.

If you have completed the above requirements, you can use the uptime command to see the average load before the test:

$ uptime
...,  load average: 0.11, 0.15, 0.09

Scenario 1: CPU-Intensive Processes #

First, in the first terminal, we run the stress command to simulate a scenario in which one CPU is 100% used:

$ stress --cpu 1 --timeout 600

Next, in the second terminal, we run uptime to see the change in average load:

# -d highlights the parts of the output that change
$ watch -d uptime
...,  load average: 1.00, 0.75, 0.39

Finally, in the third terminal, we run mpstat to see the change in CPU usage:

# -P ALL indicates monitoring all CPUs, and the number 5 afterwards indicates outputting a set of data every 5 seconds
$ mpstat -P ALL 5
Linux 4.15.0 (ubuntu) 09/22/18 _x86_64_ (2 CPU)
13:30:06     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
13:30:11     all   50.05    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.95
13:30:11       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
13:30:11       1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

From terminal 2, we can see that the 1-minute average load slowly increases to 1.00. From terminal 3, we can see that exactly one CPU is at 100% usage, while its iowait is 0. This indicates that the increase in average load is caused by the 100% CPU usage.
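
If you would rather read these numbers in a script than by eye, the following Python sketch parses mpstat output of the shape shown above and picks the busiest CPU. It assumes the 24-hour timestamp format from this example (a single time token before the CPU column); parse_mpstat is an illustrative name.

```python
def parse_mpstat(report):
    """Parse mpstat text into a list of dicts, one per data row."""
    columns, start, rows = None, 0, []
    for line in report.splitlines():
        tokens = line.split()
        if "CPU" in tokens and "%idle" in tokens:  # the column header row
            start = tokens.index("CPU")
            columns = tokens[start:]
        elif columns and len(tokens) == start + len(columns):
            values = tokens[start:]
            row = {"CPU": values[0]}  # "all" or a CPU number, kept as text
            for name, value in zip(columns[1:], values[1:]):
                row[name] = float(value)
            rows.append(row)
    return rows

report = """\
13:30:06     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
13:30:11     all   50.05    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   49.95
13:30:11       0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
13:30:11       1  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
"""
per_cpu = [row for row in parse_mpstat(report) if row["CPU"] != "all"]
busiest = min(per_cpu, key=lambda row: row["%idle"])
print(busiest["CPU"], busiest["%usr"], busiest["%iowait"])  # 1 100.0 0.0
```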

So, which process is causing the CPU usage to reach 100%? You can use pidstat to query:

# Output one set of data after a 5-second interval
$ pidstat -u 5 1
13:37:07      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
13:37:12        0      2962  100.00    0.00    0.00    0.00  100.00     1  stress

It is obvious from here that the CPU usage of the stress process is 100%.

Scenario 2: I/O-Intensive Processes #

First, run the stress command again, but this time simulate I/O pressure, that is, continuously execute sync:

$ stress -i 1 --timeout 600

In the second terminal, run uptime to check the changes in average load:

$ watch -d uptime
...,  load average: 1.06, 0.58, 0.37

Then, in the third terminal, run mpstat to observe the changes in CPU usage:

# Display the metrics of all CPUs and output one set of data after a 5-second interval
$ mpstat -P ALL 5 1
Linux 4.15.0 (ubuntu)     09/22/18     _x86_64_    (2 CPU)
13:41:28     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
13:41:33     all    0.21    0.00   12.07   32.67    0.00    0.21    0.00    0.00    0.00   54.84
13:41:33       0    0.43    0.00   23.87   67.53    0.00    0.43    0.00    0.00    0.00    7.74
13:41:33       1    0.00    0.00    0.81    0.20    0.00    0.00    0.00    0.00    0.00   98.99

From here, we can see that the 1-minute average load slowly increases to 1.06, while on one CPU the system CPU usage (%sys) rises to 23.87% and iowait reaches 67.53%. This indicates that the increase in average load is due to the increase in iowait.

So which process exactly is causing such a high iowait? Let’s use pidstat to find out:

# Output one set of data after a 5-second interval; -u selects CPU metrics
$ pidstat -u 5 1
Linux 4.15.0 (ubuntu)     09/22/18     _x86_64_    (2 CPU)
13:42:08      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
13:42:13        0       104    0.00    3.39    0.00    0.00    3.39     1  kworker/1:1H
13:42:13        0       109    0.00    0.40    0.00    0.00    0.40     0  kworker/0:1H
13:42:13        0      2997    2.00   35.53    0.00    3.99   37.52     1  stress
13:42:13        0      3057    0.00    0.40    0.00    0.00    0.40     0  pidstat

We can see that it is still the stress process causing it.

Scenario 3: A Large Number of Processes #

When the number of running processes exceeds what the CPUs can handle, some processes end up waiting for the CPU.

For example, we can still use stress, but this time simulate 8 processes:

$ stress -c 8 --timeout 600

Since the system has only 2 CPUs, far fewer than the 8 processes, the CPUs are severely overloaded and the average load reaches 7.97:

$ uptime
...,  load average: 7.97, 5.93, 3.02

Next, run pidstat to see the status of the processes:

# Output one set of data after a 5-second interval
$ pidstat -u 5 1
14:23:25      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
14:23:30        0      3190   25.00    0.00    0.00   74.80   25.00     0  stress
14:23:30        0      3191   25.00    0.00    0.00   75.20   25.00     0  stress
14:23:30        0      3192   25.00    0.00    0.00   74.80   25.00     1  stress
14:23:30        0      3193   25.00    0.00    0.00   75.00   25.00     1  stress
14:23:30        0      3194   24.80    0.00    0.00   74.60   24.80     0  stress
14:23:30        0      3195   24.80    0.00    0.00   75.00   24.80     0  stress
14:23:30        0      3196   24.80    0.00    0.00   74.60   24.80     1  stress
14:23:30        0      3197   24.80    0.00    0.00   74.80   24.80     1  stress
14:23:30        0      3200    0.00    0.20    0.00    0.20    0.20     0  pidstat

We can see that the 8 processes are competing for 2 CPUs, and each process spends up to 75% of its time waiting for the CPU (the %wait column in the output above). These processes exceed the CPUs’ computing capacity and ultimately lead to CPU overload.
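
The 25% and 75% figures are what you would expect from simple arithmetic. Here is a short Python sketch, assuming identical CPU-bound processes that the scheduler treats evenly and no other significant load on the machine; expected_share is an illustrative name.

```python
def expected_share(process_count, cpu_count):
    """Return the expected (%CPU, %wait) per process under even sharing."""
    if process_count <= 0 or cpu_count <= 0:
        raise ValueError("counts must be positive")
    # A process can use at most one full CPU, hence the cap at 1.0.
    cpu_share = min(1.0, cpu_count / process_count) * 100
    return cpu_share, 100 - cpu_share

print(expected_share(8, 2))  # (25.0, 75.0)
print(expected_share(1, 2))  # (100.0, 0.0)
```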

Summary #

After analyzing these three cases, let me summarize what we have learned about average load.

Average load provides a means to quickly view the overall performance of a system and reflects the overall load situation. However, by only looking at the average load itself, we cannot directly identify where the bottleneck is. Therefore, when understanding the average load, we should also pay attention to:

  • A high average load may be caused by CPU-intensive processes;

  • A high average load does not necessarily mean high CPU usage; it could also indicate increased I/O activity;

  • When encountering high loads, you can use tools such as mpstat and pidstat to assist in analyzing the source of the load.
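
As a very rough first-pass triage, the reasoning from the three scenarios could be written down like this in Python. The comparison rules and the function name are illustrative assumptions, not fixed thresholds, and the result is only a hint about where to look next with mpstat and pidstat.

```python
def likely_cause(load1, cpu_count, busy_percent, iowait_percent):
    """Return a rough first guess at what is driving the load average.

    busy_percent:   %usr + %sys from the "all" row of mpstat
    iowait_percent: %iowait from the same row
    """
    if iowait_percent > busy_percent:
        return "I/O wait"
    if load1 > cpu_count:
        return "CPU contention"
    return "CPU-intensive"

# Scenario 1 and 2 use the mpstat "all" rows shown earlier.
print(likely_cause(1.00, 2, 50.05, 0.00))          # CPU-intensive
print(likely_cause(1.06, 2, 0.21 + 12.07, 32.67))  # I/O wait
# Scenario 3 has no mpstat output in the text; 100% busy and 0% iowait
# are assumed here, derived from 8 processes at about 25% each on 2 CPUs.
print(likely_cause(7.97, 2, 100.0, 0.0))           # CPU contention
```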

Reflection #

Finally, I would like to invite you to discuss your understanding of average load with me. When you find that the average load has increased, how do you analyze and troubleshoot it? You can summarize your thoughts based on my previous explanation. Feel free to discuss with me in the comments.
