
26 Case Study: Finding the Internal Logger That Writes Logs Excessively #

Hello, I’m Ni Pengfei.

In the previous two sections, we learned how the file system and disk I/O work. Let me review them with you.

The file system is a mechanism that organizes and manages files on storage devices. To support various file systems, Linux abstracts a layer called the virtual file system (VFS) on top of different file systems.

VFS defines a set of data structures and standard interfaces supported by all file systems. This way, applications and other subsystems in the kernel only need to interact with the unified interface provided by VFS.

Below the file system, in order to support the many different types of storage devices, Linux abstracts a generic block layer on top of them.

The generic block layer provides a standard interface for file systems and applications to access block devices, as well as a unified framework for various block device drivers. In addition, the generic block layer queues I/O requests received from file systems and applications, and improves the efficiency of disk read/write operations through reordering and request merging.

The next layer below the generic block layer is naturally the device layer, which includes various block device drivers and physical storage devices.

The file system, generic block layer, and device layer together constitute Linux’s storage I/O stack. I/O operations in a storage system are usually the slowest part of the entire system. Therefore, Linux uses various caching mechanisms to optimize I/O efficiency. For example,

  • To optimize file access performance, multiple caching mechanisms such as page cache, inode cache, and directory entry cache are used to reduce direct calls to underlying block devices.

  • Similarly, to optimize block device access efficiency, buffers are used to cache data from block devices.

However, when encountering file system and disk I/O issues, how should we locate and analyze them specifically? Today, I will use a case study of an application that generates a large amount of logs to analyze this situation.

Case Preparation #

This case is based on Ubuntu 18.04 and is also applicable to other Linux systems. The environment for this case is as follows:

  • Machine configuration: 2 CPUs, 8GB memory

  • Pre-install docker, sysstat, and other tools, for example: apt install docker.io sysstat

I want to thank Yang Xiangyi, a senior operations engineer at Vipshop, for contributing to today’s case and helping me refine it. The case is a small application developed in Python. To make it easy to run, I have packaged it into a Docker image, so you only need to run a docker command to start it.

Next, open a terminal and SSH into the machine used in the case, and install the aforementioned tools. As before, all the commands in the case are assumed to be run as the root user. If you log in to the system as a regular user, please run the command sudo su root to switch to the root user.

With that, the preparation work is complete. Next, we will formally enter the operation phase.

Friendly reminder: The core logic of the Python application in the case is relatively simple, and you may be able to spot the problem at a glance. However, the source code in actual production environments is much more complex. Therefore, I still recommend not looking at the source code before performing the operation to avoid preconceived biases. Treat it as a black box for analysis. This way, you can better grasp how to analyze an application that has performance bottlenecks from the perspective of system resource usage and approximately locate the bottleneck within the application.

Case Study #

First, we execute the following command in the terminal to run today’s target application:

$ docker run -v /tmp:/tmp --name=app -itd feisky/logapp 

Next, we run the ps command in the terminal to confirm that the application is running correctly. If everything is correct, you should see a process called app.py in the output of ps:

$ ps -ef | grep /app.py 
root     18940 18921 73 14:41 pts/0    00:00:02 python /app.py 

Then, let’s check whether the system has any performance issues. Which performance metrics should we observe? From the previous articles, we know that CPU, memory, and disk I/O are the system resources most prone to bottlenecks, so that is where we will focus: let’s observe how these resources are being used.

Of course, before you proceed, you should think about which tools to use and the order in which to use them. You can review the previous cases and ideas, think for yourself, and then continue with the following steps.

My idea is that we can start by using top to observe the usage of CPU and memory, and then use iostat to observe the disk I/O.

You can now run the top command in the terminal to observe the usage of CPU and memory:

# Press 1 to switch to the usage of each CPU 
$ top 
top - 14:43:43 up 1 day,  1:39,  2 users,  load average: 2.48, 1.09, 0.63 
Tasks: 130 total,   2 running,  74 sleeping,   0 stopped,   0 zombie 
%Cpu0  :  0.7 us,  6.0 sy,  0.0 ni,  0.7 id, 92.7 wa,  0.0 hi,  0.0 si,  0.0 st 
%Cpu1  :  0.0 us,  0.3 sy,  0.0 ni, 92.3 id,  7.3 wa,  0.0 hi,  0.0 si,  0.0 st 
KiB Mem :  8169308 total,   747684 free,   741336 used,  6680288 buff/cache 
KiB Swap:        0 total,        0 free,        0 used.  7113124 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND 
18940 root      20   0  656108 355740   5236 R   6.3  4.4   0:12.56 python 
1312 root      20   0  236532  24116   9648 S   0.3  0.3   9:29.80 python3 

Observing the output of top, you will notice that the usage of CPU0 is very high: its system CPU usage (sy) is 6%, and its iowait (wa) exceeds 90%. This indicates that an I/O-intensive process may be running on CPU0. But what is the reason? Let’s leave this question for now and continue.

Next, let’s take a look at the CPU usage of the processes. You will notice that the CPU usage of the python process has reached 6%, while the CPU usage of other processes is relatively low, not exceeding 0.3%. It seems that the python process is suspicious. Note down the PID (Process ID) of the python process, which is 18940, for further analysis.

Lastly, let’s check the memory usage. The total memory is 8GB, and the free memory is only about 730MB, while Buffer/Cache occupies as much as 6GB. In other words, memory is mainly taken up by caches. Although most caches can be reclaimed, we still need to understand where they are being used, to make sure that their usage is reasonable.
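
By the way, the buff/cache column in top lumps several kernel counters together, and the raw values behind it are exposed in /proc/meminfo. If you want to look at the split yourself, here is a minimal Python sketch (not part of the case’s application) that reads the relevant fields:

def read_meminfo(fields=("Buffers", "Cached", "SReclaimable")):
    '''Return the requested /proc/meminfo fields, in kB.'''
    result = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            if key in fields:
                # each value looks like "  747684 kB"
                result[key] = int(value.strip().split()[0])
    return result

print(read_meminfo())

Depending on the procps version, top’s buff/cache is roughly the sum of these fields, so comparing Buffers with Cached immediately shows which side dominates.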

At this point, you can basically conclude that the iowait in the CPU usage is a potential bottleneck, and the cache utilization in the memory portion is high. Now, what about the disk I/O?

Press Ctrl+C in the terminal to stop the top command, and then run the iostat command to observe the I/O usage:

# -d indicates display I/O performance metrics, -x indicates display extended statistics (i.e., all I/O metrics) 
$ iostat -x -d 1 
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util 
loop0            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00
sda              0.00   64.00      0.00  32768.00     0.00     0.00   0.00   0.00    0.00 7270.44 1102.18     0.00   512.00  15.50  99.20

Do you remember the meaning of these performance metrics? Take a moment to recall on your own. If you can’t remember, you can refer to the previous section, or use the man iostat command to check.

If you look at the last column of iostat, you will see that the I/O utilization of disk sda has reached 99%, indicating that it is likely already approaching I/O saturation.

Looking at the preceding metrics, the disk handles 64 write requests per second (w/s), writes 32 MB per second (wkB/s), has a write request response time (w_await) of about 7 seconds, and a request queue length (aqu-sz) of roughly 1,100. As a quick sanity check, 64 writes per second multiplied by the 512 kB average write request size (wareq-sz) gives 32,768 kB/s, which matches the wkB/s column.

The extremely slow response time and the very long request queue length further confirm the suspicion that the I/O is already saturated. At this point, disk sda has encountered a severe performance bottleneck.

Now it is easier to understand why the previous observation showed a high iowait of 90%. This is caused by the I/O bottleneck of disk sda. The focus of the analysis now is to identify the root cause of the I/O performance bottleneck. How can we determine the processes associated with these I/O requests?

I’m not sure if you still remember, but in the previous section, I mentioned that you can use pidstat or iotop to monitor the I/O activity of processes. Here, I will use pidstat to take a look.

By using pidstat with the -d option, you can display the I/O activity of each individual process. So, you can run the following command in the terminal to observe:

$ pidstat -d 1
15:08:35      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
15:08:36        0     18940      0.00  45816.00      0.00      96  python

15:08:36      UID       PID   kB_rd/s   kB_wr/s kB_ccwr/s iodelay  Command
15:08:37        0       354      0.00      0.00      0.00     350  jbd2/sda1-8
15:08:37        0     18940      0.00  46000.00      0.00      96  python
15:08:37        0     20065      0.00      0.00      0.00    1503  kworker/u4:2

From the output of pidstat, you can see that only the python process has significant write activity, and it writes more than 45 MB of data per second, even more than the 32 MB/s reported by iostat. Clearly, it is the python process that is causing the I/O bottleneck.

Now, let’s focus on the iodelay column. Although only the python process has a large amount of write activity, you should also note that the latency of two processes (kworker and jbd2) is even higher than the python process.

Among them, kworker is a kernel thread, and jbd2 is a kernel thread used to ensure data integrity in the ext4 file system. They are both kernel threads responsible for basic file system functionality, so we don’t need to worry about the specific details for now. We only need to understand that their latency is still caused by a large amount of I/O.

Based on the output of pidstat, the suspicion still falls on the python process. Next, let’s analyze what exactly the python process is writing.

First, pay attention to the PID number of the python process, which is 18940. Does it look familiar? In fact, it is the process we identified as having the highest CPU usage in the top command. However, although it has the highest CPU usage in top, it is only 6%, which is not considered high. Therefore, focusing on I/O issues for analysis is the correct approach.

Now that we know the PID number of the process, how can we check what it is writing specifically?

In fact, as I mentioned in the system call example, reading and writing files must be done through system calls. By observing system call activity, we can determine which files the process is writing to. Does strace come to mind? It is the most commonly used tool for analyzing system calls.
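
To make that mapping concrete, here is a tiny Python sketch (the file name below is made up for illustration) showing what strace relies on: every file write in the application eventually becomes a write() system call on a numeric file descriptor, and that descriptor number is exactly what strace prints:

import os

# Open a file through the raw system-call wrappers, so the file descriptor
# number, the same number strace shows in write(fd, ...), is visible explicitly.
fd = os.open("/tmp/syscall-demo.txt", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
print("file descriptor:", fd)
os.write(fd, b"hello from a write() system call\n")
os.close(fd)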

Next, we will run the strace command in the terminal, specifying the PID number of the python process with the -p 18940 option:

$ strace -p 18940
strace: Process 18940 attached
...
mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f7aee9000
mmap(NULL, 314576896, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f682e8000
write(3, "2018-12-05 15:23:01,709 - __main"..., 314572844
) = 314572844
munmap(0x7f0f682e8000, 314576896)       = 0
write(3, "\n", 1)                       = 1 
munmap(0x7f0f7aee9000, 314576896)       = 0 
close(3)                                = 0 
stat("/tmp/logtest.txt.1", {st_mode=S_IFREG|0644, st_size=943718535, ...}) = 0 
    

From the write() system call, we can see that the process wrote 300MB of data to the file with file descriptor number 3. It seems to be the file we are looking for. However, from the write() call, we can only see the file descriptor number, and the file name and path are still unknown.

Looking at the stat() call that follows, you can see that it is retrieving the status of /tmp/logtest.txt.1. This "dot+number" format file is very common in log rotation. We can guess that this is the first log rotation file, and the file currently being written to is /tmp/logtest.txt.
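
If you would like to see this naming convention in action, here is a minimal sketch using Python’s standard RotatingFileHandler; the path and sizes are made up for the demo and are not taken from the case:

import logging
from logging.handlers import RotatingFileHandler

demo = logging.getLogger("rotation-demo")
demo.setLevel(logging.INFO)
# Rotate once the file grows past roughly 1 kB, keeping a single backup file.
demo.addHandler(RotatingFileHandler("/tmp/rotation-demo.log", maxBytes=1024, backupCount=1))

for i in range(100):
    demo.info("message %d: %s", i, "x" * 64)

After this runs, /tmp/rotation-demo.log holds the newest messages, while the rotated older ones live in /tmp/rotation-demo.log.1, the same "dot+number" pattern we just saw with /tmp/logtest.txt.1.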

Of course, this is only our speculation and needs further verification. Here, let me introduce you to a new tool, lsof. It is specifically used to view a process’s list of open files, which includes not only regular files but also directories, block devices, shared libraries, network sockets, and so on.

Next, let's run the following lsof command in the terminal to see which files process 18940 has opened:

$ lsof -p 18940 
COMMAND   PID USER   FD   TYPE DEVICE  SIZE/OFF    NODE NAME 
python  18940 root  cwd    DIR   0,50      4096 1549389 / 
python  18940 root  rtd    DIR   0,50      4096 1549389 / 
… 
python  18940 root    2u   CHR  136,0       0t0       3 /dev/pts/0 
python  18940 root    3w   REG    8,1 117944320     303 /tmp/logtest.txt 

In this output, I will briefly introduce several columns. FD represents the file descriptor, TYPE represents the file type, and NAME represents the file path. These are the key points we need to pay attention to.

Looking at the last line, it shows that this process has opened the file /tmp/logtest.txt, and its file descriptor is number 3. The 'w' after 3 means it is opened in write-only mode.

This is consistent with our earlier analysis based on strace. It seems we have found the root cause of the problem: process 18940 is writing logs like crazy, 300MB at a time, and the log file being written is /tmp/logtest.txt.

Now that we have found the root cause of the problem, the next step is to check the source code and analyze why this process is logging so excessively.

You can run the docker cp command to copy the source code of the application out and then view its contents. (You can also click [here](https://github.com/feiskyer/linux-perf-examples/tree/master/logging-app) to view the source code of the application):

# Copy the source code of the application to the current directory
$ docker cp app:/app.py . 

# View the source code of the application
$ cat app.py 

logger = logging.getLogger(__name__) 
logger.setLevel(level=logging.INFO) 
rHandler = RotatingFileHandler("/tmp/logtest.txt", maxBytes=1024 * 1024 * 1024, backupCount=1) 
rHandler.setLevel(logging.INFO) 

def write_log(size): 
  '''Write logs to file''' 
  message = get_message(size) 
  while True: 
    logger.info(message) 
    time.sleep(0.1) 

if __name__ == '__main__': 
  msg_size = 300 * 1024 * 1024 
  write_log(msg_size) 

Analyzing this source code, we can see that it writes logs to /tmp/logtest.txt, records all logs at INFO level and above, and writes 300MB of log data at a time. This is consistent with our earlier analysis.

Generally, production systems should have the functionality to dynamically adjust log levels. Continuing to examine the source code, we can see that this program can also adjust the log level. If it receives the SIGUSR1 signal, it will adjust the log level to INFO level; if it receives the SIGUSR2 signal, it will adjust the log level to WARNING level:

def set_logging_info(signal_num, frame): 
  '''Set logging level to INFO when SIGUSR1 is received''' 
  logger.setLevel(logging.INFO) 

def set_logging_warning(signal_num, frame): 
  '''Set logging level to WARNING when SIGUSR2 is received''' 
  logger.setLevel(logging.WARNING) 

signal.signal(signal.SIGUSR1, set_logging_info) 
signal.signal(signal.SIGUSR2, set_logging_warning) 

Since the application logs with logger.info(message), its log messages are written at INFO level, which is also the level the logger is configured with at startup. Therefore, as long as we raise the logger’s level to WARNING, these INFO messages will be filtered out before they ever reach the log file, and the logging problem should be resolved.
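
To see why raising the level works, remember that a logger filters records by level before any handler, and therefore any file write, gets involved. Here is a minimal sketch (the logger name is hypothetical, not from the case’s source):

import logging

demo = logging.getLogger("level-demo")
demo.addHandler(logging.StreamHandler())

demo.setLevel(logging.INFO)
demo.info("emitted: INFO is not below the logger's level")

demo.setLevel(logging.WARNING)
demo.info("silently dropped: INFO is now below the logger's level")
demo.warning("still emitted: WARNING passes the threshold")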

Next, let's check if our analysis is correct. Run the following kill command in the terminal to send the SIGUSR2 signal to process 18940:

$ kill -SIGUSR2 18940 

Then, execute the top and iostat commands to observe:

$ top 
... 
%Cpu(s):  0.3 us,  0.2 sy,  0.0 ni, 99.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st 

$ iostat -d -x 1
Device            r/s     w/s     rkB/s     wkB/s   rrqm/s   wrqm/s  %rrqm  %wrqm r_await w_await aqu-sz rareq-sz wareq-sz  svctm  %util 
loop0            0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00 
sdb              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00 
sda              0.00    0.00      0.00      0.00     0.00     0.00   0.00   0.00    0.00    0.00   0.00     0.00     0.00   0.00   0.00

Observing the output of top and iostat, you will notice that after waiting for a while, the iowait goes to 0, and the I/O utilization of disk sda gradually reduces to 0.

At this point, we have not only identified the application that is logging excessively but also solved the I/O performance bottleneck by raising the log level.

Finally, don't forget to run the following command to stop the application:

$ docker rm -f app

Summary #

Logs are the most commonly used and most effective tool for understanding the internal workings of an application. Both the operating system and applications record large amounts of runtime logs for later reference. These logs are usually written at different levels; for example, debug-level logs are usually enabled only in development environments, while production environments typically record only warning- and error-level logs.

When troubleshooting application issues, we may need to temporarily enable the application’s debug logs in the production environment, and sometimes we forget to turn them back off afterwards. Leaving production logging below the warning level like this can cause a series of performance problems, such as high CPU usage or heavy disk I/O; in severe cases, it may even affect other applications running on the same server.

In the future, when encountering scenarios where excessive logging occurs, you can use tools such as iostat, strace, and lsof to locate the process that is generating excessive logs, find the corresponding log files, and then adjust the log level through the application’s interface to resolve the problem.

If the application cannot dynamically adjust the log level, you may also need to modify the application’s configuration and restart the application for the changes to take effect.
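
For example, with Python’s standard logging module, the level can be driven entirely from configuration, so that switching from INFO to WARNING only requires a config change plus a restart. Here is a minimal sketch; the paths and handler names are made up for illustration:

import logging
import logging.config

LOGGING_CONFIG = {
    "version": 1,
    "handlers": {
        "file": {
            "class": "logging.handlers.RotatingFileHandler",
            "filename": "/tmp/app.log",
            "maxBytes": 10 * 1024 * 1024,
            "backupCount": 1,
        },
    },
    "root": {
        "level": "WARNING",  # the value you would edit before restarting
        "handlers": ["file"],
    },
}

logging.config.dictConfig(LOGGING_CONFIG)
logging.getLogger(__name__).warning("logging configured at WARNING level")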

Reflection #

Finally, I have a question for you to ponder.

At the beginning of today’s case, we used top and iostat to look at the system’s resource usage. Besides the CPU and disk I/O issues, we also noticed that the free memory was fairly low, and that most of the memory was occupied by Buffer/Cache.

So, the question for today is, is this memory occupied by Buffer or Cache? Is there any way to confirm your analysis results?

Feel free to discuss with me in the comments section, and also feel free to share this article with your colleagues and friends. Let’s practice in real scenarios and improve through communication.