17 Case Study: How to Use the System Cache to Optimize Program Execution Efficiency #

Hello, I’m Ni Pengfei.

In the previous section, we learned about the concepts of Buffer and Cache in memory performance. To briefly review: Buffer and Cache are both designed to improve the system's I/O performance. They use memory as a bridge between the slow disk and the fast CPU, thereby accelerating I/O access.

Buffer caches data read from and written to disks, while Cache caches file data read and written through the file system.

  • From the writing perspective, they not only optimize writes to disks and files, but also benefit the application: the program can return and move on to other work before the data is actually flushed to disk.

  • From the reading perspective, they speed up access to frequently read data and reduce the pressure that frequent I/O puts on the disk.

Since Buffer and Cache have such a significant impact on system performance, can we take advantage of them during software development to optimize I/O performance and improve application efficiency?

The answer is definitely yes. Today, I will use several case studies to help you better understand the role of caching and learn how to make full use of these caches to improve program efficiency.

For your convenience, I will continue to use the English terms Buffer and Cache for the two kernel caches, to avoid confusion with the generic word “cache” (缓存 in the original Chinese text), which in this article simply means temporarily storing data in memory.

Cache Hit Rate #

Before we start, you should habitually ask yourself a question: when you want to achieve something, how should you evaluate the results? For example, if we want to improve the efficiency of a program using caching, how should we evaluate the effectiveness? In other words, is there any indicator that can measure the quality of cache usage?

I assume you have already thought of the cache hit rate. It is the percentage of requests that obtain their data directly from the cache, out of the total number of data requests.

The higher the hit rate, the higher the benefits of using caching, and the better the performance of the application.
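
Expressed as a formula, with hits and misses counted over the same interval:

cache hit rate = hits / (hits + misses) × 100%

For example, if 90 out of 100 data requests are served straight from the cache, the hit rate is 90%.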

In fact, caching is an essential core component of every high-concurrency system today. Its main job is to keep frequently accessed data (so-called hot data) in memory in advance, so that the next access can be served straight from memory without touching the disk, thereby speeding up the application's response.

These independent cache modules usually provide query interfaces that allow us to check the cache hit situation at any time. However, the Linux system does not directly provide these interfaces, so I want to introduce cachestat and cachetop here, which are tools for viewing system cache hit rates.

  • cachestat provides the read and write hit rates of the entire operating system cache.

  • cachetop provides the cache hit rates of each process.

These two tools are part of the bcc package. They rely on the Linux kernel's eBPF (extended Berkeley Packet Filter) mechanism to trace the caches managed by the kernel, and then output cache usage and hit rates.

Note that eBPF's working principle is not our focus today; just remember the name, and we will study it in detail in future articles. What you need to grasp today is how to use these two tools.

Before using cachestat and cachetop, we need to install the bcc software package first. For example, on an Ubuntu system, you can run the following commands to install it:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 4052245BD4284CDD
echo "deb https://repo.iovisor.org/apt/xenial xenial main" | sudo tee /etc/apt/sources.list.d/iovisor.list
sudo apt-get update
sudo apt-get install -y bcc-tools libbcc-examples linux-headers-$(uname -r)

Note: bcc-tools requires a kernel version of 4.1 or later. If you are using CentOS, you need to manually upgrade the kernel version before installation.

After completing these steps, all the tools provided by bcc are installed under /usr/share/bcc/tools. Note, however, that bcc does not add these tools to the system PATH by default, so you need to configure it manually:

$ export PATH=$PATH:/usr/share/bcc/tools

After configuring it, you can run the cachestat and cachetop commands. For example, the following is a running interface of cachestat. It outputs 3 sets of cache statistics data at intervals of 1 second:

$ cachestat 1 3
   TOTAL   MISSES     HITS  DIRTIES   BUFFERS_MB  CACHED_MB
       2        0        2        1           17        279
       2        0        2        1           17        279
       2        0        2        1           17        279 

As you can see, the output of cachestat is actually a table. Each row represents a set of data, and each column represents a different cache statistic indicator. These indicators represent from left to right:

  • TOTAL: the total I/O count;

  • MISSES: the count of cache misses;

  • HITS: the count of cache hits;

  • DIRTIES: the number of dirty pages added to the cache;

  • BUFFERS_MB: the size of Buffers in MB;

  • CACHED_MB: the size of Cache in MB.
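
In this output, TOTAL equals HITS plus MISSES, so the overall hit rate for each interval works out to HITS / TOTAL; in the sample above that is 2 / 2, i.e. a 100% hit rate.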

Next, let’s take a look at the running interface of cachetop:

$ cachetop
11:58:50 Buffers MB: 258 / Cached MB: 347 / Sort: HITS / Order: ascending
PID      UID      CMD              HITS     MISSES   DIRTIES  READ_HIT%  WRITE_HIT%
   13029 root     python                  1        0        0     100.0%       0.0%

Its output is similar to top: by default, it is sorted by the number of cache hits (HITS) and displays the cache hit situation of each process. The HITS, MISSES, and DIRTIES columns have the same meanings as in cachestat, namely the number of cache hits, cache misses, and dirty pages added to the cache during the interval.

READ_HIT% and WRITE_HIT% represent the cache hit rates for reads and writes, respectively.

Cache Size of a Specified File #

In addition to the cache hit rate, there is another metric that you may be interested in, which is the cache size of a specified file in memory. You can use the tool pcstat to view the cache size and cache ratio of a file in memory.

pcstat is a tool written in Go, so you need to install the Go toolchain first; you can download it from the official Go website.

After installing Go, run the following commands to install pcstat:

$ export GOPATH=~/go
$ export PATH=~/go/bin:$PATH
$ go get golang.org/x/sys/unix
$ go get github.com/tobert/pcstat/pcstat
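
Note that these go get commands target older Go releases. Since Go 1.17, go get no longer installs binaries, so on a recent toolchain you would use something like go install github.com/tobert/pcstat/pcstat@latest instead (the exact module path is an assumption; check the pcstat project page).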

Once everything is installed, you can run pcstat to view the cache status of a file. For example, the following is an example of running pcstat, which shows the cache status of the file /bin/ls:

$ pcstat /bin/ls
+---------+----------------+------------+-----------+---------+
| Name    | Size (bytes)   | Pages      | Cached    | Percent |
|---------+----------------+------------+-----------+---------|
| /bin/ls | 133792         | 33         | 0         | 000.000 |
+---------+----------------+------------+-----------+---------+

In this output, Cached indicates the size of /bin/ls in the cache, and Percent is the percentage of the cache. As you can see, both of them are 0, indicating that /bin/ls is not in the cache.

Next, if you execute the ls command and then run the same command to check again, you will find that /bin/ls is now in the cache:

$ ls
$ pcstat /bin/ls
+---------+----------------+------------+-----------+---------+
| Name    | Size (bytes)   | Pages      | Cached    | Percent |
|---------+----------------+------------+-----------+---------|
| /bin/ls | 133792         | 33         | 33        | 100.000 |
+---------+----------------+------------+-----------+---------+
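
Under the hood, pcstat relies on the mincore(2) system call, which reports which pages of a file mapping are currently resident in memory, so you can perform the same check from your own programs. Here is a minimal C sketch of the idea (my own illustration, not pcstat's actual source):

/* Minimal pcstat-like check: count how many pages of a file are resident
 * in the page cache, using mmap(2) + mincore(2). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return 1; }

    /* PROT_NONE: map the file without reading it, so the check itself
     * does not pull any pages into the cache. */
    void *addr = mmap(NULL, st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    size_t pages = (st.st_size + page - 1) / page;
    unsigned char *vec = malloc(pages);

    /* mincore() sets the low bit of vec[i] if page i is resident in memory. */
    if (vec && mincore(addr, st.st_size, vec) == 0) {
        size_t cached = 0;
        for (size_t i = 0; i < pages; i++)
            cached += vec[i] & 1;
        printf("%s: %zu/%zu pages cached (%.1f%%)\n",
               argv[1], cached, pages, 100.0 * cached / pages);
    }

    free(vec);
    munmap(addr, st.st_size);
    close(fd);
    return 0;
}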

Now that you know the relevant metrics for caching and how to view the system cache, let’s move on to today’s actual case.

Like the previous cases, today's cases are based on Ubuntu 18.04, but they are applicable to other Linux systems as well.

  • Machine configuration: 2 CPUs, 8GB of memory.

  • Install the bcc and pcstat packages in advance according to the above steps, and add the installation paths of these tools to the PATH environment variable.

  • Install the Docker package in advance, such as apt-get install docker.io.

Case Study 1 #

In the first case study, let’s take a look at the dd command mentioned in the previous section.

dd is often used as a disk and file copy tool, and also to test the read and write performance of disks or file systems. However, since caching affects performance, what happens if we use dd to run several read tests on the same file?

Let’s give it a try. First, open two terminals and connect to an Ubuntu machine, making sure that bcc is installed and configured successfully.

Then, use the dd command to generate a temporary file for the subsequent file read test:

# Generate a 512 MB temporary file
$ dd if=/dev/sda1 of=file bs=1M count=512
# Drop the caches (3 frees both the page cache and slab objects such as dentries and inodes)
$ echo 3 > /proc/sys/vm/drop_caches

Continuing in the first terminal, run the pcstat command to confirm that the file just generated is not in the cache. If everything is normal, you will see that both Cached and Percent are 0:

$ pcstat file
+-------+----------------+------------+-----------+---------+
| Name  | Size (bytes)   | Pages      | Cached    | Percent |
|-------+----------------+------------+-----------+---------|
| file  | 536870912      | 131072     | 0         | 000.000 |
+-------+----------------+------------+-----------+---------+

Still in the first terminal, now run the cachetop command:

# Refresh the data every 5 seconds
$ cachetop 5

This time, in the second terminal, run the dd command to test the read speed of the file:

$ dd if=file of=/dev/null bs=1M
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 16.0509 s, 33.4 MB/s

From the result of dd, it can be seen that the read performance of this file is 33.4 MB/s. Since we have cleared the cache before running the dd command, dd needs to read the data from the disk through the file system.

However, does this mean that all read requests from dd are directly sent to the disk?

Let’s go back to the first terminal and check the cache hit rate in the cachetop interface:

PID      UID      CMD              HITS     MISSES   DIRTIES  READ_HIT%  WRITE_HIT%
...
    3264 root     dd                  37077    37330        0      49.8%      50.2%

From the result of cachetop, we can see that not all of the reads fell through to the disk; in fact, the cache hit rate of the read requests is only 50%.

Next, let’s continue to try the same test command. Switch back to the second terminal and run the dd command again:

$ dd if=file of=/dev/null bs=1M
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 0.118415 s, 4.5 GB/s

Were you a bit surprised by this result? The read performance has jumped to 4.5 GB/s, far higher than the first run. Why is this run so much faster?

Let’s go back to the first terminal and look at the situation in cachetop:

10:45:22 Buffers MB: 4 / Cached MB: 719 / Sort: HITS / Order: ascending
PID      UID      CMD              HITS     MISSES   DIRTIES  READ_HIT%  WRITE_HIT%
...
   32642 root     dd                 131637        0        0     100.0%       0.0%

Obviously, cachetop's output has changed a lot too. You can see that the read hit rate is 100.0%, meaning all of dd's read requests hit the cache, which explains the very high performance.

Then, go back to the second terminal and run pcstat again to check the cache status of the file “file”:

$ pcstat file
+-------+----------------+------------+-----------+---------+
| Name  | Size (bytes)   | Pages      | Cached    | Percent |
|-------+----------------+------------+-----------+---------|
| file  | 536870912      | 131072     | 131072    | 100.000 |
+-------+----------------+------------+-----------+---------+

From the result of pcstat, you can see that the test file “file” has been completely cached, which is consistent with the observed 100% cache hit rate.

These two results indicate that the system cache has a significant acceleration effect on the second dd operation, greatly improving the file read performance.

However, at the same time, it is also important to note that if we use dd as a tool to test the performance of the file system, the presence of caching will lead to seriously distorted test results.
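
As an aside, if you do want dd to measure real disk performance, GNU dd can bypass the page cache itself via the iflag=direct (for reads) and oflag=direct (for writes) options, instead of dropping the caches before every run.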

Case Study 2 #

Next, let's look at another case of file reading. This case is similar to the example we studied earlier on processes in the uninterruptible sleep state. Its basic function is simple: it reads 32 MB of data from a disk partition every second and prints out the time taken to read it.

To make it easy for you to run this example, I have packaged it into a Docker image. Similar to the previous example, I provide the following two options for you to adjust the path of the disk partition and the size of I/O according to your system configuration:

  • The -d option sets the path of the disk or partition to be read. The default is to search for disks with the prefix /dev/sd or /dev/xvd.

  • The -s option sets the size of data to be read each time, in bytes. The default is 33554432 (32MB).

This example also requires you to open two terminals. After logging in to the machine via SSH in each of the terminals, first run the cachetop command in the first terminal:

# Refresh the data every 5 seconds
$ cachetop 5

Then, in the second terminal, execute the following command to run the example:

$ docker run --privileged --name=app -itd feisky/app:io-direct

After running the example, we need to run the following command to confirm that the example has started successfully. If everything is normal, you should see similar output as below:

$ docker logs app
Reading data from disk /dev/sdb1 with buffer size 33554432
Time used: 0.929935 s to read 33554432 bytes
Time used: 0.949625 s to read 33554432 bytes

From here, you can see that it takes 0.9 seconds to read 32MB of data. Is this time reasonable? I think your first reaction might be that it’s too slow. Is this because the system cache is not being used?

Let’s check again. Go back to the first terminal and take a look at the output of cachetop. Here, we find the cache usage of the example process app:

16:39:18 Buffers MB: 73 / Cached MB: 281 / Sort: HITS / Order: ascending
PID      UID      CMD              HITS     MISSES   DIRTIES  READ_HIT%  WRITE_HIT%
   21881 root     app                  1024        0        0     100.0%       0.0% 

This output is interesting. All 1024 cache accesses are hits and the read hit rate is 100%, indicating that every read request went through the system cache. But that raises a new question: if all the I/O really goes through the cache, the reads should not be this slow.

However, there is an important factor we seem to have overlooked: the actual amount of data being read. HITS counts the number of cache hits, and each hit corresponds to exactly one page of data.

As mentioned earlier, memory is managed in pages, each 4 KB in size. So, within the 5-second interval, the 1024 hits correspond to 1024 × 4 KB = 4 MB read from the cache, i.e. 0.8 MB per second, which is obviously far from the 32 MB/s of the example application.

As for why cachetop shows only this small amount of fully-hit data, we will explain later. For now, just focus on how to reason from the results.

This further confirms our suspicion that the example is not making full use of the system cache. In fact, we have run into a similar situation before: if an application sets the direct I/O flag on its system calls, it bypasses the system cache.

So, the simplest way to determine whether the application is using direct I/O is to observe its system calls and check the flags it passes when making them. And which tool do we use to observe system calls? Naturally, it's strace again.

Continue to run the following strace command in the second terminal to observe the system calls of the example application. Note that the pgrep command is used here to find the PID of the example process:

# strace -p $(pgrep app)
strace: Process 4988 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
openat(AT_FDCWD, "/dev/sdb1", O_RDONLY|O_DIRECT) = 4
mmap(NULL, 33558528, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f448d240000
read(4, "8vq\213\314\264u\373\4\336K\224\25@\371\1\252\2\262\252q\221\n0\30\225bD\252\266@J"..., 33554432) = 33554432
write(1, "Time used: 0.948897 s to read 33"..., 45) = 45
close(4)                                = 0

From the strace output, we can see that the application opened the disk partition /dev/sdb1 with openat, passing the flags O_RDONLY|O_DIRECT (the vertical bar means the flags are OR-ed together).

O_RDONLY means the file is opened read-only, while O_DIRECT requests direct I/O, bypassing the system cache. (Note that O_DIRECT also requires the user buffer, I/O size, and file offset to be suitably aligned, typically to the block size.)

Having verified this, it is easy to understand why reading 32 MB of data takes so long: reading directly from the disk is naturally far slower than reading from the cache, and avoiding exactly this cost is the cache's main purpose.

After identifying the issue, we can take a look at the source code of the application to further validate:

int flags = O_RDONLY | O_LARGEFILE | O_DIRECT;  /* direct I/O: bypass the page cache */
int fd = open(disk, flags, 0755);

The above code clearly tells us that it did indeed use direct I/O.

Now that we have identified the reason for the slow disk read, optimizing the performance of disk read becomes straightforward. By modifying the source code and removing the O_DIRECT option, we can make the application use cached I/O instead of direct I/O, which will speed up the disk read.
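
Since only the flag line of the original source is shown above, here is a minimal sketch of what the fix amounts to (the actual app-cached.c may differ in detail):

/* Cached (buffered) I/O: without O_DIRECT, every read() goes through
 * the page cache, so repeated reads are served from memory. */
int flags = O_RDONLY | O_LARGEFILE;
int fd = open(disk, flags, 0755);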

app-cached.c is the fixed source code, and I have also packaged it into a container image. In the second terminal, stop the previous strace command with Ctrl+C, then run the following commands to start the fixed version:

# Remove the previous application
$ docker rm -f app

# Run the fixed application
$ docker run --privileged --name=app -itd feisky/app:io-cached

Still in the second terminal, run the following command to view the logs of the new application. You should see the following output:

$ docker logs app
Reading data from disk /dev/sdb1 with buffer size 33554432
Time used: 0.037342 s to read 33554432 bytes
Time used: 0.029676 s to read 33554432 bytes

Now, it only takes 0.03 seconds to read 32 MB of data, which is much faster than the previous 0.9 seconds. Therefore, it should be using the system cache this time.

Let’s go back to the first terminal and check the output of cachetop to confirm:

16:40:08 Buffers MB: 73 / Cached MB: 281 / Sort: HITS / Order: ascending
PID      UID      CMD              HITS     MISSES   DIRTIES  READ_HIT%  WRITE_HIT%
   22106 root     app                 40960        0        0     100.0%       0.0%

Indeed, the read hit rate is still 100%, and HITS (the number of hits) has grown to 40960. Using the same method to calculate: 40960 hits × 4 KB per page is 160 MB over the 5-second interval, which works out to the expected 32 MB/s.

This case demonstrates that utilizing the system cache to its fullest extent can greatly improve performance when performing I/O operations. However, when observing cache hit rates, it is also important to consider the actual I/O size of the application and analyze the usage of the cache comprehensively.

Finally, let's revisit the question raised earlier: before the optimization, why did cachetop show a 100% hit rate for only a small amount of data, rather than a large number of misses? The reason is that cachetop does not count direct I/O in its statistics. This once again shows how important it is to understand the principles behind the tools you use.

The calculation method of cachetop involves I/O principles and some kernel internals. If you want to understand how it works, you can read its source code in the bcc project.

Summary #

Buffers and Cache can greatly improve system I/O performance. Normally, we use cache hit rate to measure the efficiency of cache usage. The higher the hit rate, the more effectively the cache is utilized, and the better the performance of the application.

You can use the cachestat and cachetop tools to observe the cache hit situation of the system and processes. Among them,

  • cachestat provides the read and write hit situation of the entire system cache.

  • cachetop provides the cache hit situation of each process.

However, it should be noted that Buffers and Cache are managed by the operating system, and applications cannot directly control the contents and lifecycles of these caches. Therefore, in application development, specialized cache components are generally used to further improve performance.

For example, the program can explicitly declare memory space using the heap or stack to store the data that needs to be cached. Alternatively, external cache services like Redis can be used to optimize data access efficiency.
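
As a toy illustration of the heap-buffer approach (all names here are hypothetical, and a real cache would also need eviction, locking, and invalidation):

#include <stdio.h>
#include <stdlib.h>

/* Toy read-through cache: load the file once into a heap buffer, then serve
 * every later request from memory without touching the file system. */
static char *cached_data = NULL;
static long  cached_len  = 0;

const char *load_file_cached(const char *path)
{
    if (cached_data)                   /* cache hit: no I/O at all */
        return cached_data;

    FILE *fp = fopen(path, "rb");      /* cache miss: read from the file system */
    if (!fp)
        return NULL;

    fseek(fp, 0, SEEK_END);
    cached_len = ftell(fp);
    rewind(fp);

    cached_data = malloc(cached_len + 1);
    if (cached_data && fread(cached_data, 1, cached_len, fp) == (size_t)cached_len)
        cached_data[cached_len] = '\0';

    fclose(fp);
    return cached_data;
}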

Reflection #

Finally, I would like to leave you with a question to ponder, in order to help you further understand the principles of caching.

The second case study today should be familiar to you, as we used another example of direct I/O in the previous article on uninterruptible processes. However, that time we analyzed it from the perspective of CPU usage and process status. Comparing the analytical approaches from these two different angles, what insights did you gain?

Feel free to discuss with me in the comments section, and share your answers and takeaways. You are also welcome to share this article with your colleagues and friends. Let’s practice in real-world scenarios and make progress through communication.