52 Case Study The Server’s Throughput Drops Significantly, How to Analyze #

Hello, I’m Ni Pengfei.

In the previous section, we learned how to use dynamic tracing to observe the behavior of applications and the kernel. Let's briefly review.

Dynamic tracing means that while the system or application is still running normally, we can use the probes provided in the kernel to dynamically trace their behavior, thus assisting in identifying performance bottlenecks.

With dynamic tracing, we can understand the behavior of applications or kernel dynamically without modifying code or restarting services. This is particularly effective for troubleshooting online issues, especially those that are difficult to reproduce.

In the Linux system, common methods for dynamic tracing include ftrace, perf, eBPF/BCC, and SystemTap.

  • Using perf combined with flame graphs to find hotspot functions is a commonly used performance profiling method that can be used in many scenarios.

  • If this is not enough, eBPF and BCC are the most flexible dynamic tracing methods in the latest kernel versions.

  • In older kernel versions, especially in RHEL systems where eBPF support is limited, SystemTap and ftrace are often better choices.

In the earlier case on increased network request latency, I walked you through the analysis step by step. At that time, we discovered that the latency was caused by the server enabling the TCP Nagle algorithm while the client had delayed acknowledgment enabled.

In addition to latency issues, a drop in service throughput is another common performance problem with network requests. So, how do we analyze such a throughput problem?

Next, I will use Nginx, the most commonly used reverse proxy server, as an example to show you how to analyze a drop in service throughput.

Preparing the Case #

Today’s case requires two virtual machines, still based on Ubuntu 18.04, which is also applicable to other Linux systems. The case environment I am using is as follows:

  • Machine configuration: 2 CPUs, 8GB RAM.

  • Pre-installed tools: docker, curl, wrk, perf, FlameGraph, etc. For example:

# Install necessary utilities such as docker, curl, and perf
$ apt-get install -y docker.io curl build-essential linux-tools-common

# Install FlameGraph tool
$ git clone https://github.com/brendangregg/FlameGraph

# Install wrk
$ git clone https://github.com/wg/wrk
$ cd wrk && make && sudo cp wrk /usr/local/bin/

We have used these tools multiple times in previous cases, so we won’t repeat them here. You can open two terminals, log in to these two virtual machines respectively, and install the above tools.

Note: All commands below are assumed to be executed as the root user. If you are logged in as a regular user, please run the command sudo su root to switch to the root user.

With that, the preparation work is completed. Next, we will officially enter the operational phase.

Case Study #

The case we are going to analyze today is an Nginx + PHP application, and their relationship is shown in the following diagram:

Among them, wrk and curl are Nginx clients, and the PHP application is a simple Hello World:

<?php
    echo "Hello World!";
?>

To facilitate running, I have packaged the case application into two Docker images and pushed them to Docker Hub. You can directly follow the steps below to run it.

At the same time, for convenience of analysis, both containers will run in host network mode. This way, we don’t need to switch to the container’s network namespace and can directly observe their socket status.

First, in Terminal 1, execute the following command to start the Nginx application and listen on port 80. If everything goes well, you should see the following output:

$ docker run --name nginx --network host --privileged -itd feisky/nginx-tp
6477c607c13b37943234755a14987ffb3a31c33a7f04f75bb1c190e710bce19e
$ docker run --name phpfpm --network host --privileged -itd feisky/php-fpm-tp
09e0255159f0c8a647e22cd68bd097bec7efc48b21e5d91618ff29b882fa7c1f

Then, execute the docker ps command to check the status of the containers. You will find that both containers are in the running (Up) state:

$ docker ps
CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
09e0255159f0        feisky/php-fpm-tp   "php-fpm -F --pid /o…"   28 seconds ago      Up 27 seconds                           phpfpm
6477c607c13b        feisky/nginx-tp     "/init.sh"               29 seconds ago      Up 28 seconds                           nginx

However, from the output of docker ps, we can only know that the containers are in running state. Whether Nginx can handle external requests normally still needs further confirmation.

Next, switch to Terminal 2 and execute the following curl command to further verify if Nginx can be accessed normally. If you see the output “Hello World!”, it means the Nginx+PHP application has started successfully:

$ curl http://192.168.0.30
Hello World!

Tip: If you see different results, you can execute docker ps -a again to confirm the status of the container, and execute docker logs <container_name> to view the container logs to find out the reason.

Next, let’s test the throughput of Nginx in the case.

In Terminal 2, continue running the wrk command to test the performance of Nginx:

# The default test time is 10s, and the request timeout is 2s
$ wrk --latency -c 1000 http://192.168.0.30
Running 10s test @ http://192.168.0.30
  2 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    14.82ms   42.47ms 874.96ms   98.43%
    Req/Sec   550.55      1.36k    5.70k    93.10%
  Latency Distribution
     50%   11.03ms
     75%   15.90ms
     90%   23.65ms
     99%  215.03ms
  1910 requests in 10.10s, 573.56KB read
  Non-2xx or 3xx responses: 1910
Requests/sec:    189.10
Transfer/sec:     56.78KB

From the result of wrk, you can see that the throughput (requests per second) is only 189, and all 1910 requests received were abnormal responses (non-2xx or 3xx). These data obviously indicate that the throughput is too low and all request processing failed. What’s going on?

Based on the statistics reported by wrk, the total data transferred is only 573 KB, so the bottleneck is clearly not bandwidth. Therefore, we should analyze from the perspective of the number of requests.

When analyzing the number of requests, especially the number of HTTP requests, do you have any good ideas? Of course, we need to start with the number of TCP connections.

Connection Optimization #

To view the summary information of TCP connections, the preferred tool is naturally the ss command. To observe the problem that occurred during the wrk test, let’s start wrk again in Terminal 2, and extend the total test time to 30 minutes:

# Test time is 30 minutes
$ wrk --latency -c 1000 -d 1800 http://192.168.0.30

Then, go back to Terminal 1 and observe the TCP connections:

$ ss -s
Total: 177 (kernel 1565)
TCP:   1193 (estab 5, closed 1178, orphaned 0, synrecv 0, timewait 1178/0), ports 0

Transport Total     IP        IPv6
*	  1565      -         -
RAW	  1         0         1
UDP	  2         2         0
TCP	  15        12        3
INET	  18        14        4
FRAG	  0         0         0

From here, we can see that when wrk issues 1,000 concurrent connections, only 5 connections are established, while more than 1,100 connections are in the closed and timewait states. In fact, from this output alone, you can already find two problems:

  • One is that the number of established connections is too small;
  • The other is that there are too many connections in the timewait state.
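
If you want a finer breakdown than the ss -s summary, a small illustrative one-liner (not part of the original case; it only assumes ss and awk are installed, as they are in this environment) groups every TCP connection by its state:

# Count TCP connections by state; the first column of "ss -ant" is the state
$ ss -ant | awk 'NR>1 {count[$1]++} END {for (s in count) print s, count[s]}'

On this machine it would show the same picture as ss -s: only a handful of established connections and a large number in timewait.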

When analyzing problems, naturally we should start with the relatively simple ones. Let’s first look at the second problem about timewait. I have already mentioned in the previous NAT case that the connection tracking module in the kernel may cause timewait issues. Today’s case is also based on Docker, and Docker uses iptables, which uses the connection tracking module to manage NAT. So, how can we confirm whether it is a problem caused by connection tracking?

In fact, the simplest method is to use dmesg to view the system logs. If there is a problem with connection tracking, you should see logs related to nf_conntrack. We can continue in Terminal 1 and run the following command to view the system logs:

$ dmesg | tail
[88356.354329] nf_conntrack: nf_conntrack: table full, dropping packet
[88356.354374] nf_conntrack: nf_conntrack: table full, dropping packet

From the logs, you can see the error message nf_conntrack: table full, dropping packet. This indicates that the problem is caused by connection tracking.

In this case, we should recall the two kernel options we learned before: the maximum number of tracked connections, nf_conntrack_max, and the current number of tracked connections, nf_conntrack_count. You can query them with the following commands:

$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 200
$ sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_count = 200

In the output this time, you can see that the maximum connection tracking limit is only 200 and all are already in use. The limit of 200 is obviously too small, but the corresponding optimization is also simple - just increase it.

Let’s execute the following command to increase nf_conntrack_max:

# Increase the connection tracking limit to 1048576
$ sysctl -w net.netfilter.nf_conntrack_max=1048576
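
Keep in mind that sysctl -w only changes the value in the running kernel and will not survive a reboot. If you want the change to persist, the usual approach on a standard sysctl setup (a convenience step, not required for this case, and assuming the nf_conntrack module is loaded when sysctl.conf is applied at boot) is to also record it in /etc/sysctl.conf:

# Persist the setting so it survives a reboot, then reload
$ echo 'net.netfilter.nf_conntrack_max = 1048576' >> /etc/sysctl.conf
$ sysctl -p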

After increasing the connection tracking limit, how much does Nginx's throughput improve? Let's test it again. You can switch to Terminal 2, press Ctrl+C to stop the long test, and then execute the following wrk command to test the performance of Nginx:

# Default test time is 10s, request timeout is 2s
$ wrk --latency -c 1000 http://192.168.0.30
...
  54221 requests in 10.07s, 15.16MB read
  Socket errors: connect 0, read 7, write 0, timeout 110
  Non-2xx or 3xx responses: 45577
Requests/sec:   5382.21
Transfer/sec:      1.50MB

From the output of wrk, you can see that the optimization of connection tracking is very effective, and the throughput has increased from the previous 189 to 5382. It seems that the performance has improved nearly 30 times.

However, can we conclude that we have already optimized the performance of Nginx?

Don't rush. Let's take a look at the other data reported by wrk. Although the total number of requests in 10s has increased to more than 54,000, over 45,000 of the responses are abnormal; in other words, only a little over 8,000 (54221-45577=8644) requests truly succeeded.

Clearly, the majority of the responses are abnormal. So, how do we analyze the problem with the abnormal responses?

Worker Process Optimization #

Since these responses are not Socket errors, it means that Nginx has received the requests. However, the response status codes are not the expected 2xx (indicating success) or 3xx (indicating redirection). Therefore, it is important to figure out Nginx’s true responses in this case.

But this is not difficult either. Let’s switch back to Terminal 1 and execute the following docker command to check the Nginx container logs:

$ docker logs nginx --tail 3
192.168.0.2 - - [15/Mar/2019:22:43:27 +0000] "GET / HTTP/1.1" 499 0 "-" "-" "-"
192.168.0.2 - - [15/Mar/2019:22:43:27 +0000] "GET / HTTP/1.1" 499 0 "-" "-" "-"
192.168.0.2 - - [15/Mar/2019:22:43:27 +0000] "GET / HTTP/1.1" 499 0 "-" "-" "-"

From the Nginx logs, we can see that the response status code is 499.

499 is not a standard HTTP status code, but it is an extension provided by Nginx, indicating that the client has closed the connection before the server has had a chance to respond. In other words, the problem lies with the slow server-side processing, and the client actively disconnects due to a timeout (wrk timeout is set to 2s).

Since the problem is with slow server-side processing and the case itself is an Nginx+PHP application, can we speculate that it is due to slow PHP processing?

We can execute the following docker command in the terminal to check the PHP container logs:

$ docker logs phpfpm --tail 5
[15-Mar-2019 22:28:56] WARNING: [pool www] server reached max_children setting (5), consider raising it
[15-Mar-2019 22:43:17] WARNING: [pool www] server reached max_children setting (5), consider raising it

From this log, we can see two warning messages, “server reached max_children setting (5)”, and it suggests increasing the max_children setting.

max_children represents the maximum number of php-fpm child processes: the larger the number, the more requests can be handled concurrently. However, since each child is a separate process, a larger number also means higher memory and CPU usage, so we cannot set it too large.

Generally, each php-fpm child process may occupy around 20 MB of memory. So, you can estimate a reasonable value based on the available memory and CPU count. Here, I have set it to 20, and repackaged the optimized configuration into a Docker image. You can execute the following command to run it:

# Stop the old containers
$ docker rm -f nginx phpfpm

# Start Nginx and PHP with the new image
$ docker run --name nginx --network host --privileged -itd feisky/nginx-tp:1
$ docker run --name phpfpm --network host --privileged -itd feisky/php-fpm-tp:1
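
If you're curious what this change looks like, pm.max_children is the standard php-fpm pool directive for the child process limit; given the description above, the pool configuration inside the new phpfpm container should contain something like the following (20 children at roughly 20 MB each is only about 400 MB, well within this machine's 8 GB):

# Check the child process limit in the new php-fpm image
$ docker exec phpfpm cat /opt/bitnami/php/etc/php-fpm.d/www.conf | grep '^pm.max_children'
pm.max_children = 20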

Then, we switch to terminal 2 and run the following wrk command to retest Nginx performance:

# The default test duration is 10s, with a request timeout of 2s
$ wrk --latency -c 1000 http://192.168.0.30
...
  47210 requests in 10.08s, 12.51MB read
  Socket errors: connect 0, read 4, write 0, timeout 91
  Non-2xx or 3xx responses: 31692
Requests/sec:   4683.82
Transfer/sec:      1.24MB

From the wrk output, we can see that although the throughput of 4683 is slightly lower than the previous 5382, the number of successful requests during the test has increased significantly, from a little over 8,000 to more than 15,000 (47210-31692=15518).

However, even though performance has improved somewhat, a throughput of just over 4000 is still fairly poor, and the majority of responses are still abnormal. So, how can we further improve the throughput of Nginx?

Socket Optimization #

Recalling the analysis approach for network performance and the principles of the Linux protocol stack, we can gradually analyze from the socket, TCP protocol, and other layers. The first step in analysis is to observe whether packet loss occurs.

We switch to terminal 2 and run the test again, this time with the -d parameter to extend the test duration and simulate the performance bottleneck scenario.

# Test time: 30 minutes
$ wrk --latency -c 1000 -d 1800 http://192.168.0.30

Then go back to Terminal 1 and observe if any socket packet loss occurs:

$ netstat -s | grep socket
    73 resets received for embryonic SYN_RECV sockets
    308582 TCP sockets finished time wait in fast timer
    8 delayed acks further delayed because of locked socket
    290566 times the listen queue of a socket overflowed
    290566 SYNs to LISTEN sockets dropped

# Wait for a while and run again
$ netstat -s | grep socket
    73 resets received for embryonic SYN_RECV sockets
    314722 TCP sockets finished time wait in fast timer
    8 delayed acks further delayed because of locked socket
    344440 times the listen queue of a socket overflowed
    344440 SYNs to LISTEN sockets dropped
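
By the way, instead of running netstat twice by hand, you can let watch highlight the counters that change between refreshes (just a convenience; -d is a standard watch option):

# -d highlights the differences between successive refreshes
$ watch -d 'netstat -s | grep socket'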

Based on the changes in the "listen queue of a socket overflowed" and "SYNs to LISTEN sockets dropped" counters between the two runs, you can see that a large number of SYNs are being dropped because the socket listen queue is overflowing. Therefore, the next step is to check whether the size of the listen queue is abnormal.

You can execute the following command to view the queue size of the socket:

$ ss -ltnp
State     Recv-Q     Send-Q            Local Address:Port            Peer Address:Port
LISTEN    10         10                      0.0.0.0:80                   0.0.0.0:*         users:(("nginx",pid=10491,fd=6),("nginx",pid=10490,fd=6),("nginx",pid=10487,fd=6))
LISTEN    7          10                            *:9000                       *:*         users:(("php-fpm",pid=11084,fd=9),...,("php-fpm",pid=10529,fd=7))

This time, you can see that the maximum listen backlog (Send-Q) of both Nginx and php-fpm is only 10, while Nginx's current listen queue length (Recv-Q) has already reached that maximum and php-fpm's is close to it. Clearly, the socket listen queue is too short and needs to be increased.

Regarding the setting of the socket listening queue length, it can be adjusted through the socket interface in the application program, or it can be configured through kernel options. Let’s continue executing the following commands in Terminal 1 to query the configuration of Nginx and the kernel options for the listening queue length:

# Query Nginx's listening queue length configuration
$ docker exec nginx cat /etc/nginx/nginx.conf | grep backlog
        listen       80  backlog=10;

# Query php-fpm's listening queue length
$ docker exec phpfpm cat /opt/bitnami/php/etc/php-fpm.d/www.conf | grep backlog
; Set listen(2) backlog.
;listen.backlog = 511

# somaxconn is the maximum value of the system-level socket listening queue
$ sysctl net.core.somaxconn
net.core.somaxconn = 10

From the output, you can see that Nginx's backlog and somaxconn are both configured as 10, and php-fpm's listen.backlog is only 511; since the effective backlog is capped by somaxconn, all of these are obviously too small. The optimization is therefore to increase all three. For example, the queue length of Nginx and php-fpm can be increased to 8192, and somaxconn can be increased to 65536.
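
Concretely, the changes applied in the optimized images look roughly like the following (the files are the ones we just queried; the values are the ones suggested above, and you may of course choose different numbers for your own environment):

# nginx.conf: raise the listen backlog
listen 80 backlog=8192;

# php-fpm's www.conf: raise the listen backlog
listen.backlog = 8192

# Kernel: raise the system-wide cap on listen queues
$ sysctl -w net.core.somaxconn=65536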

Similarly, I have repackaged the optimized Nginx into two Docker images. You can execute the following commands to run them:

# Stop the old containers
$ docker rm -f nginx phpfpm

# Start Nginx and PHP using the new images
$ docker run --name nginx --network host --privileged -itd feisky/nginx-tp:2
$ docker run --name phpfpm --network host --privileged -itd feisky/php-fpm-tp:2

Then switch to Terminal 2 and retest the performance of Nginx:

$ wrk --latency -c 1000 http://192.168.0.30
...
  62247 requests in 10.06s, 18.25MB read
  Non-2xx or 3xx responses: 62247
Requests/sec:   6185.65
Transfer/sec:      1.81MB

The throughput has now increased to 6185, and during testing, if you run netstat -s | grep socket again in Terminal 1, you will find that there is no more socket packet loss.

However, this time, all of Nginx’s responses have failed again, all of which are Non-2xx or 3xx. What’s going on? Let’s check Nginx’s logs in Terminal 1:

$ docker logs nginx --tail 10
2019/03/15 16:52:39 [crit] 15#15: *999779 connect() to 127.0.0.1:9000 failed (99: Cannot assign requested address) while connecting to upstream, client: 192.168.0.2, server: localhost, request: "GET / HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "192.168.0.30"

You can see that Nginx reports that it cannot connect to fastcgi, with the error "99: Cannot assign requested address". This corresponds to the error code EADDRNOTAVAIL, which means that an IP address or port number is not available.

Since Nginx is connecting to a fixed address and port (127.0.0.1:9000), the only resource that can run out here is the local port number. Next, let's analyze the port numbers.

Port Number Optimization #

According to the principles of network sockets, when a client connects to a server, it needs to be assigned a temporary (ephemeral) port number, and here Nginx acts as the client of PHP-FPM. The range of port numbers is not unlimited: there are at most a little over 60,000 of them.

By executing the following command, you can query the system’s configured range of temporary port numbers:

$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range=20000 20050

As you can see, the range of temporary ports is only 50, which is obviously too small. The optimization method is easy to think of, just increase this range. For example, you can execute the following command to expand the port number range to “10000 65535”:

$ sysctl -w net.ipv4.ip_local_port_range="10000 65535"
net.ipv4.ip_local_port_range = 10000 65535

After the optimization is complete, switch back to Terminal 2 and test the performance again:

$ wrk --latency -c 1000 http://192.168.0.30/
...
  32308 requests in 10.07s, 6.71MB read
  Socket errors: connect 0, read 2027, write 0, timeout 433
  Non-2xx or 3xx responses: 30
Requests/sec:   3208.58
Transfer/sec:    682.15KB

This time, there are fewer abnormal responses, but the throughput has also dropped to 3208. Moreover, there were many Socket read errors this time. Obviously, further optimization is needed.

Flame Graph #

We have optimized quite a few settings so far. These changes improve the networking behavior, but they can also increase the consumption of other resources. Does this suggest that some other resource has hit a bottleneck?

Let’s execute the following command in Terminal 2 to restart the long-term test:

# Test time: 30 minutes
$ wrk --latency -c 1000 -d 1800 http://192.168.0.30

Then, switch back to Terminal 1 and execute top to observe CPU and memory usage:

$ top
...
%Cpu0  : 30.7 us, 48.7 sy,  0.0 ni,  2.3 id,  0.0 wa,  0.0 hi, 18.3 si,  0.0 st
%Cpu1  : 28.2 us, 46.5 sy,  0.0 ni,  2.0 id,  0.0 wa,  0.0 hi, 23.3 si,  0.0 st
KiB Mem :  8167020 total,  5867788 free,   490400 used,  1808832 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7361172 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20379 systemd+  20   0   38068   8692   2392 R  36.1  0.1   0:28.86 nginx
20381 systemd+  20   0   38024   8700   2392 S  33.8  0.1   0:29.29 nginx
 1558 root      20   0 1118172  85868  39044 S  32.8  1.1  22:55.79 dockerd
20313 root      20   0   11024   5968   3956 S  27.2  0.1   0:22.78 docker-containe
13730 root      20   0       0      0      0 I   4.0  0.0   0:10.07 kworker/u4:0-ev

From the top results, it can be seen that there is still plenty of available memory, but the system CPU usage (sy) is relatively high. The system CPU usage for both CPUs is close to 50%, and the idle CPU usage is only 2%. Looking at the processes, CPU is mainly occupied by two Nginx processes and two Docker-related processes, with usage of around 30% each.

With CPU usage up, how should we analyze it? I believe you still remember the perf tool we have used several times, along with the flame graphs introduced in the previous sections; together they make it easy to identify the hot functions in the system.

Let’s keep wrk running in Terminal 2, and in Terminal 1, execute perf and the flamegraph script to generate a flame graph:

# Execute perf to record events (let it run for a while, then press Ctrl+C to stop)
$ perf record -g

# Switch to the FlameGraph installation directory and execute the following command to generate the flame graph
$ perf script -i ~/perf.data | ./stackcollapse-perf.pl --all | ./flamegraph.pl > nginx.svg
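
If you just want a quick textual look at the hot functions before rendering the flame graph, the same perf.data can also be browsed with perf report (purely optional; the flame graph is easier to read for deep call stacks):

# Interactive text UI over the recorded samples and call graph
$ perf report -i ~/perf.data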

Then, open the generated nginx.svg file with a browser, and you will see the following flame graph:

According to the principles of flame graphs we discussed earlier, this graph should be read from bottom to top, following the widest frames in the call stack, to find the functions that account for the most CPU samples.

The stack in the middle, do_syscall_64 -> tcp_v4_connect -> inet_hash_connect, is obviously the most important area to focus on. inet_hash_connect() is the kernel function responsible for allocating temporary port numbers, so the bottleneck still appears to be in the allocation of temporary ports.

In the previous step of “port number” optimization, the range of temporary port numbers has been optimized to be “10000 65535”. This is obviously a very large range, so why did port number allocation become a bottleneck?

It's fine if you can't think of the reason right away. Let's set it aside for a moment and look at the other factors first. Following inet_hash_connect up the stack, the next hotspot is the __inet_check_established function, whose purpose is to check whether a port number is available. Based on this, you can imagine that if a large number of connections are occupying ports, checking whether a port number is available will naturally consume more CPU, right?

Is this actually the case? We can continue to run the ss command in Terminal 1 to view the connection status statistics:

$ ss -s
TCP:   32775 (estab 1, closed 32768, orphaned 0, synrecv 0, timewait 32768/0), ports 0
...

This time, we can see that there are a large number of connections (32768 here) in the timewait state, and connections in the timewait state continue to occupy port numbers. If these port numbers could be reused, the __inet_check_established step would naturally become shorter. The Linux kernel happens to provide the tcp_tw_reuse option to control exactly this.

In Terminal 1, run the following command to check its configuration:

$ sysctl net.ipv4.tcp_tw_reuse
net.ipv4.tcp_tw_reuse = 0

You can see that tcp_tw_reuse is 0, which means it is disabled. Actually, when we see this, we can understand why the allocation of temporary port numbers has become a hot spot in system operation. Of course, the optimization method is also easy: just set it to 1 to enable it.
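
If you want to try it by hand before switching to the new images, it is a one-line sysctl change. Note that, as far as I know, tcp_tw_reuse only applies to new outgoing (client-side) connections and relies on TCP timestamps being enabled, so the sketch below also checks net.ipv4.tcp_timestamps:

# Allow new outgoing connections to reuse ports held by TIME_WAIT sockets
$ sysctl -w net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_tw_reuse = 1

# tcp_tw_reuse depends on TCP timestamps, which are enabled by default
$ sysctl net.ipv4.tcp_timestamps
net.ipv4.tcp_timestamps = 1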

I have also packaged the optimized application into two Docker images. You can run the following command to run them:

# Stop the old containers
$ docker rm -f nginx phpfpm

# Start Nginx and PHP with the new images
$ docker run --name nginx --network host --privileged -itd feisky/nginx-tp:3
$ docker run --name phpfpm --network host --privileged -itd feisky/php-fpm-tp:3

After the containers are started, switch to Terminal 2 and test the optimized effect again:

$ wrk --latency -c 1000 http://192.168.0.30/
...
  52119 requests in 10.06s, 10.81MB read
  Socket errors: connect 0, read 850, write 0, timeout 0
Requests/sec:   5180.48
Transfer/sec:      1.07MB

Now the throughput has reached more than 5,000 requests per second; only a few Socket read errors remain, and there are no more Non-2xx or 3xx responses. This means everything is finally working as expected.

At the end of the case, don’t forget to execute the following command to delete the case application:

# Stop the nginx and phpfpm containers
$ docker rm -f nginx phpfpm

Summary #

Today, I took you through the analysis of a drop in service throughput. In fact, from this case, you can also see that analyzing performance problems always comes back to the principles of the system and the application.

In practice, mastering and applying these principles is the key to analyzing performance bottlenecks.

  • First, use various performance tools to collect the desired performance metrics to get a clear understanding of the system and the application’s running status.

  • Second, compare the current state with the system principles. The inconsistencies are the objects we need to focus on for analysis.

From this perspective, with the help of dynamic tracing tools such as perf, Flame Graph, and BCC, we can identify hot functions and locate the sources of bottlenecks, thus determining the corresponding optimization methods.

Reflection #

Finally, I would like to invite you to discuss the throughput-drop issues you have encountered. How did you analyze their root causes? And how did you solve them? Feel free to combine today's case with your own experience to summarize your own approach.

Please feel welcome to discuss it with me in the comment section, and also feel free to share this article with your colleagues and friends. Let’s practice in real scenarios and make progress through communication.