48 Case Study: The Server Constantly Drops Packets, What Should I Do? (Part Two) #

Hello, I’m Ni Pengfei.

In the previous lesson, we learned how to analyze network packet loss, examining in particular the link layer, network layer, and transport layer of the protocol stack.

However, the analysis of those layers still did not reveal the ultimate performance bottleneck, so we need to keep digging. Today, we will continue analyzing this unresolved case.

Before diving in, you may want to review the previous lesson and think it over yourself: besides the link layer, network layer, and transport layer we covered, what other potential problems could cause packet loss?

iptables #

First, we should note that besides the protocols at the network and transport layers, iptables and the kernel’s connection tracking mechanism can also cause packet loss, so they are factors we must consider when troubleshooting packet loss issues.

Let’s first take a look at connection tracking. I have already explained the optimization ideas for connection tracking in the article How to Optimize NAT Performance. To confirm if connection tracking is causing the problem, you only need to compare the current connection tracking count with the maximum connection tracking count.

However, since connection tracking is global to the Linux kernel (it does not belong to any particular network namespace), we need to exit the container terminal and go back to the host to check it.

Run exit in the container terminal, and then run the commands below in the host terminal to check the connection tracking counts:

# Execute 'exit' in the container terminal
root@nginx:/# exit
exit

# Query kernel configuration in the host terminal
$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 262144
$ sysctl net.netfilter.nf_conntrack_count
net.netfilter.nf_conntrack_count = 182

From here, you can see that the connection tracking count is only 182, while the maximum connection tracking count is 262144. Obviously, the packet loss here cannot be caused by connection tracking.
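
If the count had been close to the limit, you would expect the kernel to start dropping new connections. As a quick sanity check (an extra step, not part of the original walkthrough), you can look for the kernel’s “table full” message and, if it appears, raise the limit; the value below is only an example:

# In the host terminal: a full conntrack table logs "nf_conntrack: table full, dropping packet"
$ dmesg | grep -i nf_conntrack

# If that message shows up, raising the limit is one remedy (example value only)
$ sysctl -w net.netfilter.nf_conntrack_max=1048576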

Next, let’s take a look at iptables. To recap how iptables works: it is built on the Netfilter framework and uses a series of rules to filter network packets (for example, firewalling) and modify them (for example, NAT).

These iptables rules are organized into several tables, including filter (for filtering), nat (for NAT), mangle (for modifying packet data), and raw (for raw packets). Each table contains a number of chains that group and manage the rules.
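
If you want a quick overview of all the rules across these tables, one option (just a convenience, not required for this case) is iptables-save, which dumps every table in one go; with -c it also prints each rule’s packet and byte counters:

# In the relevant network namespace: dump all tables, including per-rule counters
$ iptables-save -c

# Or inspect a single table's chains, for example the nat table
$ iptables -t nat -nvL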

For packet loss, the most likely culprit is that packets are being dropped by rules in the filter table. To confirm this, we need to check whether rules with packet-discarding targets, such as DROP and REJECT, are actually being hit.

You could list all the iptables rules and match them against the characteristics of the packets being sent and received. But obviously, if there are many rules, this approach is inefficient.

Of course, a simpler method is to directly check the statistics of the DROP and REJECT rules and see whether they are zero. If a counter is non-zero, extract the corresponding rules for analysis.
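
For a quick first pass, one way to do that (a rough sketch; run it inside the network namespace you are inspecting, which here means inside the Nginx container) is to filter the counter output for those two targets, optionally zeroing the counters first so that only currently active rules show up:

# Keep only the DROP/REJECT lines from the filter table's counter output
$ iptables -t filter -nvL | grep -E 'DROP|REJECT'

# Optional: reset the counters, wait a moment, then list them again
$ iptables -t filter -Z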

We can use iptables -nvL to view the statistics of each rule. For example, run the docker exec command below to enter the container terminal, and then run the iptables command to view the statistics of the filter table:

# Execute in the host
$ docker exec -it nginx bash

# Execute in the container
root@nginx:/# iptables -t filter -nvL
Chain INPUT (policy ACCEPT 25 packets, 1000 bytes)
 pkts bytes target     prot opt in     out     source               destination
    6   240 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            statistic mode random probability 0.29999999981

Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination

Chain OUTPUT (policy ACCEPT 15 packets, 660 bytes)
 pkts bytes target     prot opt in     out     source               destination
    6   264 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            statistic mode random probability 0.29999999981

From the iptables output, you can see that the counters for the two DROP rules are non-zero, and that the rules sit in the INPUT and OUTPUT chains respectively. Apart from the chain they are in, the two rules are identical: both use the statistic module to randomly drop 30% of packets.

Now look at the matching criteria for these rules: 0.0.0.0/0 matches all source and destination IP addresses, which means 30% of all packets are dropped at random. This looks like the cause of our packet loss.
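
For reference, a rule like these would typically have been created with a command along the following lines. This is shown only to illustrate how the statistic match works; do not actually run it here:

# Illustration only: randomly drop about 30% of incoming packets
# (-m statistic --mode random --probability 0.30 selects packets at random)
$ iptables -I INPUT -m statistic --mode random --probability 0.30 -j DROP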

Now that we have identified the cause, the fix is relatively simple: for example, we can simply delete these two DROP rules. Run the following two iptables commands in the container terminal:

root@nginx:/# iptables -t filter -D INPUT -m statistic --mode random --probability 0.30 -j DROP
root@nginx:/# iptables -t filter -D OUTPUT -m statistic --mode random --probability 0.30 -j DROP
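
To be safe, you can list the filter table once more and confirm that the two DROP rules are gone:

# The DROP entries should no longer appear under INPUT and OUTPUT
root@nginx:/# iptables -t filter -nvL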

After deleting them, is the problem resolved? Let’s switch to Terminal 2 and re-run the hping3 command we used earlier to find out:

$ hping3 -c 10 -S -p 80 192.168.0.30
HPING 192.168.0.30 (eth0 192.168.0.30): S set, 40 headers + 0 data bytes
len=44 ip=192.168.0.30 ttl=63 DF id=0 sport=80 flags=SA seq=0 win=5120 rtt=11.9 ms
len=44 ip=192.168.0.30 ttl=63 DF id=0 sport=80 flags=SA seq=1 win=5120 rtt=7.8 ms
...
len=44 ip=192.168.0.30 ttl=63 DF id=0 sport=80 flags=SA seq=9 win=5120 rtt=15.0 ms

--- 192.168.0.30 hping statistic ---
10 packets transmitted, 10 packets received, 0% packet loss
round-trip min/avg/max = 3.3/7.9/15.0 ms

As you can see from the output, there is no packet loss now, and the latency fluctuation is also small. It seems that the packet loss problem has been resolved.

However, so far we have only used hping3 to verify that the Nginx server is listening correctly on port 80; we have not yet accessed its HTTP service. So let’s not declare the optimization finished just yet, and instead confirm whether Nginx can actually respond to HTTP requests.

Still in Terminal 2, execute the following curl command to check Nginx’s response to HTTP requests:

$ curl --max-time 3 http://192.168.0.30
curl: (28) Operation timed out after 3000 milliseconds with 0 bytes received

From the curl output, you can see that the request timed out this time. However, we just verified with hping3 that the port is working, so could it be that Nginx suddenly crashed?

Let’s run hping3 again to confirm:

$ hping3 -c 3 -S -p 80 192.168.0.30
HPING 192.168.0.30 (eth0 192.168.0.30): S set, 40 headers + 0 data bytes
len=44 ip=192.168.0.30 ttl=63 DF id=0 sport=80 flags=SA seq=0 win=5120 rtt=7.8 ms
len=44 ip=192.168.0.30 ttl=63 DF id=0 sport=80 flags=SA seq=1 win=5120 rtt=7.7 ms
len=44 ip=192.168.0.30 ttl=63 DF id=0 sport=80 flags=SA seq=2 win=5120 rtt=3.6 ms

--- 192.168.0.30 hping statistic ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 3.6/6.4/7.8 ms

Strange: the hping3 results show that port 80 of Nginx is still behaving normally. What should we do now? Don’t forget, we still have a powerful tool at our disposal: packet capture. It’s time to capture some packets and take a look.

tcpdump #

Next, let’s switch back to Terminal 1 and execute the following tcpdump command in the container terminal to capture packets on port 80:

root@nginx:/# tcpdump -i eth0 -nn port 80
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes

Then switch to Terminal 2 and run the curl command again:

$ curl --max-time 3 http://192.168.0.30/
curl: (28) Operation timed out after 3000 milliseconds with 0 bytes received

After the curl command finishes, switch back to Terminal 1 and view the tcpdump output:

14:40:00.589235 IP 10.255.255.5.39058 > 172.17.0.2.80: Flags [S], seq 332257715, win 29200, options [mss 1418,sackOK,TS val 486800541 ecr 0,nop,wscale 7], length 0
14:40:00.589277 IP 172.17.0.2.80 > 10.255.255.5.39058: Flags [S.], seq 1630206251, ack 332257716, win 4880, options [mss 256,sackOK,TS val 2509376001 ecr 486800541,nop,wscale 7], length 0
14:40:00.589894 IP 10.255.255.5.39058 > 172.17.0.2.80: Flags [.], ack 1, win 229, options [nop,nop,TS val 486800541 ecr 2509376001], length 0
14:40:03.589352 IP 10.255.255.5.39058 > 172.17.0.2.80: Flags [F.], seq 76, ack 1, win 229, options [nop,nop,TS val 486803541 ecr 2509376001], length 0
14:40:03.589417 IP 172.17.0.2.80 > 10.255.255.5.39058: Flags [.], ack 1, win 40, options [nop,nop,TS val 2509379001 ecr 486800541,nop,nop,sack 1 {76:77}], length 0

After this series of operations, from the output of tcpdump, we can see the following:

  • The first three packets are normal TCP handshake packets, no problem there.
  • But the fourth packet arrives only after 3 seconds, and it is a FIN packet sent by the client (VM2), indicating that the client closed the connection.

Given the 3-second timeout (--max-time 3) we set for curl, you can probably guess that curl gave up after the timeout and exited, closing the connection.

I’ve drawn this exchange as a TCP interaction diagram (actually a Flow Graph from Wireshark), which makes the problem described above easy to see.
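
If you would like to reproduce such a flow graph yourself (an optional extra, not one of the original steps), you can write the capture to a file instead of printing it, copy the file out, and open it in Wireshark, where the graph is available under Statistics > Flow Graph:

# In the container: save the capture to a pcap file
root@nginx:/# tcpdump -i eth0 -nn port 80 -w /tmp/nginx.pcap

# On the host: copy the file out of the container (container name assumed to be nginx)
$ docker cp nginx:/tmp/nginx.pcap .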

What’s strange is that we did not capture the HTTP GET request sent by curl at all. So, did the network card drop the packet, or did the client never send it?

We can re-run the netstat -i command to check if there are any dropped packets on the network card:

root@nginx:/# netstat -i
Kernel Interface table
Iface      MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0       100      157      0    344 0            94      0      0      0 BMRU
lo       65536        0      0      0 0             0      0      0      0 LRU

From the netstat output, you can see that the receive-side drop counter (RX-DRP) is 344, so packets were indeed dropped during reception. But this raises a question: why was nothing dropped when we used hping3, yet the packets cannot get through now that we use an HTTP GET?
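
(If you want to double-check the counter itself, an alternative way to read the same receive-drop statistic is the iproute2 tooling; this is purely a cross-check, not a required step.)

# The "dropped" field in the RX block should match netstat's RX-DRP
root@nginx:/# ip -s link show dev eth0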

As I mentioned before, when you run into a phenomenon you don’t understand, it is always a good idea to go back to the principles of the tools and methods involved. Let’s compare the two tools:

  • hping3 actually only sends SYN packets.
  • curl, on the other hand, completes the TCP handshake and then sends an HTTP GET request.

An HTTP GET request is, at the transport layer, still just a TCP packet; but unlike a bare SYN, it also carries the HTTP request data, which makes it considerably larger.

With that comparison in mind, you can probably see that the cause could be a misconfigured MTU. Why?

If you look closely at the netstat output above, the second column shows each interface’s MTU. The MTU of eth0 is only 100, while the default MTU for Ethernet is 1500, so 100 is clearly far too small.
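
To make the size argument concrete, here is a rough back-of-the-envelope check (header sizes are typical values, not exact measurements from this capture). With an MTU of 100, each IP packet on eth0 can be at most 100 bytes including headers. A bare SYN, roughly 20 bytes of IP header plus up to 40 bytes of TCP header with options, fits comfortably. The GET request, however, carried about 75 bytes of HTTP data (the FIN in the capture has seq 76), plus roughly 52 bytes of IP and TCP headers, for a total well over 100 bytes, so the interface dropped it on receipt, which is exactly what RX-DRP recorded. You can also confirm the MTU directly:

# The mtu value appears on the first line of the output (100 before the fix)
root@nginx:/# ip link show dev eth0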

Of course, the MTU problem is easy to fix: just change it back to 1500. Continuing in the container terminal, run the following command to set the MTU of the container’s eth0 to 1500:

root@nginx:/# ifconfig eth0 mtu 1500
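
If ifconfig is not available in your container image (many minimal images ship without net-tools), the iproute2 equivalent should work just as well. Note that either way the change is not persistent; making it stick depends on how the container network was created.

# Same effect as the ifconfig command above
root@nginx:/# ip link set dev eth0 mtu 1500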

After the modification is complete, switch back to Terminal 2 and run the curl command again to confirm if the problem is really resolved:

$ curl --max-time 3 http://192.168.0.30/
<!DOCTYPE html>
<html>
...
<p><em>Thank you for using nginx.</em></p>
</body>
</html>

It took some effort, but this time we finally see the familiar Nginx response, which means the packet loss problem has been completely resolved.

Of course, before we wrap up the case, don’t forget to stop today’s Nginx application. Switch back to Terminal 1 and run exit to leave the container terminal:

root@nginx:/# exit
exit

Finally, execute the following docker command to stop and remove the Nginx container:

$ docker rm -f nginx

Summary #

Today, we continued analyzing the problem of network packet loss. Especially when packets are dropped only intermittently, locating and fixing the issue takes patient, systematic work.

The damage packet loss can do is clear. When you encounter it, start from Linux’s packet transmit and receive path and analyze each layer in turn, guided by the principles of the TCP/IP protocol stack.

Reflection #

Finally, I’d like to invite you to talk about the packet loss issues you have encountered: how did you analyze the root cause, and how did you solve them? Feel free to summarize your own thinking, drawing on what we covered here.

Feel free to discuss with me in the comment section, and also feel free to share this article with your colleagues and friends. Let’s practice in real scenarios and improve through communication.