
42 Case Study Optimizing NAT Performance Part Two #

Hello, I’m Ni Pengfei.

In the previous section, we learned about the principles of NAT and how to manage NAT rules in Linux. Let’s do a quick review.

NAT technology allows the rewriting of the source IP or destination IP of IP packets, making it a common solution for the shortage of public IP addresses. It enables multiple hosts in a network to access external resources through sharing a single public IP address. Additionally, NAT also provides security isolation for machines in a LAN by hiding the internal network.

NAT in Linux is implemented based on the kernel’s connection tracking module. Therefore, while maintaining the state of each connection, it also has a certain impact on network performance. So, what should we do when encountering NAT performance issues?

Next, I will guide you through a case study to learn the analytical approach to NAT performance problems.

Case preparation #

The following case is still based on Ubuntu 18.04 and is also applicable to other Linux systems. The environment I used for the case is as follows:

  • Machine configuration: 2 CPUs, 8GB memory.

  • Pre-installed tools: docker, tcpdump, curl, ab, SystemTap, etc. For example:

    Ubuntu #

    $ apt-get install -y docker.io tcpdump curl apache2-utils

    CentOS #

    $ curl -fsSL https://get.docker.com | sh
    $ yum install -y tcpdump curl httpd-tools

You should be familiar with most of the tools. Here I’ll give a brief introduction to SystemTap.

SystemTap is a dynamic tracing framework for Linux. It converts user-provided scripts into kernel modules to execute and is used to monitor and trace the behavior of the kernel. You don’t need to delve deep into its principles for now. I’ll introduce more about it later. Here, you just need to know how to install it:

# Ubuntu
apt-get install -y systemtap-runtime systemtap
# Configure ddebs source
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-updates main restricted universe multiverse
deb http://ddebs.ubuntu.com $(lsb_release -cs)-proposed main restricted universe multiverse" | \
sudo tee -a /etc/apt/sources.list.d/ddebs.list
# Install dbgsym
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys F2EDC64DC5AEE1F6B9C621F0C8CAB6595FDFF622
apt-get update
apt install ubuntu-dbgsym-keyring
stap-prep
apt-get install linux-image-`uname -r`-dbgsym

# CentOS
yum install systemtap kernel-devel yum-utils kernel
stap-prep
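
Once installed, you can verify that SystemTap works with a minimal one-line probe (just a quick sanity check; it prints a message and exits):

$ stap -e 'probe begin { printf("SystemTap works\n"); exit() }'
SystemTap works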

This case again uses the most common Nginx, stress tested with ab as its client. Two virtual machines are used in the case, and I’ve drawn a diagram to show their relationship.

Next, we open two terminals and log in to the two machines via SSH respectively (the following steps assume the terminal number is consistent with the VM number in the diagram) and install the mentioned tools. Note that curl and ab only need to be installed on the client VM (i.e., VM2).

Just like in previous cases, all the commands below are assumed to be executed as the root user. If you’re logging into the system as a regular user, run the sudo su root command to switch to the root user.

If you encounter any problems during the installation process, I encourage you to search for a solution on your own first. If you still can’t solve it, you can ask me in the comments section. If you have installed them before, you can ignore this point.

Next, let’s move on to the case.

Case Study #

To compare the performance issues caused by NAT, we first run an Nginx service without NAT and test its performance using ab.

In terminal one, run the following command to start Nginx. The --network=host option means the container uses the host network mode, so no NAT is involved:

$ docker run --name nginx-hostnet --privileged --network=host -itd feisky/nginx:80

Then, in terminal two, execute the curl command to confirm that Nginx is started successfully:

$ curl http://192.168.0.30/
...
<p><em>Thank you for using nginx.</em></p>
</body>
</html>

Continue in terminal two and execute the ab command to stress test Nginx. However, before the test, take note that Linux allows a relatively small number of open file descriptors by default. For example, on my machine, this value is only 1024:

# open files
$ ulimit -n
1024

Therefore, before running the ab command, increase this limit, for example to 65536:

# temporarily increase the maximum file descriptor number for the current session
$ ulimit -n 65536
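
Note that ulimit only affects the current shell session. If you want the higher limit to survive new logins, one common approach (a sketch, assuming the default pam_limits setup) is to add nofile entries to /etc/security/limits.conf:

# /etc/security/limits.conf (excerpt)
# note: the * wildcard does not cover the root user; add explicit root entries if needed
*       soft    nofile    65536
*       hard    nofile    65536
root    soft    nofile    65536
root    hard    nofile    65536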

Next, execute the ab command to perform the stress test:

# -c represents the concurrency level of 5000, -n represents the total number of requests as 100,000
# -r represents continuing on socket receive errors, -s represents setting the timeout for each request as 2s
$ ab -c 5000 -n 100000 -r -s 2 http://192.168.0.30/
...
Requests per second:    6576.21 [#/sec] (mean)
Time per request:       760.317 [ms] (mean)
Time per request:       0.152 [ms] (mean, across all concurrent requests)
Transfer rate:          5390.19 [Kbytes/sec] received

Connection Times (ms)
                  min  mean[+/-sd] median   max
Connect:        0  177 714.3      9    7338
Processing:     0   27  39.8     19     961
Waiting:        0   23  39.5     16     951
Total:          1  204 716.3     28    7349
...

Regarding the interpretation of the ab output, I have already explained it in the article How to evaluate network performance of a system, so please review it if you have forgotten. From this output, you can see the following:

  • Requests per second: 6576
  • Time per request: 760ms
  • Connect time: 177ms

Remember these values, as they will serve as the baseline metrics for the rest of the case study.

Note that the results on your machine may be different from mine, but it doesn’t matter as it won’t affect the subsequent case study.

Next, go back to terminal one and stop the Nginx application without using NAT:

$ docker rm -f nginx-hostnet

Then, execute the following command to start today’s case application. The case application listens on port 8080 and uses DNAT to map the host’s port 8080 to the container’s port 8080:

$ docker run --name nginx --privileged -p 8080:8080 -itd feisky/nginx:nat

After Nginx starts, you can execute the iptables command to confirm that the DNAT rule has been created:

$ iptables -nL -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
DOCKER     all  --  0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL

...

Chain DOCKER (2 references)
target     prot opt source               destination
RETURN     all  --  0.0.0.0/0            0.0.0.0/0
DNAT       tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:8080 to:172.17.0.2:8080

You can see that in the PREROUTING chain, all requests whose destination address is a local address are forwarded to the DOCKER chain. Then, in the DOCKER chain, TCP requests with destination port 8080 are DNATed to port 8080 of 172.17.0.2, which is the IP address of the Nginx container.
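
For reference, Docker creates this DNAT rule automatically when the port is published; a simplified, manually created equivalent would look roughly like this (a sketch only; the container IP 172.17.0.2 comes from the output above and will differ in your environment):

# redirect TCP traffic arriving on port 8080 to the container
$ iptables -t nat -A PREROUTING -p tcp --dport 8080 -j DNAT --to-destination 172.17.0.2:8080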

Next, switch to terminal two and execute the curl command to confirm that Nginx has started successfully.

$ curl http://192.168.0.30:8080/
...
<p><em>Thank you for using nginx.</em></p>
</body>
</html>

Then, execute the ab command again, but this time, change the port number in the request to 8080:

# -c indicates the number of concurrent requests is 5000, -n indicates the total number of requests is 100,000
# -r indicates that the requests should continue even if socket errors occur, -s sets the timeout for each request to 2s
$ ab -c 5000 -n 100000 -r -s 2 http://192.168.0.30:8080/
...
apr_pollset_poll: The timeout specified has expired (70007)
Total of 5602 requests completed

As expected, the ab command that was running fine earlier has now failed and reported a connection timeout error. The -s parameter used when running ab sets the timeout for each request to 2s, and from the output, it can be seen that only 5602 requests were completed this time.

Since we’re trying to get the test results from ab, let’s try increasing the timeout, maybe to 30s. To get the results faster, we can also reduce the total number of tests to 10,000:

$ ab -c 5000 -n 10000 -r -s 30 http://192.168.0.30:8080/
...
Requests per second:    76.47 [#/sec] (mean)
Time per request:       65380.868 [ms] (mean)
Time per request:       13.076 [ms] (mean, across all concurrent requests)
Transfer rate:          44.79 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0 1300 5578.0      1   65184
Processing:     0 37916 59283.2      1  130682
Waiting:        0    2   8.7      1     414
Total:          1 39216 58711.6   1021  130682
...

Looking at the output of ab again, the results show:

  • Requests per second: 76
  • Time per request: 65s
  • Connection delay (Connect): 1300ms

Clearly, all the metrics are much worse than before.

So, what would you do if you encountered this problem? You can analyze it yourself based on the previous explanation and continue learning from the following content.

In the previous section, we used tcpdump to capture packets and find the root cause of the increased latency. In today’s case, we can still use a similar method to look for clues. However, this time let’s try a different approach, as we already know the root cause of the problem - NAT.

Recalling the flow of network packets and the principles of NAT in Netfilter, you will find that at least two steps are required to ensure that NAT works correctly:

  • First, use the hook function in Netfilter to modify the source address or destination address.
  • Second, use the conntrack module to associate requests and responses of the same connection.

Could there be a problem in either of these two places? Let’s try using the dynamic tracing tool SystemTap to find out.

Since today’s case is a stress-test scenario in which throughput dropped sharply, and we already suspect NAT is the culprit, we have reason to believe that packets are being dropped in the kernel.

We can go back to terminal one, create a script file called dropwatch.stp, and write the following content:

#! /usr/bin/env stap

############################################################
# Dropwatch.stp
# Author: Neil Horman
# An example script to mimic the behavior of the dropwatch utility
# http://fedorahosted.org/dropwatch
############################################################

# Array to hold the list of drop points we find
global locations

# Note when we turn the monitor on and off
probe begin { printf("Monitoring for dropped packets\n") }
probe end { printf("Stopping dropped packet monitor\n") }

# increment a drop counter for every location we drop at
probe kernel.trace("kfree_skb") { locations[$location] <<< 1 }

# Every 5 seconds report our drop locations
probe timer.sec(5)
{
  printf("\n")
  foreach (l in locations-) {
    printf("%d packets dropped at %s\n",
           @count(locations[l]), symname(l))
  }
  delete locations
}

Tracking packet drops in the kernel #

This script tracks the calls to the kernel function kfree_skb() and counts the locations where packets are dropped. After saving the file, you can run the drop tracking script using the following stap command. stap is a command-line tool for SystemTap:

$ stap --all-modules dropwatch.stp
Monitoring for dropped packets

When you see the “Monitoring for dropped packets” output from the probe begin, it means that SystemTap has compiled the script into a kernel module and started running it.

Next, switch to terminal 2 and run the ab command again:

$ ab -c 5000 -n 10000 -r -s 30 http://192.168.0.30:8080/

Then, go back to terminal 1 and observe the output of the stap command:

10031 packets dropped at nf_hook_slow
676 packets dropped at tcp_v4_rcv

7284 packets dropped at nf_hook_slow
268 packets dropped at tcp_v4_rcv

You will notice that a large number of packet drops occur at the nf_hook_slow location. When you see this name, you should be able to deduce that there is a packet drop issue in the Netfilter Hook’s hook function. However, it is not yet determined if it is NAT. To further track the execution of nf_hook_slow, we can use perf.

Switch back to terminal 2 and run the ab command again:

$ ab -c 5000 -n 10000 -r -s 30 http://192.168.0.30:8080/

Then, switch back to terminal 1 and execute the perf record and perf report commands:

# Sample all CPUs with call graphs for 30s (the sleep ends the recording automatically)
$ perf record -a -g -- sleep 30

# Generate the report
$ perf report -g graph,0

In the perf report interface, enter the search command /, then in the pop-up dialog, enter nf_hook_slow. Finally, expand the call stack to get the following call graph:

Call Graph

From this graph, we can see that there are three places where nf_hook_slow is called the most: ipv4_conntrack_in, br_nf_pre_routing, and iptable_nat_ipv4_in. In other words, nf_hook_slow primarily performs three actions:

  1. When receiving network packets, it looks up the connection in the connection tracking table and allocates a tracking object (bucket) for new connections.
  2. It forwards packets in the Linux bridge. This is because Nginx in the case study is a Docker container, and the container’s network is implemented using a bridge.
  3. When receiving network packets, it performs DNAT, which forwards packets received on port 8080 to the container.

At this point, we have identified three sources of performance degradation. These three sources are all mechanisms in the Linux kernel, so the next step is to optimize from within the kernel.

Based on the content of various resource modules we covered earlier, we know that the Linux kernel provides users with a large number of configurable options. These options can be viewed and modified through the proc file system or sys file system. In addition, you can use the sysctl command-line tool to view and modify kernel configurations.
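
As a quick illustration, both interfaces expose the same options. For instance, the conntrack limit that we will examine in a moment can be read either way (the value shown is from this case environment):

$ cat /proc/sys/net/netfilter/nf_conntrack_max
1000
$ sysctl net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_max = 1000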

For example, today’s topic is DNAT, and the foundation of DNAT is conntrack, so let’s see what conntrack configuration options the kernel provides.

Continue executing the following command in terminal 1:

$ sysctl -a | grep conntrack
net.netfilter.nf_conntrack_count = 180
net.netfilter.nf_conntrack_max = 1000
net.netfilter.nf_conntrack_buckets = 65536
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
...

You can see that the three most important of these are:

  • net.netfilter.nf_conntrack_count, which represents the current number of connection tracking entries.
  • net.netfilter.nf_conntrack_max, which represents the maximum number of connection tracking entries.
  • net.netfilter.nf_conntrack_buckets, which represents the size of the connection tracking table.

Therefore, this output tells us that the current number of connection tracking entries is 180, the maximum number is 1000, and the size of the connection tracking table is 65536.

Recall the previous ab command: the concurrency is 5,000 and the total number of requests is 100,000. Obviously, a tracking table that can hold only 1,000 connections is far from sufficient.

In fact, when the kernel is working abnormally, it logs the exception information. For example, in the previous ab test, the kernel has already logged the error “nf_conntrack: table full”. You can see it by running the dmesg command:

$ dmesg | tail
[104235.156774] nf_conntrack: nf_conntrack: table full, dropping packet
[104243.800401] net_ratelimit: 3939 callbacks suppressed
[104243.800401] nf_conntrack: nf_conntrack: table full, dropping packet
[104262.962157] nf_conntrack: nf_conntrack: table full, dropping packet

Here, net_ratelimit indicates that a large number of log messages were suppressed; this is a kernel measure to guard against log flooding. And when you see the "nf_conntrack: table full" error, it means that nf_conntrack_max is too small.

So, can we simply enlarge the connection tracking table? Before tuning it, you first need to understand that the connection tracking table is actually a hash table in memory. If the number of tracked connections is too large, it will also consume a lot of memory.

In fact, the nf_conntrack_buckets we saw above is exactly the size of this hash table. Each entry in the hash table is a linked list (called a bucket), and the average list length equals nf_conntrack_max divided by nf_conntrack_buckets.

For example, we can estimate the memory used by the connection tracking table with the above configuration:

# connection tracking object size is 376 bytes, list entry size is 16 bytes
nf_conntrack_max * object size + nf_conntrack_buckets * list entry size
= 1000*376 + 65536*16 B
≈ 1.4 MB

Next, let's increase nf_conntrack_max, for example to 131072 (that is, twice nf_conntrack_buckets):

$ sysctl -w net.netfilter.nf_conntrack_max=131072
$ sysctl -w net.netfilter.nf_conntrack_buckets=65536
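
Using the same per-object sizes as in the estimate above, we can also gauge how much memory the enlarged table will need; it is still modest:

# same assumptions: object size 376 bytes, list entry size 16 bytes
131072*376 + 65536*16 B
= 50331648 B
≈ 48 MB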

Then switch back to terminal two and run the ab command again. Note that this time we also change the timeout back to the original 2s:

$ ab -c 5000 -n 100000 -r -s 2 http://192.168.0.30:8080/
...
Requests per second:    6315.99 [#/sec] (mean)
Time per request:       791.641 [ms] (mean)
Time per request:       0.158 [ms] (mean, across all concurrent requests)
Transfer rate:          4985.15 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0  355 793.7     29    7352
Processing:     8  311 855.9     51   14481
Waiting:        0  292 851.5     36   14481
Total:         15  666 1216.3    148   14645

Sure enough, now you can see:

  • Requests per second: 6315 (vs. 6576 without NAT);
  • Time per request: 791ms (vs. 760ms without NAT);
  • Connect time: 355ms (vs. 177ms without NAT).

These results are already much better than the previous test and quite close to the original baseline without NAT.

However, you may still be curious: what exactly does the connection tracking table contain, and how are its entries refreshed?

In fact, you can use the conntrack command-line tool to inspect the contents of the connection tracking table. For example:

# -L lists the entries, -o extended shows the extended format
$ conntrack -L -o extended | head
ipv4     2 tcp      6 7 TIME_WAIT src=192.168.0.2 dst=192.168.0.96 sport=51744 dport=8080 src=172.17.0.2 dst=192.168.0.2 sport=8080 dport=51744 [ASSURED] mark=0 use=1
ipv4     2 tcp      6 6 TIME_WAIT src=192.168.0.2 dst=192.168.0.96 sport=51524 dport=8080 src=172.17.0.2 dst=192.168.0.2 sport=8080 dport=51524 [ASSURED] mark=0 use=1

From this output, you can see that each object in the connection tracking table includes the protocol, connection state, source IP, source port, destination IP, destination port, tracking state, and so on. Since the format is fixed, we can use tools such as awk and sort to analyze it statistically.

For example, still using ab: after starting the ab command in terminal two, go back to terminal one and run the following commands:

# count the total number of connection tracking entries
$ conntrack -L -o extended | wc -l
14289

# count connection tracking entries by TCP state
$ conntrack -L -o extended | awk '/^.*tcp.*$/ {sum[$6]++} END {for(i in sum) print i, sum[i]}'
SYN_RECV 4
CLOSE_WAIT 9
ESTABLISHED 2877
FIN_WAIT 3
SYN_SENT 2113
TIME_WAIT 9283

# count connection tracking entries by source IP
$ conntrack -L -o extended | awk '{print $7}' | cut -d "=" -f 2 | sort | uniq -c | sort -nr | head -n 10
  14116 192.168.0.2
    172 192.168.0.96

Here we counted the total number of connection tracking entries, the number of entries in each TCP state, and the number of entries per source IP. You can see that most of the TCP connection tracking entries are in the TIME_WAIT state, and most of them come from the IP address 192.168.0.2 (that is, VM2, which runs the ab command).

These TIME_WAIT connection tracking entries are cleaned up once they time out; the default timeout is 120s, which you can check with the following command:

$ sysctl net.netfilter.nf_conntrack_tcp_timeout_time_wait
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120

So, if you have a very large number of connections, you should indeed consider reducing this timeout appropriately.
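
For example, a minimal sketch, assuming a 30s timeout suits your workload (sysctl -w takes effect immediately but does not persist across reboots; add the setting to /etc/sysctl.conf to make it permanent):

$ sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30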

Besides these common settings, conntrack offers many other configuration options. You can tune them as needed by referring to the nf_conntrack documentation.

Summary #

Today, I taught you how to troubleshoot and optimize performance issues caused by NAT.

Since NAT is implemented based on the connection tracking mechanism of the Linux kernel, when analyzing NAT performance issues, we can start by analyzing conntrack. For example, we can use tools like systemtap and perf to analyze the conntrack operations in the kernel. Then, we can optimize by adjusting the parameters of netfilter kernel options.

In fact, this NAT implementation on Linux, which is based on the connection tracking mechanism, is also often referred to as stateful NAT, and maintaining state introduces high performance costs.

Therefore, in addition to adjusting kernel behavior, in scenarios where state tracking is not necessary (such as when only mapping based on predefined IP and port without dynamic mapping), we can also use stateless NAT (such as using tc or developing based on DPDK) to further improve performance.
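
As a rough illustration of the stateless idea, the kernel's tc nat action can rewrite addresses packet by packet without keeping any connection state. The sketch below uses hypothetical addresses; check tc-nat(8) for the exact syntax supported by your iproute2 version:

# rewrite the destination 192.0.2.1 to 10.0.0.1 for incoming packets (example addresses)
$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 parent ffff: protocol ip prio 10 u32 \
    match ip dst 192.0.2.1 action nat ingress 192.0.2.1 10.0.0.1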

Reflection #

Finally, here’s a thought-provoking question for you. Have you ever encountered performance issues caused by NAT? How did you identify and analyze the root cause? And in the end, how did you optimize and resolve it? Feel free to combine today’s case study and summarize your thoughts.

Please feel free to discuss with me in the comments section, and also welcome you to share this article with your colleagues and friends. Let’s practice in real-world scenarios and progress through communication.