
43 How-to: Several Thoughts on Network Performance Optimization (Part One) #

Hello, I’m Ni Pengfei.

In the previous section, we learned the principles of Network Address Translation (NAT), how to troubleshoot performance issues caused by NAT, and the basic approaches to optimizing NAT performance. Let me give you a brief review.

NAT builds on the Linux kernel's connection tracking mechanism to rewrite IP addresses and port numbers, and is mainly used to alleviate the shortage of public IP addresses.

When analyzing NAT performance issues, you can start from the kernel's connection tracking module (conntrack), using tools such as systemtap, perf, and netstat, together with the kernel options in the proc file system, to analyze the behavior of the network protocol stack. You can then carry out actual optimizations by tuning kernel options, switching to stateless NAT, using DPDK, and so on.

From the previous learning, you should have realized that networking issues are more complex than the CPU, memory, or disk I/O issues we studied earlier. Whether it is the various I/O models at the application layer, the lengthy network protocol stack and its numerous kernel options, or the variety of complex network environments, all of these add to the complexity of networking.

However, don’t worry too much. As long as you master the basic principles of Linux networking and the working processes of common network protocols, combined with performance indicators of each network layer for analysis, you will find that identifying network bottlenecks is not difficult.

Once you have identified network performance bottlenecks, the next step is optimization—how to reduce network latency and improve network throughput. After learning the relevant principles and cases, I will talk about the approaches and some considerations for optimizing network performance issues.

Due to the extensive content of network optimization approaches, we will divide it into two sections for learning. Today, let’s start with the first part.

Define Optimization Goals #

Just like optimizing CPU and I/O performance, before optimizing network performance, I must first ask myself: what are the goals of network performance optimization? In other words, what level of network performance indicators do we want to achieve?

In fact, although the overall goal of network performance optimization is to reduce network latency (such as RTT) and improve throughput (such as BPS and PPS), the specific optimization standards for each metric may vary, and the priority order may also vary greatly.

Take the NAT gateway from the previous section as an example. Since it directly affects the network ingress and egress performance of the entire data center, a NAT gateway usually needs to achieve or approach line-rate forwarding, which makes PPS the main performance goal.

For systems like databases and caches, completing network transmission quickly, i.e., low latency, is the primary performance goal.

For web services that we frequently access, both throughput and latency need to be taken into account.

Therefore, in order to evaluate the optimization effect more objectively and reasonably, we should first clarify the optimization standards and perform benchmark tests on the system and applications to obtain the baseline performance of each layer of the network protocol stack.

In How to Evaluate the Network Performance of a System, I have already introduced the methods for network performance testing. Let’s briefly review them. The Linux network protocol stack is the core principle we need to grasp. It is a layered structure based on the TCP/IP protocol family, and I’ll use a picture to represent this structure.

With this understanding, when conducting benchmark tests we can test each layer of the protocol stack. Since the lower layers are the foundation of the higher layers, the performance of a lower layer determines that of the layers above it; in other words, the performance indicators of the lower layers are the performance limits of the higher layers. Let's look at this from the bottom up.

First come the network interface layer and the network layer, which are mainly responsible for packet encapsulation, addressing, routing, and sending and receiving. For them, the number of packets processed per second (PPS) is the most important performance metric, especially for small packets. You can use the kernel's built-in packet generator, pktgen, to test PPS performance, as sketched below.
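For reference, here is a minimal sketch of driving pktgen through its /proc interface from C. It assumes the pktgen module is already loaded (modprobe pktgen), that the test NIC is eth0, and that the destination IP and MAC are placeholders you would replace with your own; it is simply the programmatic equivalent of echoing the same commands into those files by hand, and must run as root.

```c
/* Sketch: configure and start the kernel pktgen module via /proc.
 * Assumes: modprobe pktgen was run, eth0 exists, run as root. */
#include <stdio.h>

static void pg_write(const char *path, const char *cmd)
{
    FILE *f = fopen(path, "w");
    if (f) {
        fprintf(f, "%s\n", cmd);  /* pktgen parses each command on write */
        fclose(f);
    }
}

int main(void)
{
    /* Bind eth0 to the pktgen kernel thread for CPU 0 */
    pg_write("/proc/net/pktgen/kpktgend_0", "rem_device_all");
    pg_write("/proc/net/pktgen/kpktgend_0", "add_device eth0");

    /* 64-byte packets: the worst case (and usual target) for PPS testing */
    pg_write("/proc/net/pktgen/eth0", "count 1000000");
    pg_write("/proc/net/pktgen/eth0", "pkt_size 64");
    pg_write("/proc/net/pktgen/eth0", "dst 192.168.0.2");           /* placeholder IP */
    pg_write("/proc/net/pktgen/eth0", "dst_mac 11:22:33:44:55:66"); /* placeholder MAC */

    /* Start sending; this write blocks until the run completes */
    pg_write("/proc/net/pktgen/pgctrl", "start");
    return 0;
}
```

After the run finishes, the results, including the measured PPS, can be read back from /proc/net/pktgen/eth0.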

Moving up to the transport layer of TCP and UDP, they are mainly responsible for network transmission. For them, throughput (BPS), number of connections, and latency are the most important performance indicators. You can use tools like iperf or netperf to test the performance of the transport layer.

However, it should be noted that the size of the network packets will directly affect the values of these indicators. Therefore, usually, you need to test the performance of a series of network packets of different sizes.

Finally, when we reach the application layer, the most important indicators to focus on are throughput (BPS), number of requests per second, and latency. You can use tools like wrk, ab, etc., to test the performance of the application.

However, here’s something to note: the test scenarios should try to simulate the production environment as much as possible to make the tests more valuable. For example, you can record the actual request patterns in the production environment and then replay them in the test environment.

In conclusion, based on these benchmark indicators, combined with the observed performance bottlenecks, we can clearly define the goals of performance optimization.

Network Performance Tools #

Just like before, I suggest organizing and memorizing network-related performance tools from two different perspectives: metrics and tools.

From the perspective of network performance metrics, you will be able to associate performance tools with the workings of the system, giving you a macro understanding and grasp of performance issues. This way, when you want to view a specific performance metric, you will know which tools to use.

Here, I have created a table of the tools that provide each network performance metric, making it easier for you to sort out the relationships and to understand and memorize them. You can save and print it out for easy reference. Of course, you can also use it as a "metrics → tools" guide.

Now let’s look at the second perspective, starting from the performance tools. This will allow you to quickly get started with using the tools and swiftly identify the performance metrics you want to observe. Especially in situations where tools are limited, we must make the most of the tools at hand, using a small number of tools to extract a large amount of information.

Similarly, I have compiled a table of commonly used tools, making it easy for you to distinguish and understand them. Naturally, you can also use it as a "tools → metrics" guide and refer to it when needed.

Network Performance Optimization #

In general, to optimize network performance, first obtain a network benchmark test report, and then use relevant performance tools to identify network performance bottlenecks. The optimization work that follows will be straightforward.

Of course, optimizing network performance is inseparable from the Linux system's network protocol stack and its network send/receive process. You can review this knowledge with the following diagram.

Next, let’s take a look at the basic ideas for network performance optimization from several perspectives, including the application, sockets, transport layer, network layer, and link layer.

Application #

Applications typically use socket interfaces for network operations. Since network transmission is usually time-consuming, application optimization mainly focuses on network I/O and optimizing the working mode of the process itself.

In fact, we have already learned about this in the article “C10K and C1000K Review” before. Here’s a brief review.

From the perspective of network I/O, there are two main optimization strategies:

The first is the most commonly used I/O multiplexing technique, epoll, which mainly replaces select and poll. This is the key solution to the C10K problem and the default mechanism employed by many network applications (a minimal epoll sketch follows these two strategies).

The second is using Asynchronous I/O (AIO). AIO allows the application to initiate multiple I/O operations without waiting for them to complete. When the I/O completes, the system will notify the application of the results through event notification. However, using AIO is more complex, and you need to carefully handle many edge cases.
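To make the first strategy concrete, here is a minimal sketch of an epoll event loop in C. It assumes listen_fd is an already-created, non-blocking listening socket, and it omits error handling and the actual request processing:

```c
/* Minimal epoll-based event loop (sketch; error handling omitted). */
#include <sys/epoll.h>
#include <sys/socket.h>

#define MAX_EVENTS 64

void event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Block until at least one fd is ready: one thread watches them all */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                /* New connection: accept it and watch it for readability */
                int conn = accept(listen_fd, NULL, NULL);
                ev.events = EPOLLIN;
                ev.data.fd = conn;
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &ev);
            } else {
                /* Read the request, write the response, close when done */
            }
        }
    }
}
```

The key point is that a single thread can monitor thousands of connections with one epoll_wait() call, instead of scanning every descriptor on each iteration the way select() and poll() do.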

Regarding the working mode of the process, there are also two different models for optimization:

The first model is the Main Process + Multiple Worker Processes. In this model, the main process is responsible for managing network connections, while worker processes are responsible for the actual business processing. This is also the most commonly used model.

The second model is the Multi-Process Model with a shared listening port. In this model, all processes listen on the same port and enable the SO_REUSEPORT option, allowing the kernel to distribute the request load to these listening processes.
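As an illustration of the second model, here is a hedged C sketch of the listening setup each worker process would run. SO_REUSEPORT requires Linux 3.9 or later; error handling and the accept loop are omitted:

```c
/* Sketch: each worker opens its own listening socket on the same port
 * with SO_REUSEPORT; the kernel load-balances incoming connections. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int reuseport_listen(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    /* Every worker sets SO_REUSEPORT before bind(), so they can all
     * bind to the same address:port pair. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, SOMAXCONN);
    return fd;  /* each process then accept()s on its own fd */
}
```

Because each process owns an independent listening socket, there is no contention on a shared accept queue; the kernel hashes incoming connections across the listening sockets.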

In addition to network I/O and the process working model, optimizing the application-layer network protocols is also crucial. I have summarized several common optimization methods below.

  • Using persistent connections instead of short-lived connections can significantly reduce the cost of TCP connection establishment. This approach is very effective when there are many requests per second.

  • Caching infrequently changing data in memory (or by similar means) can reduce the frequency of network I/O and improve the application's response speed.

  • Serializing data with formats such as Protocol Buffers can reduce the volume of data sent over network I/O, thereby increasing the application's throughput.

  • Using DNS caching, prefetching, HTTPDNS, or other approaches can reduce DNS resolution latency and improve the overall speed of network I/O.

Sockets #

Sockets shield the differences between different protocols in the Linux kernel and provide a unified access interface for applications. Each socket has a read and write buffer.

  • The read buffer caches data sent by the remote end. If the read buffer is full, no new data can be received.

  • The write buffer caches data to be sent out. If the write buffer is full, the write operation of the application will be blocked.

Therefore, to improve network throughput, you usually need to adjust the sizes of these buffers (a sketch for inspecting the current settings follows this list). For example:

  • Increase the maximum buffer size allowed for each socket, net.core.optmem_max;

  • Increase the maximum socket receive buffer size, net.core.rmem_max, and the maximum send buffer size, net.core.wmem_max;

  • Increase the TCP receive buffer size, net.ipv4.tcp_rmem, and the TCP send buffer size, net.ipv4.tcp_wmem.
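Before changing any of these, it helps to know what your system currently uses. The sketch below (assuming a Linux /proc file system) simply reads back the knobs mentioned above:

```c
/* Sketch: print the current values of the buffer-related kernel options
 * discussed above by reading them from /proc/sys (Linux only). */
#include <stdio.h>

int main(void)
{
    const char *knobs[] = {
        "/proc/sys/net/core/optmem_max",
        "/proc/sys/net/core/rmem_max",
        "/proc/sys/net/core/wmem_max",
        "/proc/sys/net/ipv4/tcp_rmem",  /* three values: min default max */
        "/proc/sys/net/ipv4/tcp_wmem",  /* three values: min default max */
    };
    char buf[256];
    for (unsigned i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++) {
        FILE *f = fopen(knobs[i], "r");
        if (f && fgets(buf, sizeof(buf), f))
            printf("%s = %s", knobs[i], buf);
        if (f)
            fclose(f);
    }
    return 0;
}
```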

As for socket kernel options, I have organized them into a table for your reference when needed:

But there are a few points to note.

  • The three values of tcp_rmem and tcp_wmem are min, default, and max. The system automatically adjusts the size of the TCP receive/send buffers within these bounds.

  • The three values of udp_mem are min, pressure, and max. The system automatically adjusts the size of the UDP send buffer based on these settings.

Of course, the values in the table are only for reference; the actual values should be determined from your real network conditions. For example, the ideal send buffer size is the bandwidth-delay product of the link, i.e. throughput × latency, which achieves maximum network utilization: a 100 Mb/s link with 10 ms of latency calls for a buffer of about 12.5 MB/s × 0.01 s = 125 KB.

In addition, the socket interface also provides some configuration options to modify the behavior of network connections (a combined sketch follows this list):

  • Setting TCP_NODELAY for a TCP connection can disable the Nagle algorithm.

  • Enabling TCP_CORK for a TCP connection allows small packets to be aggregated into larger packets before sending (note that this delays the sending of small packets).

  • You can adjust the size of the socket send buffer and receive buffer using SO_SNDBUF and SO_RCVBUF, respectively.
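Putting these options together, here is an illustrative C sketch of tuning a single connection. The fd is assumed to be an already-created TCP socket, the 1 MB buffer sizes are placeholder values, and error handling is omitted:

```c
/* Sketch: tweaking per-connection behavior with setsockopt(). */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

void tune_connection(int fd)
{
    int on = 1;

    /* Disable Nagle: send small writes immediately (latency-sensitive apps) */
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));

    /* The opposite trade-off: TCP_CORK aggregates small writes into
     * full-sized packets, at the cost of delaying them.
     * setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on)); */

    /* Per-socket send/receive buffer sizes (the kernel may double these;
     * they are capped by net.core.wmem_max / net.core.rmem_max). */
    int sndbuf = 1 * 1024 * 1024;  /* illustrative 1 MB value */
    int rcvbuf = 1 * 1024 * 1024;
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
}
```

Note the tension between TCP_NODELAY and TCP_CORK: the first favors latency, the second favors throughput, so the right choice depends on the optimization goal you defined at the start.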

Summary #

Today, we have reviewed common methods for optimizing Linux network performance.

When optimizing network performance, you can combine the network protocol stack and the network send/receive process of the Linux system, and then optimize layer by layer from the application, socket, transport layer, network layer to the link layer.

Of course, our analysis and identification of network bottlenecks are also based on these. Once the performance bottleneck is identified, optimization can be done based on the protocol layer where the bottleneck exists. For example, today we learned about optimization ideas for the application and socket:

  • In the application, the main focus is on optimizing I/O models, working models, and application layer network protocols.

  • In the socket layer, the main focus is on optimizing the socket buffer size.

For the optimization methods of other network layers, I suggest you think about them first. In the next section, we will summarize them together.

Reflection #

Finally, I'd like to invite you to chat: when you run into network performance problems, how do you solve them? You can summarize your own approach from the aspects of applications, sockets, and so on, combining it with what we discussed today.

Feel free to discuss with me in the comments section, and to share this article with your colleagues and friends. Let's practice in real scenarios and improve through communication.