44 How-tos: Several Thoughts on Network Performance Optimization (Part Two) #

Hello, I’m Ni Pengfei.

In the previous section, we learned about several strategies for network performance optimization. Let’s do a quick review.

When optimizing network performance, you can follow the Linux network protocol stack and the network transmission process, and optimize layer by layer: the application, sockets, the transport layer, the network layer, and the link layer. In the previous section, we mainly focused on the application and sockets, such as:

  • In the application, we mainly optimized the I/O model, work model, and application layer network protocols.

  • At the socket layer, we mainly optimized the socket’s buffer size.

Today, we will continue exploring how to optimize Linux network performance from the transport layer, network layer, and link layer, following the TCP/IP network model.

Network Performance Optimization #

Transport Layer #

The most important protocols in the transport layer are TCP and UDP, so the optimization here mainly focuses on these two protocols.

Let’s start with optimizing the TCP protocol.

TCP provides a reliable, connection-oriented transport service. To optimize TCP, it is important to first understand its basic principles, such as flow control, slow start, congestion avoidance, delayed acknowledgment, and the TCP state transition diagram.

I won’t go into the details of these principles here. If you haven’t fully grasped them yet, it is recommended to study these basic principles thoroughly before attempting optimization, instead of blindly experimenting.

Once you understand these principles, you can optimize TCP without disrupting its normal operation. Below, I will explain in detail different scenarios.

The first scenario is when there is a large number of connections in the TIME_WAIT state. These connections consume a significant amount of memory and port resources. In this case, you can tune the kernel options related to the TIME_WAIT state, for example by taking the following measures (a sample sysctl sketch follows the list):

  • Increase the maximum number of connections in TIME_WAIT state net.ipv4.tcp_max_tw_buckets, and increase the size of the connection tracking table net.netfilter.nf_conntrack_max.

  • Decrease net.ipv4.tcp_fin_timeout and net.netfilter.nf_conntrack_tcp_timeout_time_wait to release the resources occupied by these connections as soon as possible.

  • Enable port reuse net.ipv4.tcp_tw_reuse. This allows the ports occupied by connections in TIME_WAIT state to be used for new connections.

  • Increase the range of local ports net.ipv4.ip_local_port_range. This supports more connections and improves overall concurrency.

  • Increase the maximum number of file descriptors. You can increase the maximum number of file descriptors for processes and the system using fs.nr_open and fs.file-max, or configure LimitNOFILE in the systemd configuration file of the application to set the maximum number of file descriptors for the application.
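
As a reference, here is a minimal sketch of how these options might be set with sysctl. The values are illustrative only and need to be tuned to your workload; the systemd unit shown in the comment is hypothetical.

```bash
# Tune TIME_WAIT-related options for the running system (illustrative values).
sysctl -w net.ipv4.tcp_max_tw_buckets=1048576
sysctl -w net.netfilter.nf_conntrack_max=1048576   # requires the nf_conntrack module
sysctl -w net.ipv4.tcp_fin_timeout=15
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.ip_local_port_range="10000 65000"

# Raise the file descriptor limits for the whole system.
sysctl -w fs.nr_open=2097152
sysctl -w fs.file-max=2097152

# To persist across reboots, put the same keys in /etc/sysctl.conf and run: sysctl -p
# Per-application limit via systemd (hypothetical unit file):
#   [Service]
#   LimitNOFILE=1048576
```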

The second scenario is mitigating performance problems caused by attacks that exploit characteristics of the TCP protocol, such as SYN flood attacks. In such cases, you can tune the kernel options related to the SYN state, for example by taking the following measures (see the sketch after this list):

  • Increase the maximum number of half-open connections net.ipv4.tcp_max_syn_backlog, or enable TCP SYN Cookies net.ipv4.tcp_syncookies to bypass the limit on the number of half-open connections (note that these two options cannot be used together).

  • Reduce the number of retransmissions of SYN+ACK packets in SYN_RECV state net.ipv4.tcp_synack_retries.
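
Below is a minimal sysctl sketch for these SYN-related options; the values are illustrative, and as noted above, SYN cookies and a larger half-open queue are presented here as alternatives rather than being combined.

```bash
# Enable TCP SYN cookies to ride out SYN floods.
sysctl -w net.ipv4.tcp_syncookies=1

# Or, if not using SYN cookies, enlarge the half-open connection queue instead.
# sysctl -w net.ipv4.tcp_max_syn_backlog=16384

# Retransmit SYN+ACK fewer times while in the SYN_RECV state.
sysctl -w net.ipv4.tcp_synack_retries=2
```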

The third scenario is long-lived connections, where TCP Keepalive is commonly used to detect the state of a connection so that it can be reclaimed automatically once the peer has disconnected. However, the system's default Keepalive probe interval and retry count usually cannot meet the application's performance requirements, so you need to tune the Keepalive-related kernel options, for example (see the sketch after this list):

  • Reduce the interval between the last data packet and the Keepalive probe packet net.ipv4.tcp_keepalive_time.

  • Reduce the interval between sending Keepalive probe packets net.ipv4.tcp_keepalive_intvl.

  • Reduce the number of failed Keepalive probes allowed before the application is notified, net.ipv4.tcp_keepalive_probes.
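
A minimal sketch of the Keepalive tuning; the values are illustrative (the first two options are in seconds).

```bash
# Wait 10 minutes after the last data packet before sending the first probe.
sysctl -w net.ipv4.tcp_keepalive_time=600
# Send subsequent probes every 30 seconds.
sysctl -w net.ipv4.tcp_keepalive_intvl=30
# Consider the connection dead after 3 failed probes.
sysctl -w net.ipv4.tcp_keepalive_probes=3
```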

After explaining these TCP optimization methods, I have also summarized them into a table for your reference (the values are for reference only, and the specific configurations need to be adjusted based on your actual scenarios):

When optimizing TCP performance, it is important to note that using different optimization methods simultaneously can lead to conflicts.

For example, as we discussed in the case of network latency, enabling Nagle’s algorithm on the server side and enabling delayed acknowledgment mechanism on the client side can easily increase network latency.

Furthermore, on servers behind NAT, enabling net.ipv4.tcp_tw_recycle can easily cause all kinds of connection failures. In fact, because of its many pitfalls, this option was removed entirely in kernel version 4.12.

Now that we have covered TCP, let’s move on to optimizing UDP.

UDP provides connectionless datagram transport without reliability guarantees. Compared to TCP, UDP optimization is much simpler. Here are a few common optimization approaches (a sample sketch follows the list):

  • Increase the socket buffer size and UDP buffer range, as mentioned in the previous section on sockets.

  • Increase the range of local port numbers, as mentioned in the previous section on TCP.

  • Adjust the size of UDP packets based on the MTU size to minimize or avoid fragmentation.
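
As a rough sketch, the UDP-related tuning could look like the following; the buffer sizes and port range are illustrative only.

```bash
# Enlarge the maximum socket receive and send buffers.
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# Enlarge the UDP buffer range (min, pressure, max -- in pages).
sysctl -w net.ipv4.udp_mem="378528 504704 757056"

# Widen the local port range, as in the TCP section.
sysctl -w net.ipv4.ip_local_port_range="10000 65000"
```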

Network Layer #

Next, let’s discuss optimization at the network layer.

The network layer is responsible for packet encapsulation, addressing, and routing, including common protocols like IP and ICMP. In the network layer, the main optimizations revolve around routing, IP fragmentation, and ICMP.

The first optimization revolves around routing and forwarding, where you can adjust the following kernel options (a sample sysctl sketch follows the list):

  • Enable IP forwarding on servers that need to forward packets, such as NAT gateways or hosts running Docker containers, by setting net.ipv4.ip_forward = 1.

  • Adjust the Time To Live (TTL) of data packets, for example, set net.ipv4.ip_default_ttl = 64. Note that increasing this value will decrease system performance.

  • Enable reverse address verification for data packets, for example, set net.ipv4.conf.eth0.rp_filter = 1. This can prevent IP spoofing and reduce DDoS issues caused by forged IP addresses.
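
Here is a minimal sysctl sketch for these network layer options, assuming the external interface is eth0 (adjust the interface name and values to your environment).

```bash
# Enable IP forwarding on NAT gateways / Docker hosts.
sysctl -w net.ipv4.ip_forward=1

# Set the default TTL for outgoing packets.
sysctl -w net.ipv4.ip_default_ttl=64

# Enable reverse path filtering on eth0 to drop packets with spoofed source addresses.
sysctl -w net.ipv4.conf.eth0.rp_filter=1
```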

The second method is from the perspective of fragmentation: the most important adjustment here is the size of the Maximum Transmission Unit (MTU).

Usually, the MTU should be set according to the Ethernet standard, which specifies a maximum frame size of 1518B; after removing the 18B of Ethernet framing (14B header plus 4B frame check sequence), the remaining 1500B is the Ethernet MTU.

When using overlay network technologies such as VXLAN and GRE, it is important to note that network overlay increases the size of the original network packets, which requires adjusting the MTU.

For example, taking VXLAN as an example, it adds a 14B Ethernet header, 8B VXLAN header, 8B UDP header, and 20B IP header to the original packet. In other words, each packet increases in size by 50B.

Therefore, we need to increase the MTU of switches, routers, etc. to 1550, or decrease the MTU of VXLAN encapsulation (such as virtual NIC in virtualized environments) to 1450.

In addition, many network devices now support jumbo frames, in which case you can increase the MTU to 9000 to improve network throughput.
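
A minimal sketch of these MTU adjustments with the ip tool; the interface names (eth0, vxlan0) are assumptions, and the larger MTUs only work if the NICs and switches along the path support them.

```bash
# Option 1: raise the underlay MTU so VXLAN-encapsulated frames still fit.
ip link set dev eth0 mtu 1550

# Option 2: lower the MTU of the VXLAN interface instead.
ip link set dev vxlan0 mtu 1450

# With jumbo-frame capable NICs and switches end to end:
ip link set dev eth0 mtu 9000
```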

The third method is from the perspective of ICMP: to avoid network problems such as ICMP host probing and ICMP floods, you can restrict ICMP behavior through kernel options (a sample sysctl sketch follows the list).

  • For example, you can ignore all ICMP echo requests by setting net.ipv4.icmp_echo_ignore_all = 1. This way, external hosts cannot probe the host with ping.

  • Alternatively, you can ignore ICMP echo requests sent to broadcast addresses by setting net.ipv4.icmp_echo_ignore_broadcasts = 1.
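
The corresponding sysctl sketch; apply only the lines that match your security policy.

```bash
# Ignore all ICMP echo requests, so the host no longer answers ping.
sysctl -w net.ipv4.icmp_echo_ignore_all=1

# Ignore ICMP echo requests sent to broadcast addresses.
sysctl -w net.ipv4.icmp_echo_ignore_broadcasts=1
```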

Link Layer #

The link layer sits below the network layer, so finally, let's look at the optimization methods for the link layer.

The link layer is responsible for the transmission of network packets in the physical network, such as MAC addressing, error detection, and transmission of network frames through NICs. Naturally, link layer optimization revolves around these basic functions. Let’s look at several different aspects.

Since the interrupt handlers (especially soft interrupts) triggered by the NIC after receiving packets consume a lot of CPU, scheduling these interrupt handlers across different CPUs can significantly improve network throughput. This can usually be achieved in the following two ways (a sample sketch follows the list).

  • For example, you can configure CPU affinity (smp_affinity) for NIC hard interrupts, or enable the irqbalance service.

  • Another example is to enable Receive Packet Steering (RPS) and Receive Flow Steering (RFS) to schedule the processing of application programs and soft interrupts on the same CPU, which can increase CPU cache hits and reduce network latency.
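
As a rough sketch, interrupt scheduling can be adjusted through procfs and sysfs; the IRQ number (30) and the CPU masks below are purely illustrative, and real NICs usually expose one IRQ per queue.

```bash
# Pin hard IRQ 30 (hypothetical NIC queue IRQ) to CPU2 (mask 0x4),
# after stopping irqbalance so it does not overwrite the setting.
systemctl stop irqbalance
echo 4 > /proc/irq/30/smp_affinity

# Or simply let irqbalance spread interrupts automatically:
# systemctl start irqbalance

# Enable RPS on receive queue 0 of eth0 (CPUs 0-3, mask 0xf) ...
echo f > /sys/class/net/eth0/queues/rx-0/rps_cpus
# ... and RFS, which steers packets toward the CPU running the consuming application.
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 2048  > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
```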

In addition, modern NICs offer rich features that allow work originally done in software by the kernel to be offloaded to the NIC and executed in hardware. The main ones are listed below, followed by an ethtool sketch.

  • TCP Segmentation Offload (TSO) and UDP Fragmentation Offload (UFO): the protocol stack hands large TCP/UDP packets directly to the NIC, and the NIC performs the TCP segmentation (based on the MSS) and UDP fragmentation (based on the MTU).

  • Generic Segmentation Offload (GSO): when the NIC does not support TSO/UFO, the segmentation of TCP packets and fragmentation of UDP packets are deferred until just before the data enters the NIC. This reduces CPU consumption, and in case of packet loss only the lost segments need to be retransmitted.

  • Large Receive Offload (LRO): When receiving segmented TCP packets, the NIC assembles and merges them before handing them over for further network processing. However, LRO should not be enabled when IP forwarding is required, as inconsistent header information in multiple packets can cause checksum errors in the merged network packet.

  • Generic Receive Offload (GRO): GRO improves on the drawbacks of LRO and is more versatile, supporting both TCP and UDP.

  • Receive Side Scaling (RSS): It is also known as multi-queue reception, which assigns multiple CPU cores to handle received network packets based on multiple hardware receive queues.

  • VXLAN Offload: This offloads the packet encapsulation of VXLAN to the NIC.
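
You can check and toggle these offload features with ethtool. Here is a minimal sketch assuming the interface is eth0; which features are actually available depends on the NIC and driver.

```bash
# List the offload features the NIC/driver currently supports and their state.
ethtool -k eth0

# Enable segmentation and receive offloads (feature names as reported by ethtool -k).
ethtool -K eth0 tso on gso on gro on

# Keep LRO off on hosts that forward packets.
ethtool -K eth0 lro off
```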

Finally, there are also many ways to optimize the throughput of the network interface itself (a sketch follows the list).

  • For example, you can enable the multi-queue feature of the network interface. This allows each queue to be assigned a different interrupt number and scheduled to execute on different CPUs, thereby improving network throughput.

  • Another example is to increase the buffer size and queue length of the network interface to enhance network transmission throughput (note that this may increase latency).

  • You can also use Traffic Control tools to configure Quality of Service (QoS) for different network traffic.
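
A minimal sketch of these NIC-level tweaks, again assuming eth0 and illustrative sizes; the tc line is just a simple rate limiting example, not a full QoS policy.

```bash
# Use 8 combined RX/TX queues (check the supported maximum with: ethtool -l eth0).
ethtool -L eth0 combined 8

# Enlarge the RX/TX ring buffers (check limits with: ethtool -g eth0).
ethtool -G eth0 rx 4096 tx 4096

# Traffic Control example: cap egress at 1 Gbit/s with a token bucket filter.
tc qdisc add dev eth0 root tbf rate 1gbit burst 128kb latency 50ms
```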

So far, I have introduced optimization methods for network performance from the application, socket, transport layer, network layer, and link layer perspectives. With these optimizations, network performance can meet the requirements of most scenarios.

Finally, don’t forget about a special case. Do you remember the C10M problem we learned about?

In the scenario of 10 million concurrent connections on a single machine, the various optimizations applied to the Linux network protocol stack have little effect, because at that scale the lengthy processing path of the kernel protocol stack itself becomes the main performance burden.

In this case, there are two ways to optimize it.

The first method is to use DPDK, which bypasses the kernel protocol stack and lets a user-space process handle network requests by polling. It also combines mechanisms such as huge pages, CPU affinity, memory alignment, and pipelining to optimize packet processing efficiency.

The second method is to use the built-in XDP technology in the kernel to process network packets before they enter the kernel protocol stack, which can also achieve good performance.

Summary #

In these two lessons, we have reviewed common methods for optimizing Linux network performance together.

When optimizing network performance, we can combine the network protocol stack and network processing flow of the Linux system and optimize each layer from the application, socket, transport layer, network layer, and link layer.

In fact, when analyzing and locating network bottlenecks, we also rely on these network layers. Once we locate the network performance bottleneck, we can optimize it based on the protocol layer where the bottleneck is located. Specifically:

  • In the application layer, the focus is on optimizing I/O models, work models, and application layer network protocols.

  • In the socket layer, the focus is on optimizing socket buffer sizes.

  • In the transport layer, the focus is on optimizing TCP and UDP protocols.

  • In the network layer, the focus is on optimizing routing, forwarding, fragmentation, and ICMP protocols.

  • Finally, in the link layer, the focus is on optimizing packet transmission and reception, network function offloading, and network card settings.

If these methods still cannot meet your requirements, you can consider using user space methods such as DPDK to bypass the kernel protocol stack, or use XDP to process network packets before they enter the kernel protocol stack.

Reflection #

In this section, I have only listed a few common strategies for optimizing network performance. There must be many other optimization methods, ranging from applications to the system and network devices. I would like to discuss with you: what other optimization methods do you know?

Feel free to discuss with me in the comments section, and feel free to share this article with your colleagues and friends. Let’s practice in real-world scenarios and make progress through communication.