
12 Basics How Various Configurations Affect the TCP Transmission Process #

Hello, I’m Shao Yafang. In this lesson, let’s talk about the factors that can interfere with TCP data transmission.

The process of receiving and sending TCP packets is also a common source of problems. Receiving packets refers to the process of data reaching the network card and then being handled by the application program. Sending packets refers to the process from the application program calling the send function to the packet being sent out from the network card. You may be familiar with some common issues that can arise during the TCP packet receiving and sending process, such as:

  • Why do excessive network card interrupts consume so much CPU and keep interrupting the application?
  • Why can’t the application program send out packets even when write() or send() is called?
  • Why does the application program not receive packets even though the data packet has been received by the network card?
  • Why doesn’t adjusting the buffer size take effect?
  • Is the packet loss caused by the kernel buffer being full? How can I observe it?

To solve these issues, you need to understand the factors that can affect the TCP packet transmission process. This process involves many configuration options, and many problems arise from mismatches between these configuration options and the business scenarios.

Let’s first take a look at the packet sending process and the configuration options that can affect it.

What factors can affect the sending process of TCP packets? #

Sending Process of TCP Packets

The diagram above shows a simplified view of the TCP packet sending process. When the application calls the write(2) or send(2) system call to send data, the kernel copies the data from the user buffer into the TCP send buffer. The TCP send buffer has a limited size, and this limit is a common source of problems.
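To make this concrete, here is a minimal sketch (not from the original lesson) of how a full send buffer surfaces to the application: on a non-blocking socket, send(2) returns -1 with errno set to EAGAIN/EWOULDBLOCK, while a blocking socket simply sleeps in the kernel until space frees up. The socket setup is assumed; only the handling around send() matters here.

#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>

/* Try to copy buf into the TCP send buffer.
 * Assumes fd is a connected, non-blocking TCP socket. */
static ssize_t send_some(int fd, const void *buf, size_t len)
{
    ssize_t n = send(fd, buf, len, 0);

    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* The TCP send buffer is full, so the kernel accepted nothing.
         * A blocking socket would instead sleep in sk_stream_wait_memory()
         * (the kernel function we probe with SystemTap later in this lesson)
         * until the peer ACKs data and frees up buffer space. */
        fprintf(stderr, "send buffer full, retry later\n");
        return 0;
    }
    return n; /* bytes copied into the send buffer, or -1 on other errors */
}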

The size of the TCP send buffer is determined by the parameter net.ipv4.tcp_wmem:

net.ipv4.tcp_wmem = 8192 65536 16777216

In tcp_wmem, the three numbers are min, default, and max. The TCP send buffer starts at default and is dynamically adjusted by the kernel between min and max, without any intervention from the application. The goal of this auto-tuning is to satisfy the sender’s needs while wasting as little memory as possible.

The max value in tcp_wmem cannot exceed the net.core.wmem_max configuration item; if it does, the effective maximum size of the TCP send buffer is capped at net.core.wmem_max. Therefore, we usually set net.core.wmem_max to be greater than or equal to the max value in net.ipv4.tcp_wmem:

net.core.wmem_max = 16777216

The size of the TCP send buffer should be tuned to match the server’s load. In general, the default values need to be increased; the tcp_wmem values listed above are already increased, and they are what we configure in our production environment.

I increased these values because we ran into significant latency problems in production caused by a TCP send buffer that was too small. You can observe this problem in the kernel with tools such as SystemTap, by probing the sk_stream_wait_memory function:

# sndbuf_overflow.stp
# Usage:
# $ stap sndbuf_overflow.stp
probe kernel.function("sk_stream_wait_memory")
{
    printf("%d %s TCP send buffer overflow\n",
         pid(), execname())
}

If this probe fires, it means the TCP send buffer is too small, and you need to keep increasing wmem_max and tcp_wmem:max.

Sometimes the application knows exactly how much data it needs to send, and therefore how large the TCP send buffer needs to be. In that case, it can set a fixed buffer size with the SO_SNDBUF option of setsockopt(2). Once this is done, tcp_wmem no longer applies to that socket: the buffer size stays fixed and the kernel will not adjust it dynamically.

However, the value set via SO_SNDBUF cannot exceed net.core.wmem_max; if you ask for more, the kernel silently caps it at net.core.wmem_max. So if you want to use SO_SNDBUF, make sure net.core.wmem_max is large enough, otherwise your setting will not fully take effect. In general, though, we do not set the TCP send buffer with SO_SNDBUF and instead rely on the kernel’s tcp_wmem auto-tuning, because a fixed SO_SNDBUF that is too large wastes memory, while one that is too small leaves the buffer insufficient.
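Here is a minimal, illustrative sketch (not from the lesson) of setting a fixed send buffer with SO_SNDBUF and reading back what the kernel actually granted. Note that the kernel doubles the requested value to leave room for bookkeeping overhead, so getsockopt(2) reports roughly twice what you asked for (see socket(7)); the 1 MB size below is just an assumption.

#include <stdio.h>
#include <sys/socket.h>

/* Request a fixed send buffer; this disables kernel auto-tuning for the socket.
 * Assumes sockfd is a TCP socket created with socket(AF_INET, SOCK_STREAM, 0). */
static void set_fixed_sndbuf(int sockfd)
{
    int requested = 1 * 1024 * 1024;   /* 1 MB, illustrative only */
    int granted = 0;
    socklen_t len = sizeof(granted);

    /* The kernel caps the request at net.core.wmem_max, then doubles it
     * internally to account for bookkeeping overhead. */
    if (setsockopt(sockfd, SOL_SOCKET, SO_SNDBUF,
                   &requested, sizeof(requested)) < 0) {
        perror("setsockopt(SO_SNDBUF)");
        return;
    }

    getsockopt(sockfd, SOL_SOCKET, SO_SNDBUF, &granted, &len);
    printf("requested %d bytes, kernel granted %d bytes\n", requested, granted);
}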

Additionally, if you follow recent Linux developments, you have surely heard of eBPF. You can also use eBPF to set SO_SNDBUF and SO_RCVBUF, and thereby the sizes of the TCP send and receive buffers. As with setsockopt(2), values set through eBPF should not exceed wmem_max and rmem_max. However, when the ability to set these buffers was first added to eBPF, the maximum limits were not enforced. I found this issue while using the feature and submitted a patch to the community to fix it. If you are interested, you can check this link: bpf: sock recvbuf must be limited by rmem_max in bpf_setsockopt().
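For reference, here is a rough sketch of what such an eBPF program can look like, assuming a BPF_PROG_TYPE_SOCK_OPS program that you attach to a cgroup; the 4 MB size, the callbacks chosen, and the hand-defined socket constants (copied from asm-generic/socket.h) are all illustrative assumptions, not values from the lesson.

// SPDX-License-Identifier: GPL-2.0
/* Sketch of a sock_ops program that pins the buffer sizes of new
 * TCP connections via bpf_setsockopt(). Illustrative only. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define SOL_SOCKET  1
#define SO_SNDBUF   7
#define SO_RCVBUF   8

SEC("sockops")
int set_tcp_bufs(struct bpf_sock_ops *skops)
{
    int bufsize = 4 * 1024 * 1024;  /* 4 MB, illustrative assumption */

    switch (skops->op) {
    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
    case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
        /* Like SO_SNDBUF/SO_RCVBUF from user space, these requests are
         * still clamped by wmem_max/rmem_max (after the fix mentioned above). */
        bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize));
        bpf_setsockopt(skops, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));
        break;
    }
    return 1;
}

char _license[] SEC("license") = "GPL";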

tcp_wmem and wmem_max are per-TCP-connection limits, and both are expressed in bytes. A system may hold a very large number of TCP connections, and too many of them can exhaust memory, so the kernel also limits the total memory consumed by all TCP connections:

net.ipv4.tcp_mem = 8388608 12582912 16777216

We usually adjust this configuration as well. Unlike the previous two options, its values are in units of pages (4 KB each), so the max of 16777216 pages above corresponds to 64 GB. It also has three values: min, pressure, and max. Once the total memory consumed by all TCP connections reaches max, new packets can no longer be sent because of this limit.

We can also observe packets failing to be sent, or transmission jitter, caused by the tcp_mem limit. To make such issues easy to observe, the Linux kernel provides a built-in static tracepoint: sock_exceed_buf_limit. Originally, however, this tracepoint only covered running out of receive buffer, not send buffer. I later submitted a patch so that it can also be used to observe send-buffer exhaustion: net: expose sk wmem in sock_exceed_buf_limit tracepoint. To use it, you only need to enable the tracepoint (requires kernel version 4.16+):

$ echo 1 > /sys/kernel/debug/tracing/events/sock/sock_exceed_buf_limit/enable

Then check if the event occurs:

$ cat /sys/kernel/debug/tracing/trace_pipe

If there is log output (i.e. the event occurs), it means you need to adjust tcp_mem or disconnect some TCP connections.

After the TCP layer has processed the packet, it moves on to the IP layer. The configuration option that tends to cause trouble here is net.ipv4.ip_local_port_range, which defines the range of local ports used when establishing connections to other servers. In our production environment we have run into failures to create new connections because the default port range was too small, so we usually widen it:

net.ipv4.ip_local_port_range = 1024 65535

To perform flow control on TCP/IP traffic, the Linux kernel implements qdisc (queueing discipline) at the IP layer; the tc tool we commonly use is built on top of qdisc. The qdisc queue length is the txqueuelen value shown by ifconfig. In our production environment we have also seen packets dropped because txqueuelen was too small. You can check for this with the following command:

$ ip -s -s link ls dev eth0
…
TX: bytes  packets  errors  dropped  carrier  collsns
    3263284  25060    0       0        0        0

If the dropped value is not 0, it is likely caused by the txqueuelen being too small. When encountering this situation, you need to increase the value, such as increasing the txqueuelen of the eth0 network interface:

$ ifconfig eth0 txqueuelen 2000

or using the ip tool:

$ ip link set eth0 txqueuelen 2000

After adjusting the txqueuelen value, you need to continue monitoring to see if it alleviates the packet loss issue. This also helps you adjust it to a suitable value.

The default qdisc for Linux systems is pfifo_fast (First In, First Out), and usually we do not need to adjust it. If you want to use TCP BBR to improve TCP congestion control, you need to adjust it to fq (fair queue):

net.core.default_qdisc = fq
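If you do switch to BBR, the qdisc change above is usually paired with selecting BBR as the congestion control algorithm; this companion setting is not part of the lesson’s option list and assumes your kernel ships the tcp_bbr module:

net.ipv4.tcp_congestion_control = bbr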

After the IP layer, the packets are handed to the network card and transmitted onto the wire. At that point, the data you wanted to send has traversed the entire TCP/IP protocol stack and is on its way to the other side.

Next, let’s take a look at how the packets are received and which configuration options affect the receiving process.

What Factors Affect the Reception Process of TCP Packets? #

The reception process of TCP packets can also be represented simply with a diagram:

From the above diagram, we can see that the reception process of TCP packets is similar to the sending process, but in reverse. When a packet arrives at the network card, it triggers an interrupt (IRQ) to inform the CPU to read the packet. However, in high-performance network scenarios, the number of packets can be very large. If an interrupt is generated for each packet, it will significantly reduce CPU efficiency. To address this, the NAPI (New API) mechanism was introduced to allow the CPU to poll (or process) multiple packets at once, improving efficiency and reducing the performance impact of network card interrupts.

But how many packets can be polled at once during the polling process? This number can be controlled through the sysctl option:

net.core.netdev_budget = 600

The default value of this control option is 300. In scenarios with high network throughput, this value can be increased, for example, to 600. By increasing this value, more packets can be processed at once. However, this adjustment also has its drawbacks, as it increases the time the CPU spends polling. If the system is running many tasks, the scheduling delay of other tasks will increase.

Next, let’s continue looking at the reception process of TCP packets. As mentioned earlier, when packets arrive at the network card, the CPU polls the packets, which then proceed to the IP layer for processing and then to the TCP layer. At this point, another problematic area comes into play: the TCP Receive Buffer.

Similar to the send buffer, the size of the TCP receive buffer is configurable; by default it is governed by tcp_rmem. Again, we can raise the defaults for better network performance, adjusting them to the following values:

net.ipv4.tcp_rmem = 8192 87380 16777216

It has three fields: min, default, and max. The size of the TCP receive buffer is dynamically adjusted between min and max. However, unlike the send buffer, this dynamic adjustment can be disabled through the control option tcp_moderate_rcvbuf. Normally, we keep it enabled, which is also the default value:

net.ipv4.tcp_moderate_rcvbuf = 1

The receive buffer has a switch for auto-tuning while the send buffer does not because the TCP receive buffer determines the receive window advertised to the peer, which directly influences the sender’s behavior. This switch therefore gives you more flexible control over how the sender behaves.

Besides tcp_moderate_rcvbuf, which controls dynamic adjustment of the TCP receive buffer, you can also control the buffer through the SO_RCVBUF option of setsockopt(), just as with the TCP send buffer. If the application sets SO_RCVBUF, dynamic adjustment of the receive buffer is disabled and its size is fixed at the value specified by SO_RCVBUF.

In other words, only when tcp_moderate_rcvbuf is set to 1 and the application does not configure the buffer size through SO_RCVBUF, will the TCP receive buffer be dynamically adjusted.

Similarly, the value set by SO_RCVBUF cannot exceed net.core.rmem_max. Typically, net.core.rmem_max needs to be set to a value greater than or equal to the max value in net.ipv4.tcp_rmem.

net.core.rmem_max = 16777216
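As a receive-side counterpart to the earlier sketch, here is a minimal, illustrative example of fixing the receive buffer with SO_RCVBUF. One detail worth noting (from tcp(7)) is that the option should be set before connect(2) or listen(2) so it can influence the window scaling negotiated during the handshake; the 2 MB value is just an assumption.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create and connect a TCP socket with a fixed receive buffer. Setting
 * SO_RCVBUF here disables kernel auto-tuning for this socket, and per tcp(7)
 * it must happen before connect()/listen() to affect the negotiated window
 * scale. Returns the connected fd, or -1 on failure. */
static int connect_with_fixed_rcvbuf(const char *ip, unsigned short port)
{
    int rcvbuf = 2 * 1024 * 1024;           /* 2 MB, illustrative assumption */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    if (fd < 0)
        return -1;

    /* The request is capped by net.core.rmem_max, then doubled internally. */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}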

In our production environment we have also seen packets dropped because the TCP receive buffer limit was reached. Tracking down such problems, however, was not straightforward: there was no intuitive way to observe this behavior, so I added statistics for it to our kernel.

To let anyone using the Linux kernel observe this behavior, I contributed the change to the Linux kernel community; you can see the specific commit here: tcp: add new SNMP counter for drops when try to queue in rcv queue. With this SNMP counter, you can conveniently use netstat to check whether packets are being dropped because the TCP receive buffer is full.

However, this method still has some limitations. If we want to determine which TCP connection is experiencing packet loss, this method is not suitable. In such cases, we need to rely on other more specialized trace tools, such as eBPF, to achieve our goals.

Class Summary #

Alright, that’s all for this class. Let’s briefly recap. TCP/IP is a complex protocol stack, and its packet transmission process is equally complex. In this class we focused on the parts of that process most likely to cause problems. The configuration options discussed above can easily cause issues in a production environment and must be considered when tuning for high-performance networking. I have summarized them in a table for your convenience:

These values need to be adjusted flexibly according to your business scenario. If you are unsure how to tune them for your workload, it is best to consult someone more experienced, or to adjust gradually while observing how the system and the application behave.

Homework #

In this lesson, we have two diagrams, which illustrate the process of sending and receiving TCP packets respectively. We can see that qdisc is used in the TCP sending process, but not in the receiving process. Why is that? Can we also use qdisc in the receiving process? Feel free to discuss with me in the comments section.

Thank you for reading. If you found this lesson helpful, please feel free to share it with your friends. See you in the next lecture.