
13 Case Study: How TCP Congestion Control Leads to Business Performance Jitter #

Hello, I am Yao Fang. In this lesson, I am going to share with you the relationship between TCP congestion control and business performance jitter.

TCP congestion control is the core of the TCP protocol and a very complex process; if you don’t understand congestion control, you don’t really understand TCP. In this lesson I will use case studies to introduce the pitfalls to avoid in TCP congestion control and the points to consider when tuning TCP performance.

Because many different scenarios can cause problems in TCP transmission, I won’t analyze each case step by step. Instead, I will abstract these cases and tie them to specific knowledge points so that the discussion is more systematic. Once you understand those knowledge points, analyzing the individual cases becomes relatively straightforward.

In the previous two lessons (Lesson 11 and Lesson 12), we covered the issues worth paying attention to from a single-machine perspective. Network transmission, however, is a more complex process that involves more problems and is harder to analyze. You have probably had experiences like these:

  • You are chatting on WeChat while waiting for an elevator, and after you step in, your messages can no longer be sent.
  • You are happily playing an online game with a roommate on the same network when the game suddenly becomes very laggy, and it turns out the roommate is downloading a movie.
  • Uploading a file to a server over FTP takes much longer than expected.

In all of these cases, TCP congestion control is at work.

How does TCP congestion control affect network performance? #

Let’s first take a look at the basic principles of TCP congestion control.

The above image is a simple illustration of TCP congestion control, which can be roughly divided into four stages.

1. Slow start #

After the TCP connection is established, the sender enters the slow start phase and gradually increases the number of packets sent (TCP Segments). In this phase, the number of packets doubles after each round-trip time (RTT), as shown in the following figure:

The initial number of packets is determined by init_cwnd (the initial congestion window), which is set to 10 (TCP_INIT_CWND) in the Linux kernel. This value is an empirical one worked out by Google researchers and is also written into RFC6928. Starting from Linux kernel version 2.6.38, the default init_cwnd was raised from 3 to 10 based on Google’s recommendation; you can check this commit for details: tcp: Increase the initial congestion window to 10.

Increasing the value of init_cwnd can significantly improve network performance because it allows more TCP Segments to be sent in the initial stage. You can refer to the explanation in RFC6928 for more detailed reasons.

If your kernel is relatively old (older than the kernel shipped with CentOS-6), consider raising init_cwnd to 10. Raising it even higher is possible, but you will need to experiment against your own network conditions to find an ideal value: an initial congestion window that is too large can cause a high TCP retransmission rate. You can also adjust this value more flexibly per route with the ip route command, or even turn it into a sysctl control item.
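
For example, a per-route adjustment with ip route might look like the sketch below; the gateway address, device name, and window value are placeholders you would replace with your own:

# show the current default route so you can copy its parameters
$ ip route show default
# raise the initial congestion window on that route (values are illustrative)
$ ip route change default via 192.168.1.1 dev eth0 initcwnd 10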

Increasing init_cwnd is particularly effective for short connections whose data can be fully transmitted within the slow start phase, such as HTTP services: an HTTP short-connection request usually carries little data and can typically be sent entirely during slow start. You can observe this with tcpdump.
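
If you want to observe this yourself, a capture like the one below (the interface and port are assumptions for an HTTP service) lets you count how many segments the server sends back-to-back in the first few round trips:

# capture full segments of HTTP traffic; inspect them later with tcpdump -r or Wireshark
$ tcpdump -i eth0 -nn -s 0 'tcp port 80' -w slow_start.pcap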

In the slow start phase, when the congestion window (cwnd) increases to a threshold value (ssthresh, slow start threshold), TCP congestion control enters the next phase: congestion avoidance.

2. Congestion avoidance #

In this phase, cwnd no longer doubles; instead it increases by one segment per RTT, which slows the growth of cwnd and helps prevent network congestion. Even so, congestion is hard to avoid entirely because network links are complex, and it can even lead to out-of-order packets. One cause of out-of-order packets is illustrated in the following figure:

In the above figure, the sender sends 4 TCP segments at once, but the second segment is lost during transmission, so the receiver does not receive it. However, the third and fourth TCP segments can be received, so they become out-of-order packets, which will be added to the receiver’s out-of-order queue.

Packet loss is more likely in mobile network environments, especially where network conditions are poor; in an elevator, for example, the loss rate can be high, which makes the network feel slow. Packet loss mainly affects services that face external networks, such as gateway services.

For our gateway services we have done some TCP optimization work, mainly tuning the Cubic congestion control algorithm to alleviate the performance degradation caused by packet loss. Google’s open-source congestion control algorithm BBR can, in theory, mitigate the impact of packet loss effectively; however, BBR did not perform well in our practice, so we did not end up using it.
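
If you want to check or switch the congestion control algorithm on your own machines, the kernel exposes it through sysctl; a minimal sketch (whether bbr is available depends on your kernel build):

# list the algorithms the kernel can currently use
$ sysctl net.ipv4.tcp_available_congestion_control
# show the current default (cubic on most distributions)
$ sysctl net.ipv4.tcp_congestion_control
# switch the default to bbr, if your kernel supports it
$ sysctl -w net.ipv4.tcp_congestion_control=bbr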

Going back to the previous diagram: because the receiver never received the second segment, every time it receives a later segment it acknowledges the second one again, i.e. it keeps sending ACK 17. The sender therefore receives three identical ACKs (ACK 17) in a row. When three duplicate ACKs appear, the sender concludes that a packet has been lost and enters the next stage: fast retransmit.

3. Fast retransmit and fast recovery #

Fast retransmit and fast recovery work together to optimize behavior after a packet loss. In this case, since the network is not considered congested, the congestion window does not need to be reset to its initial value; the loss is inferred from the three duplicate ACKs. Google engineers also proposed an improvement to TCP fast retransmission called TCP Early Retransmit, which in certain cases allows a TCP connection to retransmit faster without waiting for the retransmission timeout (RTO). This feature is supported in kernel versions 3.6 and above, so if you are still using CentOS-6 you won’t be able to enjoy the network performance improvement it brings; consider upgrading to CentOS-7 or the newer CentOS-8. By the way, Google’s technical expertise in networking is unparalleled, and the maintainer of the Linux kernel TCP subsystem, Eric Dumazet, is a Google engineer.

Besides fast retransmission, there is another retransmission mechanism: timeout retransmission, which is a much worse situation. If a sent packet is not acknowledged within a certain period (the RTO), the network is considered congested. In that case the congestion window (cwnd) is reset to its initial value and grows again from slow start.

RTO expirations usually happen when the network link is congested. If one connection pushes too much data, packets of other connections get queued behind it and suffer significant delay. The example mentioned earlier, where downloading a movie makes an online game laggy, is caused by exactly this.

The RTO itself is also a tuning point. If it is too large, the business can be blocked for a long time, so kernel version 3.1 reduced the initial RTO from 3 s to 1 s, which noticeably shortens the blocking time. However, RTO = 1 s is still too large in some scenarios, especially inside a data center where network quality is relatively stable.

We hit such a case in our production environment: the business reported large fluctuations in response time. A preliminary inspection with strace showed the process blocking in functions like send(). We then captured packets with tcpdump and found that after the sender transmitted the data, it received no response from the receiver until the RTO fired and the data was retransmitted. Capturing packets on the receiver side as well showed that the receiver only got the packets after a long delay. We therefore concluded that the network was congested and the receiver was not getting the packets in time.
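
For reference, the commands used in that kind of investigation look roughly like the following; the process ID, peer address, and port below are placeholders for the actual business process and peer:

# watch how long the process spends blocked in send-type syscalls (-T prints time spent per call)
$ strace -T -e trace=sendto,sendmsg,write -p 12345
# capture the suspect connection (repeat on the receiver) and compare timestamps afterwards
$ tcpdump -i eth0 -nn -s 0 'host 10.0.0.2 and port 8080' -w sender.pcap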

So, when network congestion causes the business to block for too long, is there a solution? One option is to set a send timeout on the TCP connection with SO_SNDTIMEO, so that the application does not stay blocked in send() for too long when sending packets, as shown below:

struct timeval timeout = { .tv_sec = 2, .tv_usec = 0 };  /* example timeout; pick a value that suits your business */
ret = setsockopt(sockfd, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout));

When the business detects a timeout on this TCP connection, it can actively close the connection and try another one.

This approach bounds how long the application can be blocked on one specific TCP connection (note that SO_SNDTIMEO does not change the RTO itself). But is there a way to adjust the RTO globally, i.e. set it once and have it apply to all TCP connections? The answer is yes, but it requires modifying the kernel. For such needs, our practice in production is to turn TCP RTO min, TCP RTO max, and TCP RTO init into variables that can be controlled via sysctl, so they can be tuned to the actual situation. For servers inside a data center, for example, these values can be lowered appropriately to reduce the blocking time of the business.
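
As a side note, even an unmodified kernel lets you lower the minimum RTO for a specific route via ip route’s rto_min option. This is not the global, sysctl-style control described above, and the gateway, device, and 200ms value below are only illustrative:

# lower the minimum RTO for traffic using this route (e.g. inside a data center)
$ ip route change default via 192.168.1.1 dev eth0 rto_min 200ms
# verify that the route metric took effect
$ ip route show default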

The four stages above are the foundation of TCP congestion control. In short, congestion control flexibly adjusts the congestion window (cwnd) according to how the TCP transmission is going, and thereby controls the sender’s behavior; the size of the congestion window reflects how congested the transmission path is. The cwnd of a TCP connection can be viewed with the ss command:

$ ss -nipt
State       Recv-Q Send-Q                        Local Address:Port                                       Peer Address:Port         
ESTAB       0      36                             172.23.245.7:22                                        172.30.16.162:60490      
users:(("sshd",pid=19256,fd=3))
     cubic wscale:5,7 rto:272 rtt:71.53/1.068 ato:40 mss:1248 rcvmss:1248 advmss:1448 cwnd:10 bytes_acked:19591 bytes_received:2817 segs_out:64 segs_in:80 data_segs_out:57 data_segs_in:28 send 1.4Mbps lastsnd:6 lastrcv:6 lastack:6 pacing_rate 2.8Mbps delivery_rate 1.5Mbps app_limited busy:2016ms unacked:1 rcv_space:14600 minrtt:69.402

By using this command, we can see that the cwnd for this TCP connection is 10.

If you want to track how the congestion window changes in real time, there is a better way: the tcp_probe tracepoint:

/sys/kernel/debug/tracing/events/tcp/tcp_probe

However, this tracepoint is only supported in kernel versions 4.16 and above. If your kernel version is older, you can also use the “tcp_probe” kernel module (net/ipv4/tcp_probe.c) for tracing.
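
On a 4.16+ kernel, enabling the tracepoint and watching the live output can be done as sketched below; the paths assume debugfs/tracefs is mounted in the usual location, and the steps need root:

# enable the tcp_probe tracepoint
$ echo 1 > /sys/kernel/debug/tracing/events/tcp/tcp_probe/enable
# stream the events; each line carries fields such as snd_cwnd, ssthresh and srtt
$ cat /sys/kernel/debug/tracing/trace_pipe
# turn it off again when you are done
$ echo 0 > /sys/kernel/debug/tracing/events/tcp/tcp_probe/enable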

Besides the network conditions, the sender also needs to know the receiver’s processing capability. If the receiver cannot keep up, the sender must slow down, otherwise packets will pile up in the receiver’s buffer and may even be dropped. The receiver’s processing capability is expressed through another window, the receive window (rwnd). So how does the receiver’s rwnd affect the sender’s behavior?

How does the receiver affect the sender’s data transmission? #

Similarly, I have drawn a simple diagram to illustrate how the receiver’s rwnd affects the sender:

As shown in the diagram above, after receiving a data packet the receiver sends an ACK back to the sender and advertises its rwnd in the win field of the TCP header, so the sender knows the size of the receiver’s rwnd. When sending the next TCP segments, the sender takes the smaller of its own cwnd and the receiver’s rwnd, and makes sure the amount of data it has in flight does not exceed that smaller value.

Regarding the impact of the receiver’s rwnd on the sender, we encountered a case where the business reported that the server was sending packets to a client very slowly, yet the server itself was not busy and the network looked fine, so the cause was unclear. We captured packets on the server with tcpdump and found that the ACKs from the client frequently carried win 0, meaning the client’s receive window was 0. We then investigated the client side and ultimately found a bug in the client code that prevented it from reading the received data in time.
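
A handy way to spot this situation in a capture is to filter on the window field directly; the filter below matches segments whose advertised window is 0 (offset 14 is the window field in the TCP header, and the interface and port are just examples):

# print only segments advertising a zero receive window
$ tcpdump -i eth0 -nn 'tcp port 80 and tcp[14:2] = 0'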

To make this behavior observable, I also wrote a patch for the Linux kernel: tcp: add SNMP counter for zero-window drops. It adds a new SNMP counter called TCPZeroWindowDrop: whenever the system has to drop an incoming packet because the receive window is too small to accept it, the counter is incremented, and it can be read from the TCPZeroWindowDrop field in /proc/net/netstat.
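
On a kernel that carries this patch, the counter can be read from /proc/net/netstat, or more conveniently with nstat from iproute2; the TcpExt-prefixed name below is how nstat exposes counters from that file:

# -a prints absolute values instead of deltas, -z keeps zero counters visible
$ nstat -az TcpExtTCPZeroWindowDrop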

Because the win field in the TCP header is only 16 bits, the largest window it can represent is 65535 bytes (64 KB). To support the larger receive windows that high-performance networks require, the following option must be enabled; it is on by default:

net.ipv4.tcp_window_scaling = 1

If you want to learn more about the design of this option, you can refer to RFC1323.
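
Checking or re-enabling this setting at runtime is a one-liner; the commands below simply use the standard sysctl tool:

# confirm window scaling is on
$ sysctl net.ipv4.tcp_window_scaling
# re-enable it if it has been switched off
$ sysctl -w net.ipv4.tcp_window_scaling=1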

Alright, that’s all for now regarding the impact of TCP congestion control on the performance of a business network.

Class Summary #

TCP congestion control is a very complex behavior, and what we covered in this lesson is only the basics; I hope it gives you a rough picture of how TCP congestion control works. Let me summarize the key points:

  • The congestion state of the network is reflected in the congestion window (cwnd) of a TCP connection, and the congestion window in turn governs the sender’s transmission behavior;
  • The receiver’s processing capacity is also fed back to the sender, expressed as the receive window (rwnd); rwnd and cwnd together determine the maximum amount of data the sender can have in flight;
  • The dynamic changes of TCP congestion control can be observed in real time through the tcp_probe tracepoint (kernel 4.16+) or the tcp_probe kernel module (kernels before 4.16). With tcp_probe you can get a very good view of a TCP connection’s data transmission status.

Homework #

Log in to the server via SSH, then turn off the network, wait a few seconds, and then turn it back on. Is the SSH connection still functional? Why? Please feel free to discuss it with me in the comments section.

Thank you for reading. If you found this lesson helpful, please feel free to share it with your friends. See you in the next lesson.