
11 Basics How System Configurations Affect the Establishment and Termination of TCP Connections #

Hello, I’m Shaoyafang.

If you have done network-related development or analyzed network-related issues on Linux, you have probably been overwhelmed by the sheer number of configuration options the system exposes. You may also have been troubled by questions like these:

  • Why can’t the Client establish a connection with the Server?
  • Why did I receive a reset from the Server after the three-way handshake?
  • Why does establishing a TCP connection take so much time?
  • Why are there so many connections in the time-wait state in the system? How should this be handled?
  • Why are there so many connections in the close-wait state in the system?
  • Considering my business scenario, how should I configure all these network options?

Because networking spans so many different scenarios, the Linux kernel has to handle all of them, and the strategy appropriate for one scenario may not suit another. The kernel's default network configuration is therefore not necessarily right for your workload, and the mismatch can lead to unexpected behavior in your business.

Therefore, to make the behavior of our business meet expectations, we need to understand the relevant network configurations in Linux and make these configurations more suitable for our business. There are numerous network configuration options in Linux, and to help you better understand them, I will start by using the most commonly used TCP/IP protocol as an example to explain how a network connection is established and terminated.

Which configuration options affect the process of establishing a TCP connection? #

The figure above shows the process of establishing a TCP connection, which begins with the client calling connect() and ends with the server returning successfully from accept(). Throughout this process, the connection's behavior is governed by various configuration options.

After the client calls connect(), the Linux kernel starts the three-way handshake.

First, the client sends a SYN packet to the server. However, this SYN packet may be lost during transmission or may not be processed by the server due to other reasons. In this case, the client side triggers the timeout retransmission mechanism. But the retransmission cannot continue indefinitely, as the number of retransmissions is also limited. This is determined by the tcp_syn_retries configuration option.

Assuming tcp_syn_retries is set to 3, the SYN packet retransmission strategy can be roughly summarized as follows:

After the client sends the SYN packet, if no response arrives within 1 second, it retransmits for the first time; if another 2 seconds pass without a response, it retransmits a second time, with the timeout doubling each round. At most tcp_syn_retries retransmissions are performed in total.

With tcp_syn_retries set to 3, a total of 3 retransmissions are performed. After the first SYN packet is sent, the client waits (1 + 2 + 4 + 8) seconds in all; if no response arrives from the server, the connect() call fails with an ETIMEDOUT error.

The default value of tcp_syn_retries is 6, which means that if SYN transmission keeps failing, an ETIMEDOUT error occurs after (1 + 2 + 4 + 8 + 16 + 32 + 64) seconds, i.e. 127 seconds.
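The exponential backoff above is easy to check with a few lines of arithmetic. This is a minimal sketch (the function name is ours, not a kernel API) that computes how long connect() blocks before ETIMEDOUT for a given tcp_syn_retries value:

```python
# Total time connect() blocks before failing with ETIMEDOUT for a given
# tcp_syn_retries value: the first timeout is 1 second, and each
# retransmission doubles the wait.
def syn_timeout_seconds(retries, initial=1):
    # initial wait + one doubled wait per retransmission
    return sum(initial * 2 ** i for i in range(retries + 1))

print(syn_timeout_seconds(6))  # kernel default -> 127 seconds
print(syn_timeout_seconds(2))  # tuned data-center value -> 7 seconds
```

With tcp_syn_retries lowered to 2, a client gives up after only 7 seconds instead of 127, which is why the tuning below is so common in data centers.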

We have encountered a similar situation in our production environment. The server was taken offline for some reason, but the client was not notified, so the client’s connect() call was blocked for 127 seconds before attempting to connect to a new server. Such a long timeout waiting time is unacceptable for an application.

Therefore, for servers inside a data center, we usually reduce tcp_syn_retries to shorten the blocking time. In a data center the network quality is generally very good, so if no response comes back from a server, the server itself most likely has a problem, and the best move for the client is to try another server as soon as possible. Clients therefore typically make the following adjustment:

net.ipv4.tcp_syn_retries = 2

In some cases, even a 1-second wait per attempt is too long, so the initial handshake timeout is sometimes reduced from the default 1 second to a smaller value, such as 100 milliseconds, which shortens the overall blocking time. Because this initial timeout is a kernel constant (TCP_TIMEOUT_INIT) rather than a sysctl, lowering it means patching the kernel, an optimization sometimes applied in data centers.

If the server does not respond to the client's SYN, then besides the case mentioned earlier, where the server no longer exists, it may also be that the server is too busy to respond, or that it has accumulated too many half-open (incomplete) connections and cannot process new SYNs in time.

A half-open connection is one for which the server has received a SYN but the handshake has not yet completed, because the final ACK has not arrived. Each time the server receives a new SYN packet, it creates a half-open connection and adds it to the half-open connection queue (SYN queue). The length of the SYN queue is determined by the tcp_max_syn_backlog option: once the backlog of half-open connections exceeds this value, new SYN packets are discarded. Since a server may face sudden surges of new connections, it is worth raising this value so that SYN packets are not dropped, which would leave the client waiting for a SYNACK that never comes:

net.ipv4.tcp_max_syn_backlog = 16384

A large backlog of half-open connections on the server may also mean that malicious clients are mounting a SYN flood attack. A typical SYN flood works like this: the client sends SYN packets to the server at high frequency while constantly changing the source IP address. The server assigns a half-open connection for each new SYN, but its SYNACK replies go to the spoofed source addresses, so the corresponding ACKs never arrive. No TCP connection is ever completed, the half-open connection queue fills up, and the server can no longer respond to legitimate SYN packets. To defend against SYN flood attacks, the Linux kernel introduced the SYN Cookies mechanism. How do SYN Cookies work?

When the server receives a SYN packet, instead of allocating resources to store the client's information, it computes a cookie value from the SYN packet, embeds the cookie in the SYNACK, and sends it out. For a legitimate connection, the client's ACK carries the cookie back; the server then validates the ACK against the cookie and, if it is valid, creates the TCP connection. In this way, SYN Cookies can fend off SYN flood attacks, so enabling them on Linux servers is recommended:

net.ipv4.tcp_syncookies = 1
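To make the stateless-validation idea concrete, here is a deliberately simplified sketch of the cookie scheme. This is illustrative only, not the kernel's actual algorithm (the kernel encodes the cookie into the SYNACK's initial sequence number and uses per-boot random keys; the secret, time window, and function names below are our assumptions):

```python
import hashlib
import time

SECRET = b"example-secret"  # hypothetical; the kernel uses per-boot random keys

def make_cookie(src, dst, sport, dport, t=None):
    # Hash the 4-tuple plus a coarse timestamp with a secret. The result
    # can be handed to the client in the SYNACK and re-verified later,
    # so the server stores nothing for the half-open connection.
    t = int(time.time() // 60) if t is None else t
    msg = f"{src}:{sport}-{dst}:{dport}-{t}".encode()
    return int.from_bytes(hashlib.sha256(SECRET + msg).digest()[:4], "big")

def check_cookie(cookie, src, dst, sport, dport):
    # Accept cookies from the current or the previous time window,
    # so a cookie remains valid across a window boundary.
    t = int(time.time() // 60)
    return any(make_cookie(src, dst, sport, dport, t - d) == cookie
               for d in (0, 1))
```

The key property: validity can be rechecked from the ACK alone, so a flood of spoofed SYNs consumes no server-side queue entries.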

The SYNACK packet sent from the server to the client may also be discarded or not received due to some reason. In this case, the server will also retransmit the SYNACK packet. Similarly, the number of retransmissions is controlled by the configuration option tcp_synack_retries.

The retransmission strategy for tcp_synack_retries is the same as for tcp_syn_retries described earlier, so we won't diagram it again. Its default value is 5. For data-center servers, such a large value is rarely necessary, and setting it to 2 is recommended:

net.ipv4.tcp_synack_retries = 2

After the client receives the SYNACK from the server, it replies with an ACK. When the server receives this ACK, the three-way handshake is complete: a fully established TCP connection is created and appended to the accept queue, from which the server's accept() call retrieves it to finish establishing the connection.

However, just like the length of the half-open queue (syn queue) is limited, the length of the accept queue (complete queue) is also limited, in order to prevent the server from wasting too many system resources by not calling accept() in time.

The length of the accept queue (complete-connection queue) is controlled by the backlog argument of listen(sockfd, backlog), and this backlog is capped by somaxconn. Before kernel 5.4, the default value of somaxconn was 128 (it was raised to 4096 in 5.4), and increasing it appropriately is recommended:

net.core.somaxconn = 16384
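A small socket demo makes the role of the accept queue visible: the three-way handshake finishes inside the kernel as soon as the client connects, and accept() merely pops the already-completed connection off the queue. This is a minimal sketch using the loopback interface with an OS-assigned port:

```python
import socket

# The effective accept-queue length is min(backlog, somaxconn).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))      # let the OS pick a free port
srv.listen(8)                   # backlog of 8, capped by somaxconn

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())  # handshake completes before accept() is called

conn, _ = srv.accept()          # retrieve the already-established connection
cli.sendall(b"ping")
data = conn.recv(4)
print(data)                     # b'ping'

conn.close(); cli.close(); srv.close()
```

Because the handshake completes before accept(), a server that is slow to call accept() silently accumulates established connections in this queue, which is exactly the overflow scenario the next option addresses.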

When the number of backlogged complete connections exceeds this value, new complete connections are discarded. When the server drops a new connection, it sometimes makes sense to send a reset to tell the client not to retry; the default behavior, however, is to drop it silently. Whether a reset is sent is controlled by the tcp_abort_on_overflow option, which defaults to 0 (no reset). It is generally recommended to leave it at 0:

net.ipv4.tcp_abort_on_overflow = 0

This is because if the server is unable to accept() in time and the accept queue is full, it is often caused by a sudden influx of new connection requests. Under normal circumstances, the server will quickly recover, and then the client can successfully establish a connection after retrying. In other words, setting tcp_abort_on_overflow to 0 gives the client a chance to retry. Of course, you can decide whether to enable this option based on your actual situation.

After a successful accept() call, a new TCP connection is established and enters the ESTABLISHED state:

The above figure shows the TCP state transition during the process from the client calling connect() to the server successfully returning from accept(). These states can be seen through the netstat or ss command. At this point, both the client and the server can communicate normally.

Next, let’s look at which system configuration options affect the TCP connection termination process.

Which configuration options affect the disconnection process of TCP connections? #

As shown above, when an application calls close(), it sends a FIN packet to the other end and then receives an ACK; the other end also calls close() to send FIN, and then this end also acknowledges to the other end with an ACK, which is the four-way handshake process of TCP.

The side that initiates close() performs the active close, while the side that receives the peer's FIN and then calls close() performs the passive close. During the four-way handshake, three TCP states deserve attention, the three marked in dark red in the figure: FIN_WAIT_2 and TIME_WAIT on the active-close side, and CLOSE_WAIT on the passive-close side. Except for CLOSE_WAIT, each of these states has a corresponding system configuration option to control it.

Let's first look at the FIN_WAIT_2 state. Once TCP enters this state, if this end does not receive the peer's FIN for a long time, it stays in the state and keeps consuming system resources. To bound this consumption, Linux imposes a timeout on the state, controlled by tcp_fin_timeout, which defaults to 60 seconds. Once the limit is exceeded, the connection is destroyed automatically.

As for why this end does not receive the FIN packet from the other end for a long time, it is usually because the other end machine has some problems, or it is too busy to close() in time. Therefore, we usually recommend reducing the tcp_fin_timeout as much as possible to avoid the resource consumption in this state. For machines in a data center, setting it to 2 seconds is sufficient:

net.ipv4.tcp_fin_timeout = 2

Let’s move on to the TIME_WAIT state. The purpose of maintaining the TIME_WAIT state is that the last ACK packet sent may be discarded or delayed, so the other end may send the FIN packet again. If the TIME_WAIT state is not maintained, when the FIN packet from the other end is received again, this end will reply with a Reset packet, which may cause some anomalies.

Therefore, keeping the TIME_WAIT state around for a while guarantees that the TCP connection is torn down cleanly. The default TIME_WAIT lifetime in Linux is 60 seconds (the kernel constant TCP_TIMEWAIT_LEN), which may be too long for a data center; since it is not a sysctl, the kernel is sometimes patched to shorten it or to expose it as a tunable.

Holding many connections in TIME_WAIT for so long also wastes system resources, so there is an option that caps how many connections may sit in this state: tcp_max_tw_buckets. In a data center the network is relatively stable and FIN handling rarely misbehaves, so lowering this value is recommended:

net.ipv4.tcp_max_tw_buckets = 10000

After the client closes a connection to the server, it may quickly open a new one. But there are only 65536 TCP ports, so if connections in TIME_WAIT are not reused, a rapidly reconnecting application can fail to create new connections because the ports are still occupied. It is therefore recommended to enable reuse of TIME_WAIT connections (note that tcp_tw_reuse applies only to outgoing connections and requires TCP timestamps to be enabled):

net.ipv4.tcp_tw_reuse = 1

There is another option, tcp_tw_recycle, to control the TIME_WAIT state. However, this option is very dangerous because it may cause unexpected problems, such as packet loss in a NAT environment. Therefore, it is recommended to disable this option:

net.ipv4.tcp_tw_recycle = 0

Because enabling this option caused so many problems, it has been removed entirely from recent kernels (since 4.12, in the commit "tcp: remove tcp_tw_recycle").

As for the CLOSE_WAIT state, there is no corresponding configuration option in the system. But this state is also a dangerous signal. If there are many connections in this state, it often indicates a bug in the application where close() is not called under certain conditions. We have encountered many such issues in our production environment. So if you have many connections in the CLOSE_WAIT state in your system, it is best to investigate your application and see where close() is missing.
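The application-side fix is easy to demonstrate: on the passive-close side, recv() returning an empty byte string is the signal that the peer's FIN arrived, and a correct application must close() promptly at that point. A minimal sketch over loopback:

```python
import socket

# How CLOSE_WAIT arises: the peer closes (sends FIN) but this side
# never calls close(), leaving the socket stuck in CLOSE_WAIT.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(srv.getsockname())
conn, _ = srv.accept()

cli.close()            # active close: FIN sent; conn enters CLOSE_WAIT
eof = conn.recv(1024)  # b"" marks the peer's FIN (end of stream)
print(eof == b"")      # True

conn.close()           # the fix for lingering CLOSE_WAIT: close() on EOF
srv.close()
```

Any code path that reads b"" (or an equivalent EOF condition) and then forgets to close the socket is precisely the kind of bug behind a pile-up of CLOSE_WAIT connections.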

Here, we have finished discussing the things to be aware of during the four-way handshake process of TCP.

Well, that’s all for this lesson.

Class Summary #

In this class, we have discussed a lot of configuration options. I have summarized these options in the table below for your convenience:

| Configuration Option | Description |
| --- | --- |
| tcp_syn_retries | Maximum number of SYN retransmissions on the client |
| tcp_max_syn_backlog | Maximum length of the SYN (half-open connection) queue |
| tcp_syncookies | Enables SYN Cookies to defend against SYN flood attacks |
| tcp_synack_retries | Maximum number of SYNACK retransmissions on the server |
| somaxconn | Upper limit on the accept (complete-connection) queue length |
| tcp_abort_on_overflow | Whether to send a reset to the client when the accept queue overflows |
| tcp_fin_timeout | Timeout for the FIN_WAIT_2 state |
| tcp_tw_reuse | Allows TIME_WAIT connections to be reused for new outgoing connections |
| tcp_tw_recycle | Fast recycling of TIME_WAIT connections (dangerous; removed in recent kernels) |
| tcp_max_tw_buckets | Maximum number of connections in the TIME_WAIT state |

Of course, some configuration options can be flexibly adjusted based on your server load, CPU, and memory size. For example, tcp_max_syn_backlog, somaxconn, and tcp_max_tw_buckets. If you have sufficient physical memory and CPU cores, you can increase these values appropriately. These are often empirical values.

In addition, the purpose of this class is not only to let you understand these configuration options but also to help you understand the mechanisms behind them. This way, when you encounter problems, you can have a rough analysis direction.

Homework #

Please use the tcpdump tool to observe the three-way handshake and four-way handshake processes of TCP after class to consolidate today’s learning. Feel free to share your thoughts in the comments section.

Thank you for reading. If you found this lesson to be helpful, please feel free to share it with your friends. See you in the next lecture.