14 Case Study: How to Determine Where the Problem Lies When There Is a Large End-to-End TCP Delay #

Hello, I am Yafang Shao.

If you are an Internet practitioner, you should be familiar with the following scenario: the client sends a request to the server, the server processes the request, and then sends the response data back to the client. This is a typical Client/Server (C/S) architecture. For this type of request-response service, we have to face the following questions:

  • If the response time received by the client increases, is it due to a problem with the client itself, or is it because the server is slow, or is it due to network jitter?
  • Even if we have determined that it is a problem with the server or client, is it caused by the application itself or by the kernel?
  • Moreover, in many cases, this type of problem may only occur one or two times a day, making it difficult for us to capture real-time information.

To handle these frustrating problems more effectively, I have explored some methods of real-time tracing that do not noticeably impact the application or the system, yet can capture the relevant information at the moment a failure occurs, helping us locate the problem quickly.

Therefore, in this session, I will share some of my practices in this area, as well as some specific cases that have been resolved.

Of course, these practices are not only applicable to C/S architectures, but also have reference value for other applications, especially for latency-sensitive applications. For example:

  • If my business is running in a virtual machine, how can I trace it?
  • If there is a proxy between the client and server, how can we determine if the problem is caused by the proxy?

So, let’s start with a case study of network jitter in a production environment with C/S architecture.

How to Analyze Network Jitter in a C/S Architecture? #

The diagram above depicts a typical C/S architecture. There may be a complex network between the client and the server, but for server developers or operations personnel that intermediate network can be treated as a black box: it is hard to get detailed information about it, let alone debug the network devices themselves. In this case, therefore, I simplify it into a single router through which the client and server communicate. Typical examples of this architecture are database services such as MySQL and HTTP services in Internet scenarios. The network-jitter issue we were asked to diagnose involved a MySQL service, so we will use MySQL as the example for the analysis.

The MySQL business side reported that their requests occasionally took a very long time, but they could not tell where the time was going. In terms of the diagram above, the interval between the moment the response is received at point D and the moment the request is sent at point A occasionally spiked. When network problems occur, capturing packets with tcpdump is a common means of analysis; if you are not sure how to analyze a network issue, it is a good idea to use tcpdump to preserve the scene of the incident first.

If you have used tcpdump to analyze issues before, you have most likely run into difficulties. Being familiar with Wireshark makes things somewhat easier, but for most people the cost of learning Wireshark is also high, and remembering its many filters and options is tedious.

In our case, our business personnel also used tcpdump to capture packets and preserve the scene, but they did not know how to analyze the tcpdump information. When I was helping them analyze this tcpdump information, it was difficult to associate the tcpdump information with the moments when the business was experiencing jitter. This is because although we know the moment the business jitter occurred, such as at 21:00:00.000, there could be a large number of TCP packets around this time, making it difficult to simply associate the two using timestamps. Additionally, and more importantly, we know that TCP is a stream, and a single request from the upper-level business can be divided into multiple TCP packets (TCP segments). Similarly, multiple requests can also be combined into one TCP packet. In other words, it is difficult to associate the TCP stream with application data. This is the challenge in using tcpdump to analyze business requests and responses.
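
To make the "TCP is a stream" point concrete, here is a minimal Python sketch (my own illustration, not something from the original case): associating TCP data with business requests means reassembling the byte stream and then delimiting messages by the application protocol's own framing. For MySQL, every wire packet starts with a 3-byte little-endian payload length and a 1-byte sequence id.

```python
# Minimal illustrative sketch (not from the original case): delimiting MySQL wire
# packets inside a reassembled TCP byte stream. The MySQL client/server protocol
# prefixes each packet with a 3-byte little-endian payload length and a 1-byte
# sequence id, so message boundaries have nothing to do with TCP segment boundaries.

def split_mysql_packets(stream: bytes):
    """Return (packets, leftover): complete (seq_id, payload) pairs plus any
    trailing bytes that still need more TCP data to form a full packet."""
    packets, offset = [], 0
    while len(stream) - offset >= 4:                       # need a complete 4-byte header
        length = int.from_bytes(stream[offset:offset + 3], "little")
        seq_id = stream[offset + 3]
        if len(stream) - offset - 4 < length:              # payload not fully arrived yet
            break
        packets.append((seq_id, stream[offset + 4:offset + 4 + length]))
        offset += 4 + length
    return packets, stream[offset:]


# One TCP segment may carry two small queries, or one large query may span several
# segments; either way, only the MySQL length header delimits the messages.
segment = bytes([5, 0, 0, 0]) + b"\x03DO 1" + bytes([6, 0, 0, 0]) + b"\x03DO 22"
print(split_mysql_packets(segment))   # two complete packets, no leftover
```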

To address tcpdump's difficulty with application protocols, one approach is to write the raw packets to disk with tcpdump and then parse the application protocol from the saved capture offline. However, you will find that handling jitter issues this way in a production environment is unrealistic: there can be hundreds or even thousands of TCP connections, and we often do not know which connection the jitter occurs on. Dumping all of these TCP streams, even if we only keep the data relevant to the application protocol, would be a real burden on disk I/O. So, is there a better solution?

A better solution still works at the level of the application protocol, but by abstracting the protocol we can simplify its handling or even avoid parsing it at all. For MySQL, the tool tcprstat was designed for exactly this.

tcprstat works roughly as follows: it exploits MySQL's request-response pattern to avoid dealing with the protocol content. Request-response means that a request reaches the MySQL server, MySQL processes it and sends back a response, and only after the client receives that response does it send the next request, which MySQL then receives and processes. In other words, the model is essentially serial: one request is finished before the next one starts. tcprstat can therefore take the arrival of a request packet at the MySQL server as the start time and the moment MySQL sends the last packet of the response as the end time; the difference between the two is the RT (Response Time). The process is roughly as shown in the following diagram:

tcprstat records the arrival time of a request and the send time of its response, calculates the RT, and logs it. When we deployed tcprstat on the MySQL server side, every RT value it reported was very small and showed no noticeable delay, so the server side appeared to be fine. Could the problem be on the client side, then?
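
To make this model concrete, here is a minimal Python sketch of this style of RT accounting (my own simplification, not tcprstat's actual code). It assumes you already have, for a single connection, a time-ordered list of packet events tagged with their direction relative to the MySQL server.

```python
# Sketch of tcprstat-style RT accounting on the server side (simplified, not the
# tool's real implementation). Input: a time-ordered list of (timestamp, direction)
# events for ONE connection, where "in" means a packet arriving at the MySQL server
# and "out" means a packet it sends.

def response_times(events):
    rts = []
    request_start = None   # arrival time of the first packet of the current request
    last_response = None   # send time of the latest response packet seen so far
    for ts, direction in events:
        if direction == "in":
            if last_response is not None:
                # the previous request/response pair is complete; record its RT
                rts.append(last_response - request_start)
                request_start, last_response = None, None
            if request_start is None:
                request_start = ts             # first packet of a new request
        else:                                  # "out": keep updating until the response ends
            last_response = ts
    if request_start is not None and last_response is not None:
        rts.append(last_response - request_start)
    return rts


# Toy trace: a two-segment request answered at t=0.004, then a second request.
events = [(0.000, "in"), (0.001, "in"), (0.003, "out"), (0.004, "out"),
          (0.100, "in"), (0.105, "out")]
print(response_times(events))   # -> [0.004, 0.005]
```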

When we wanted to deploy tcprstat on the client side to capture information, we found that it only supported deployment on the server side. Therefore, we made some modifications to tcprstat so that it could also be deployed on the client side.

This modification was not complicated: the MySQL protocol is handled in the same way on the client side; only the direction of the TCP flow is reversed. On the client side requests are sent and responses are received, while on the server side requests are received and responses are sent, as the small sketch below shows.
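
On the client side, the same accounting applies once the directions are flipped; here is a tiny hypothetical helper on top of the response_times() sketch above:

```python
# On the client side requests go out and responses come in, so one way to reuse
# the server-side accounting is simply to flip the direction of every event.

def client_events_to_server_view(events):
    flip = {"in": "out", "out": "in"}
    return [(ts, flip[d]) for ts, d in events]

client_trace = [(0.000, "out"), (0.003, "in"), (0.004, "in")]   # send, then receive
print(response_times(client_events_to_server_view(client_trace)))   # -> [0.004]
```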

After the modification was done, we deployed it to catch the jitter in real time. When the business jittered again, the captured data showed that the delay was already present by the time the response packets reached the client. In other words, the problem was not on the client side either. This was a bit strange: if both the client and the server were fine, could it be a network issue?

To verify this, we ran ping during the business's off-peak hours to check for network problems. After pinging for a few hours, we saw the ping response time suddenly jump from under 1 ms to tens or even hundreds of milliseconds, and then quickly return to normal.
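
A long-running ping watcher like this is easy to script. Here is a rough Python sketch of the idea; the host name db-server and the 10 ms threshold are placeholders, not values from the actual case.

```python
# Rough sketch of a long-running ping watcher that logs only the spikes.
import re
import subprocess
import time

def ping_once(host):
    """Return the RTT in milliseconds for a single ping, or None on loss/error."""
    out = subprocess.run(["ping", "-c", "1", "-W", "1", host],
                         capture_output=True, text=True).stdout
    m = re.search(r"time=([\d.]+) ms", out)
    return float(m.group(1)) if m else None

while True:
    rtt = ping_once("db-server")               # placeholder host name
    if rtt is None or rtt > 10.0:              # log losses and anything above 10 ms
        print(time.strftime("%F %T"), "rtt spike:", rtt)
    time.sleep(1)
```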

Based on this, we suspected congestion on some switch along the path, so we contacted the switch management personnel. After they checked the switches on this link one by one, they finally located a problematic access switch whose queue would occasionally build up. The reason only MySQL reported jitter was simply that the other businesses behind this access switch were not sensitive to it. After the switch vendor helped fix the problem, this occasional jitter never occurred again.

This result seems simple, but the analysis process was actually quite complex. Because at first, we didn’t know where the problem was, so we could only troubleshoot step by step. Therefore, this analysis process took several days.

The network jitter problem caused by the switch was just one of many jitter cases we analyzed. Besides this kind of problem, we also analyzed many cases of jitter caused by the client side or the server side. After analyzing so many jitter problems, we started to wonder whether we could build an automated analysis system for them. Moreover, after deploying and running tcprstat, we found it had a noticeable shortcoming: its performance overhead was not small. With a large number of TCP connections its CPU utilization could even exceed 10%, which made it unsuitable for long-term operation in our production environment.

tcprstat's high CPU overhead has essentially the same cause as tcpdump's: the packets it captures out of band have to be copied to user space and processed there, and that copying and processing consumes CPU.

To meet the needs of the production environment, we developed a more lightweight analysis system based on tcprstat.

How to Determine Where Jitter Occurs in a Lightweight Way? #

Our goal is to keep the monitoring overhead as low as possible, ideally within 1%, even under the high-concurrency load of a 10Gb NIC, and without adding noticeable latency to the business. To keep CPU overhead down, as much of the work as possible has to be done in the kernel, in the same spirit as the now-popular eBPF tracing framework: the kernel processes the data and only the results are handed back to user space.

There are roughly two solutions to achieve this goal: one is to use kernel modules, and the other is to use lightweight kernel tracing frameworks.

The drawback of kernel modules is that they are inconvenient to install and deploy, especially when the production environment runs many different kernel versions. For example, we have both CentOS-6 and CentOS-7 systems, each with many minor versions plus our own release versions. Unifying the kernel version in production is impractical, as it would involve too many changes. This means that using kernel modules carries a high maintenance cost. Moreover, kernel modules are not user-friendly, so business and operations staff are reluctant to use them, which makes them hard to roll out. Weighing these factors, we ultimately chose to build on the systemtap tracing framework. We did not choose eBPF because it requires newer kernels, and many of our production systems still run the CentOS-7 (3.10) kernel.

The tracing framework implemented based on systemtap is roughly as shown in the following figure:

It tracks each TCP stream, which corresponds to an instance of struct sock in the kernel, and records the timestamps of the TCP stream passing through points A/B/C/D in the kernel. Based on these timestamps, we can reach the following conclusions:

  • If the time difference between C and B is large, the jitter is on the server side; otherwise the problem is on the client side or in the network.
  • If the time difference between D and A is small while the application still observes a large RT, the problem is on the client side; otherwise it is on the server side or in the network. (A minimal sketch of this classification follows the list.)
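
As a minimal sketch of this classification (my own simplification, not the actual systemtap tooling): given the four kernel timestamps for one request, plus app_rt, the response time observed by the client application (an extra input I add so that "D minus A is small" has something to be compared against), the decision could look like this. The 5 ms threshold is an arbitrary placeholder.

```python
# Minimal sketch (not the real systemtap tool): attribute jitter for one request.
# a = request leaves the client, b = request reaches the server,
# c = response leaves the server, d = response reaches the client.
# app_rt is the response time the client application actually observed.

JITTER_THRESHOLD = 0.005   # seconds; arbitrary placeholder

def locate_jitter(a, b, c, d, app_rt):
    if c - b > JITTER_THRESHOLD:
        return "server side"               # the server spent too long between B and C
    if app_rt - (d - a) > JITTER_THRESHOLD:
        return "client side"               # D-A is small, so the time was spent above A/D
    if (d - a) - (c - b) > JITTER_THRESHOLD:
        return "network"                   # the time on the wire dominates
    return "no significant jitter"

# Example: roughly 40 ms was added between the client and server kernels.
print(locate_jitter(a=0.000, b=0.020, c=0.021, d=0.041, app_rt=0.042))   # -> network
```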

With this in place, when the RT (response time) jitters we can tell whether the jitter happened on the client side, the server side, or in the network, which greatly improves the efficiency of analysis and localization. Once you know where the problem is, you can use the knowledge from “Lesson 11”, “Lesson 12”, and “Lesson 13” to dig into the specific cause.

We have encountered many pitfalls in using systemtap, and we would like to share them with you here, hoping that you can avoid them:

  • Loading a systemtap script is itself expensive, mainly in CPU, because the script has to be compiled at load time. You can pre-compile your systemtap scripts into kernel modules and load the modules directly, so the compilation cost is not paid on the production machine.
  • systemtap provides a number of overhead-control options. Set thresholds for them so that, if something goes wrong and the probes start consuming too much CPU, systemtap limits or aborts itself.
  • When a systemtap process terminates abnormally, it may fail to unload its kernel module. If you find that a systemtap process has exited, check whether the corresponding module was unloaded as well; if not, unload it manually to avoid trouble later.

The client-server (C/S) architecture is a typical scenario in Internet services. How can we analyze problems in other scenarios? Next, let’s take the example of a virtual machine scenario.

How to Determine Whether Jitter Occurs on the Host or in the Virtual Machine in a Virtualized Environment? #

With the development of cloud computing, more and more businesses are being deployed in the cloud. Many companies use their own customized private clouds or public clouds. We also have many services deployed in our own private cloud, including virtual machines based on KVM and containers based on Kubernetes and Docker. Taking virtual machines as an example, when there is jitter on the server side, the business staff would like to know whether the jitter occurs inside the virtual machine or on the host. To meet this requirement, we only need to extend the tracing with new hook points that record the timestamps of TCP flows passing through the virtual machine, as shown in the diagram below:

In this way, we can determine whether the jitter occurs inside the virtual machine based on the time difference between F and E. For this requirement we made a similar modification to tcprstat so that it can tell whether the jitter happens inside the virtual machine. This modification is not complex either: by default tcprstat only processes packets whose destination address is the local machine and ignores forwarded packets, so we enabled promiscuous-mode capture, after which it can see forwarded packets as well (the sketch below illustrates the idea). Of course, virtual-machine network configurations vary a lot, so you need to adapt this to your actual virtual network setup.
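
In libpcap-based capture tools, promiscuous mode is just a flag passed when the capture device is opened. As an illustration of the same idea at the Linux socket level (this is not tcprstat's code), the sketch below opens a raw AF_PACKET socket and switches the interface into promiscuous mode so the capture also sees packets the host is merely forwarding. The interface name eth0 is a placeholder, and the numeric constants are copied from the kernel headers because Python's socket module does not expose all of them.

```python
# Illustration only: capture on an interface in promiscuous mode so that
# forwarded packets are visible too. Needs CAP_NET_RAW (typically run as root).
import socket
import struct

ETH_P_ALL = 0x0003              # all protocols, from <linux/if_ether.h>
SOL_PACKET = 263                # from <linux/socket.h>
PACKET_ADD_MEMBERSHIP = 1       # from <linux/if_packet.h>
PACKET_MR_PROMISC = 1

iface = "eth0"                  # placeholder interface name
sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
sock.bind((iface, 0))

# struct packet_mreq: int mr_ifindex; unsigned short mr_type, mr_alen; unsigned char mr_address[8];
mreq = struct.pack("iHH8s", socket.if_nametoindex(iface), PACKET_MR_PROMISC, 0, b"")
sock.setsockopt(SOL_PACKET, PACKET_ADD_MEMBERSHIP, mreq)

frame = sock.recv(65535)        # now also receives frames being forwarded by this host
print(len(frame), "bytes captured")
```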

In summary, I hope you can think creatively and do reasonable data analysis based on your actual business scenario, instead of being limited to the specific scenarios listed in this lesson.

Class Summary #

In this class we analyzed, using the typical C/S architecture as the example, how to efficiently identify where the problem lies when the RT (response time) jitters. Let me summarize the key points of this class:

  • tcpdump is a tool that must be mastered for analyzing network problems, but analyzing problems with it is not easy. When you are unsure how to analyze network problems, you can start by using tcpdump to save the on-site information.
  • TCP is a data stream. The key to analyzing specific application problems is how to associate TCP streams with specific business requests/responses. You need to combine your business model to make reasonable associations.
  • RT jitter problems are tricky, and you need to build efficient analysis tools that fit your business model. If you are running Red Hat or CentOS, consider systemtap; on Ubuntu, consider LTTng.

Homework #

Based on the first figure of this lesson, can we use the difference between the moment the TCP flow reaches point B (when the request arrives at the server) and the moment it passes point A (when the request leaves the client) as the network latency? Why?

Building on the round-trip time (RTT) we discussed in Lesson 13, can we go a step further and use the RTT as the network latency? Why? Feel free to discuss with me in the comments section.

Thank you for reading. If you found this lesson helpful, please share it with your friends. See you in the next lesson.