25 Distributed Tracing: How to Trace Slow Requests Across Dozens of Distributed Components #

Hello, I’m Tang Yang.

After the previous lessons, your vertical e-commerce system has introduced an RPC framework and a registration center and completed its basic service decomposition, and the system architecture has changed accordingly:

img

Now your system is running smoothly, the boss is happy, and you are much more at ease. Better still, the service decomposition has greatly improved scalability: you can smoothly scale the service nodes horizontally and feel more confident about handling peak traffic.

But then a problem arises: monitoring shows that the core order-placement interface has a small number of slow requests during the evening rush hour, and users complain that placing an order in the app takes a long time. The order-placement flow may call multiple RPC services and use multiple resources, so for the moment you cannot quickly tell which service or resource is slowing down the overall flow. You and your team set out to figure out how to troubleshoot the problem.

How to troubleshoot slow requests in a monolithic architecture #

In a distributed environment, troubleshooting slow requests is more challenging because a request is passed between multiple services. To keep things simple at first, let's look at how slow requests are tracked down in a monolithic architecture.

The simplest approach is to print the time taken for each step of the order operation and compare the timing data to identify the step with the highest delay. Then, we can analyze and optimize that particular step if necessary. If needed, you can also add logs for sub-steps within each step for further troubleshooting. The code might look like this:

long start = System.currentTimeMillis();
processA();
Logs.info("process A cost " + (System.currentTimeMillis() - start)); // Print the time taken for step A

start = System.currentTimeMillis();
processB();
Logs.info("process B cost " + (System.currentTimeMillis() - start)); // Print the time taken for step B

start = System.currentTimeMillis();
processC();
Logs.info("process C cost " + (System.currentTimeMillis() - start)); // Print the time taken for step C

This is the simplest implementation. After printing the logs, we can log in to the server and search for keywords to view the timing information for each step.

Although this approach is straightforward, you may encounter a problem: Since multiple order requests are processed in parallel, the logs for the timing of each step in these requests are interleaved. You cannot easily determine which logs belong to the same request, making it difficult to identify the step with the longest delay. So, how can you connect the timing information for each step of a single request?

One simple approach is to add a unique identifier to each line of logs belonging to the same request. This way, once you have this identifier, you can query and view the timing of all steps in the request chain. We call this identifier a requestId. We can generate a requestId at the entry point of the program and store it in the thread’s context. This way, we can easily retrieve the requestId from the thread’s context whenever needed. The code implementation might look like this:

String requestId = UUID.randomUUID().toString();
ThreadLocal<String> tl = new ThreadLocal<String>() {
    @Override
    protected String initialValue() {
        return requestId;
    }
}; // requestId stored in the thread's context

long start = System.currentTimeMillis();
processA();
Logs.info("rid : " + tl.get() + ", process A cost " + (System.currentTimeMillis() - start)); // Add requestId to the log

start = System.currentTimeMillis();
processB();
Logs.info("rid : " + tl.get() + ", process B cost " + (System.currentTimeMillis() - start));

start = System.currentTimeMillis();
processC();
Logs.info("rid : " + tl.get() + ", process C cost " + (System.currentTimeMillis() - start));

With the requestId, you can have a clear understanding of the timing distribution for a specific calling chain.

So, you added a lot of logs to your code to troubleshoot the slow order operation. Soon enough, you discovered that a slow database query was causing the delay in order processing. You then optimized the database indexes, and the problem was finally resolved.

Just when you thought you could breathe a sigh of relief, more problems arose: some users reported that certain product operations were slow, and the storefront homepage loaded slowly as well. You started adding timing logs again, and soon realized that adding logs and restarting the service every time you need to troubleshoot an API is not an ideal solution. So you started looking for a better one.

In my experience, the most common cause of slow API responses is network communication, such as querying a database or cache, or calling a third-party service. So we only need to apply aspect-oriented programming (AOP) to the client classes that make these calls and print their timing information.

Speaking of AOP, you should be familiar with it. It is an extension of object-oriented programming that allows adding functionality to an application without modifying the source code, such as authentication and logging. If you’re not very familiar with the concept of AOP, let me give you an analogy to help you understand.

It’s similar to when a developer submits code to the code repository, they need to compile the code, build it, and execute unit tests to ensure the submitted code is error-free. However, if every developer has to perform these tasks every time they submit code, it would be a burden. So, you can configure a continuous integration pipeline that automatically performs these operations after code submission. This continuous integration process can be considered as an aspect.

Generally speaking, there are two categories of AOP implementations:

One is static proxy, with AspectJ as the typical representative; it weaves the aspect code in at compile time.

The other is dynamic proxy, with Spring AOP as the typical representative; it injects the aspect code at runtime.

What is the difference between the two approaches? Taking Java as an example: Java source files are first compiled into class files by the Java compiler, and the Java Virtual Machine then loads the class files and performs the necessary verification and initialization before they can run.

Static proxy inserts the code at compile time, which lengthens compilation and can make the build-and-start cycle feel slower. However, once the code has been woven in at compile time, there is essentially no performance impact at runtime.

On the other hand, dynamic proxy does not modify the generated Class files. Instead, it generates a proxy object at runtime, which enhances the bytecode of the original object to perform the operations required by the aspect. Since a proxy object needs to be generated at runtime, the performance of dynamic proxy is worse than that of static proxy.
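To make the runtime approach concrete, here is a minimal sketch using the JDK's built-in dynamic proxy (the OrderService interface and the class names are made up for illustration): the proxy object is generated at runtime and times every call it forwards, which is essentially what a dynamic-proxy aspect does.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class DynamicProxyDemo {
    interface OrderService {
        void placeOrder(String itemId);
    }

    static class TimingHandler implements InvocationHandler {
        private final Object target;

        TimingHandler(Object target) { this.target = target; }

        @Override
        public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
            long start = System.currentTimeMillis();
            try {
                return method.invoke(target, args); // delegate to the real object
            } finally {
                System.out.println(method.getName() + " cost " + (System.currentTimeMillis() - start) + " ms");
            }
        }
    }

    public static void main(String[] args) {
        OrderService real = itemId -> System.out.println("ordering " + itemId);
        // The proxy class is generated at runtime; this generation and reflective call are the extra cost
        OrderService proxied = (OrderService) Proxy.newProxyInstance(
                OrderService.class.getClassLoader(),
                new Class<?>[]{OrderService.class},
                new TimingHandler(real));
        proxied.placeOrder("sku-1001");
    }
}

Spring AOP performs the same kind of runtime wrapping (via JDK dynamic proxies or CGLIB), which is where its extra cost relative to compile-time weaving comes from.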

The reason we do aspect-oriented programming is to generate some debugging logs, so we expect to minimize the impact on the performance of the original interface. Therefore, I recommend using the static proxy approach to implement aspect-oriented programming.

If you need to add aspects to your system for validation, rate limiting, or logging, I also recommend considering the static proxy approach. Below is a simple aspect implemented with AspectJ:

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;

@Aspect
public class Tracer {

    @Around(value = "execution(public methodsig)", argNames = "pjp") // Replace 'methodsig' with the signature of the methods to advise, e.g. execution(public * com.example.dao.*.*(..))
    public Object trace(ProceedingJoinPoint pjp) throws Throwable {
        TraceContext traceCtx = TraceContext.get(); // Get the trace context; it can be initialized at the entry point of the program
        String requestId = traceCtx.getRequestId(); // Get the requestId
        String sig = pjp.getSignature().toShortString(); // Get the method signature

        boolean isSuccessful = false;
        String errorMsg = "";
        long start = System.currentTimeMillis();
        try {
            Object result = pjp.proceed();
            isSuccessful = true;
            return result;
        } catch (Throwable t) {
            isSuccessful = false;
            errorMsg = t.getMessage();
            throw t; // Re-throw so the caller still sees the failure
        } finally {
            long elapseTime = System.currentTimeMillis() - start;
            Logs.info("rid : " + requestId + ", start time: " + start + ", elapseTime: " + elapseTime + ", sig: " + sig + ", isSuccessful: " + isSuccessful + ", errorMsg: " + errorMsg);
        }
    }
}

By implementing this, you can print the time consumed for every database access, cache operation, and external interface access in each interface of your system. Each request may need to print dozens of logs, so if the QPS of your e-commerce system is 10,000, it will generate hundreds of thousands of logs per second, resulting in a heavy disk I/O load. In this case, you need to consider how to reduce the number of logs.

You can consider sampling the requests. The sampling method is simple: for example, to keep about 10% of the logs, print only the requests where “hash(requestId) % 10 == 0” (since the requestId here is a UUID string, hash it before taking the modulus).
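As a rough sketch of the sampling decision (reusing the illustrative thread-local tl and Logs helper from earlier), hashing the requestId keeps the decision consistent across all log lines of the same request:

// Keep roughly one request in N; the requestId is a UUID string, so hash it first.
static boolean shouldSample(String requestId, int oneInN) {
    int bucket = (requestId.hashCode() & Integer.MAX_VALUE) % oneInN; // mask to avoid negative hash values
    return bucket == 0;
}

// Only print the timing log for sampled requests
if (shouldSample(tl.get(), 10)) {
    Logs.info("rid : " + tl.get() + ", process A cost " + (System.currentTimeMillis() - start));
}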

With these logs in place, you run into another issue: given a requestId, you cannot tell which server the request landed on, so you have to log in to every server and search for the requestId to locate it. This undoubtedly lengthens troubleshooting.

One possible solution is not to print these logs to local files, but to send them to a message queue; a message-processing program then writes them into centralized storage such as Elasticsearch. When troubleshooting, you only need to query Elasticsearch for the records matching a requestId. After adding the message queue and Elasticsearch, the architecture diagram of this troubleshooting setup changes as well:

img
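To illustrate the idea, here is a minimal sketch that ships each tracing log line to a message queue instead of a local file. I am assuming Kafka as the queue and “trace-logs” as the topic name purely for illustration; any message queue works the same way.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TraceLogShipper {
    private final KafkaProducer<String, String> producer;

    public TraceLogShipper(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // Key by requestId so all log lines of one request land in the same partition, preserving their order
    public void ship(String requestId, String logLine) {
        producer.send(new ProducerRecord<>("trace-logs", requestId, logLine));
    }
}

A consumer on the other side reads the topic and writes the records into Elasticsearch, where they can later be queried by requestId.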

To summarize, in order to investigate the reasons for the long response time of a single request, we have done the following:

  1. When recording logs, we use the requestId to link the logs, making it convenient to compare the time consumption of multiple steps in a single request.

  2. We use the static proxy approach for aspect-oriented programming so that we do not have to scatter timing-log code throughout the business code, which reduces code invasiveness; weaving the aspect in at compile time also keeps the runtime overhead low.

  3. We sample the logs instead of printing one for every request, to keep the log volume under control.

  4. Finally, to avoid the need to search logs on multiple servers during troubleshooting, we use a message queue to centralize the logs and store them in Elasticsearch.

How to Perform Distributed Tracing #

You may ask: since the title is “Distributed Tracing: How to Trace Slow Requests Across Dozens of Distributed Components,” why am I spending so much space explaining how to troubleshoot issues in a monolithic architecture? The reason is that, in a distributed environment, distributed tracing middleware is essentially built on the points I mentioned above.

In a monolithic architecture, the logs of the entire request’s latency are recorded on a single server. However, in a microservices scenario, a single request may cross multiple RPC services, resulting in the logs of a single request being distributed across multiple servers.

Of course, you can also use the requestId to link the logs on multiple servers together. However, relying solely on the requestId makes it difficult to express the calling relationship between services clearly. Therefore, we use traceId + spanId to record the calling relationship between services (where traceId is the same as requestId), with traceId stringing together a single request and spanId recording each RPC call. This may sound abstract, so let me give you a specific example.

For example, your request comes from the user side and arrives at service A, which then calls services B and C separately. Service B then calls services D and E.

img

Let me explain the content of the figure to you:

After the user reaches service A, a traceId of 100 and a spanId of 1 are initialized.

When service A calls service B, the traceId remains the same, but the spanId is marked as 1.1, indicating that the spanId of the previous level is 1, and the calling order of this level is 1.

When A calls service C, the traceId remains unchanged, but the spanId changes to 1.2, indicating that the spanId of the previous level is still 1, and the calling order becomes 2, and so on.

In this way, we can clearly see the calling relationship between services in the logs, making it convenient to adjust the log order in subsequent calculations and print out the complete call chain.
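For instance, the call chain above might produce log lines like the following (the elapsed times are made-up values); grouping by traceId and sorting by spanId reconstructs the complete chain:

traceId=100, spanId=1,     service=A, elapse=250ms   // user -> A
traceId=100, spanId=1.1,   service=B, elapse=200ms   // A -> B
traceId=100, spanId=1.1.1, service=D, elapse=50ms    // B -> D
traceId=100, spanId=1.1.2, service=E, elapse=120ms   // B -> E
traceId=100, spanId=1.2,   service=C, elapse=30ms    // A -> C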

So when is the spanId generated, and how is it passed on? This part can be considered an extension point that can help you understand the implementation principle of distributed tracing middleware.

First, before service A initiates an RPC request to service B, it retrieves the current traceId and spanId from the thread context. Then, based on the logic mentioned above, it generates the spanId for this RPC call, serializes the spanId and traceId, and assembles them into the request body to be sent to service B.

When service B receives the request, it deserializes the spanId and traceId from the request body and sets them in the thread context for the next RPC call to use. Before service B completes the call and returns the response, it calculates the execution time of service B and sends it to the message queue.
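Here is a minimal sketch of that hand-off. Every name in it (RpcRequest, the attachment methods, and this TraceContext) is made up for illustration; a real RPC framework would expose similar filter or interceptor hooks and a way to attach metadata to a request.

import java.util.HashMap;
import java.util.Map;

public class TraceFilter {

    // Minimal stand-in for an RPC request that can carry key-value attachments
    public static class RpcRequest {
        private final Map<String, String> attachments = new HashMap<>();
        public void putAttachment(String key, String value) { attachments.put(key, value); }
        public String getAttachment(String key) { return attachments.get(key); }
    }

    // Minimal per-thread trace context
    public static class TraceContext {
        private static final ThreadLocal<TraceContext> HOLDER = new ThreadLocal<>();
        private final String traceId;
        private final String spanId;
        private int childIndex = 0;

        public TraceContext(String traceId, String spanId) { this.traceId = traceId; this.spanId = spanId; }
        public static TraceContext get() { return HOLDER.get(); }
        public static void set(TraceContext ctx) { HOLDER.set(ctx); }
        public String getTraceId() { return traceId; }
        public String getSpanId() { return spanId; }
        public int nextChildIndex() { return ++childIndex; }
    }

    // Client side (service A): runs just before an RPC request is sent
    public void beforeSend(RpcRequest request) {
        TraceContext ctx = TraceContext.get(); // current traceId and spanId from the thread context
        // Child spanId = parent spanId + "." + the order of this call, e.g. "1" -> "1.1"
        String childSpanId = ctx.getSpanId() + "." + ctx.nextChildIndex();
        request.putAttachment("traceId", ctx.getTraceId());
        request.putAttachment("spanId", childSpanId);
    }

    // Server side (service B): runs right after the request is received
    public void onReceive(RpcRequest request) {
        TraceContext ctx = new TraceContext(request.getAttachment("traceId"), request.getAttachment("spanId"));
        TraceContext.set(ctx); // available for B's own downstream RPC calls and its logs
    }
}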

Of course, in service B, you can still use aspect-oriented programming to obtain the response time of all calls to databases, caches, and HTTP services. However, when sending it to the message queue, you need to add the spanId and traceId from the current thread context.

In this way, both the response time of resources like databases and the response time of RPC services are aggregated in the message queue and, after some processing, finally written to Elasticsearch for developers and operation colleagues to query.

In this case, the problem you are most likely to encounter is performance, specifically the impact on disk I/O and network I/O due to the introduction of distributed tracing middleware. And the “pitfall avoidance” guide I give you is: if you are developing your own distributed tracing middleware, make sure to provide a switch to easily turn off log printing online; if using open-source components, you can start with a lower log sampling rate and observe the system performance before adjusting it to an appropriate value.

Summary of the Lesson #

In this lesson, I have introduced to you how to troubleshoot the bottleneck in a slow request in both monolithic and service-oriented architectures. There are several key points that you need to understand:

There are two main requirements for service tracing. First, the code should be non-intrusive, and you can use aspect-oriented programming to achieve this. Second, the performance should not be significantly impacted. I recommend using static proxies and log sampling to minimize the impact of tracing logs on system performance.

Whether it is a monolithic system or a service-oriented architecture, whether it is service tracing or troubleshooting business problems, you need to add a requestId in the log. This allows you to link your logs together and present a complete problem scenario. If the requestId can be generated on the client side and passed to the server when requesting a business interface, then the log system on the client side can also be integrated, providing greater help in troubleshooting.

In fact, a distributed tracing system is not a new technology but an integration of several existing ones. It is not complicated to implement, yet it gives you cross-process call-chain visualization and service dependency analysis, and it supports performance optimization and troubleshooting, which makes it a must-have in a microservices transformation. Whether you adopt an open-source solution such as Zipkin or Jaeger, or build one within your team, you should put it in place early so that it starts delivering value before the microservices transformation is complete.