19 Distributed Environment How to Quickly Locate Problems in a Distributed Setting

Hello, I am He Xiaofeng. In the previous lecture, we learned how to establish a reliable security system, with the key point being “authentication”. We can increase the security of RPC calls by dynamically generating keys through a unified authentication service.

After reviewing the key points of the previous lecture, we will now delve into today’s topic and explore how to quickly locate problems in RPC in a distributed environment. The importance of this is self-evident. Only by accurately locating problems can we better solve them.

What are the difficulties of locating problems in a distributed environment? #

Before we continue, I want you to think about how we locate problems during development and production in general.

During the development process, it is relatively easy to troubleshoot issues. We can run the code in our local development environment using an IDE and debug it. This process makes it easy to find problems.

However, in the production environment, where the code is running online, we cannot debug it. In these cases, the simplest and most effective way to locate problems is by printing log messages. In fact, this is how we locate the majority of issues.

But what if we are dealing with a distributed production environment? For example, consider the following scenario:

We have set up a distributed application system. In this system, there are 4 services running: Service A, Service B, Service C, and Service D. The dependency relationship is A->B->C->D, and these services are deployed on different machines. During RPC calls, if a service encounters an exception in its business logic, it will throw the exception back to the caller. So, if there is an exception in any service within this call chain, how do we locate the problem?

[Diagram: Services A, B, C, and D deployed on different machines, called in the chain A->B->C->D]

Your initial reaction might still be to print log messages, and that’s a valid approach.

Let’s say we discover that Service A is throwing an exception. Is it possible that the exception is being thrown by Service B, C, or D? Absolutely. So, how do we determine which step in the entire application system is causing the problem, and on which machine it occurred? Where should we print the log messages? Moreover, if we need to print log messages for troubleshooting, we have to modify the code, which means we need to redeploy the services. What if these services involve multiple teams and departments? Consider the communication costs involved in such scenarios.

As you can see, the difficulty of locating problems in a distributed environment lies in the complex dependencies between the sub-applications and sub-services. It is often challenging to determine which service or step within the service is causing the problem. Simply troubleshooting by logging messages requires investigating each sub-application and sub-service one by one, which is not efficient. If these services happen to involve multiple teams and departments, it can be an even more challenging and time-consuming process.

How to quickly locate problems? #

Once we understand the difficulties, we can tackle them specifically. I will provide two practical methods for quickly locating problems in a distributed environment using RPC.

Method 1: Utilizing well-encapsulated exception information #

As we mentioned earlier, it is difficult to locate problems through logs due to the complex dependencies between sub-applications and sub-services. Therefore, we need to find a way to pinpoint which sub-service of a sub-application is causing the problem using the logs.

In fact, the exception information printed by a well-designed RPC framework already includes what we need to locate the exception: the type of exception (for example, a serialization failure or a network timeout), whether it occurred on the client or the server side, the IP addresses of the client and the server, the service interface and group, and so on. The details are shown in the following figure:

With this information, in the A->B->C->D call chain, we can quickly identify that the problem lies with Service C. The service interface is com.demo.CService, the client IP is 192.168.1.2, the server IP is 192.168.1.3, and the cause of the problem is that the business thread pool is full.

Therefore, an excellent RPC framework should provide detailed encapsulation of exceptions, categorize different types of exceptions, assign clear exception codes for each category, and compile them into a concise document. Users can quickly locate problems and find the causes by referring to the document based on the exception codes. The exception information should also include important information required for troubleshooting, such as the service interface name, service group, client and server IP addresses, and the reason for the exception. In summary, in a complex distributed application system, users should be able to quickly locate problems based on exception information.
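To make the idea concrete, here is a minimal sketch of what such an encapsulated exception could look like. It is an illustration only, not the API of any particular RPC framework: the class names, the exception codes (such as RPC-2001), and the field names are all assumptions, and the sample values are illustrative ones based on the thread pool example above.

```python
from enum import Enum


class RpcErrorCode(Enum):
    """Hypothetical exception codes; a real framework would publish its own table."""
    SERIALIZATION_ERROR = "RPC-1001"
    RESPONSE_TIMEOUT = "RPC-1002"
    CONNECTION_FAILED = "RPC-1003"
    THREAD_POOL_EXHAUSTED = "RPC-2001"


class RpcException(Exception):
    """Framework exception that carries everything needed to locate a failed call."""

    def __init__(self, code, side, interface, group, client_ip, server_ip, reason):
        self.code = code            # category of the failure, looked up in the code table
        self.side = side            # "client" or "server": where the exception occurred
        self.interface = interface  # service interface being called
        self.group = group          # service group
        self.client_ip = client_ip
        self.server_ip = server_ip
        self.reason = reason        # human-readable cause
        super().__init__(self.describe())

    def describe(self) -> str:
        """Render all fields as a single log-friendly line."""
        return (
            f"[{self.code.value}] {self.code.name} on {self.side} side: "
            f"interface={self.interface}, group={self.group}, "
            f"client={self.client_ip}, server={self.server_ip}, "
            f"reason={self.reason}"
        )


# Illustrative values only.
error = RpcException(
    code=RpcErrorCode.THREAD_POOL_EXHAUSTED,
    side="server",
    interface="com.demo.CService",
    group="default",
    client_ip="192.168.1.2",
    server_ip="192.168.1.3",
    reason="business thread pool is full",
)
print(error)
```

Because the code, both IP addresses, and the interface travel inside the exception itself, whoever reads the log line can look up the code in the document and go straight to the right service and machine.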

The above applies to exceptions related to the RPC framework itself, such as serialization exceptions, response timeout exceptions, connection exceptions, etc. What about business logic exceptions on the server side? The service provider should also encapsulate its own business exception information, allowing service consumers to quickly locate problems using the exception information.

Method 2: Utilizing distributed tracing #

Whether the exception comes from the RPC framework itself or from the service provider’s business logic, properly encapsulated exception information makes it much easier to locate problems in a distributed environment. But is that enough to meet our needs?

Let’s go back to the distributed scenario mentioned earlier: we have built a distributed application system consisting of 4 sub-services, with the dependency relationship among the 4 services being A->B->C->D.

Suppose these 4 services are maintained by 4 different colleagues from different departments. When service A calls service B, the colleague who maintains service A may not be aware that services C and D exist; from service A’s point of view, its only downstream service is B. So what happens if service C or service D encounters an exception that is passed back up the call chain and eventually reaches A?

In such a situation, how can the colleague who maintains service A locate the problem?

Since service A may not be aware of the existence of downstream services C and D, the colleague who maintains service A will directly contact the colleague who maintains service B. Then, the colleague who maintains service B will continue to contact the service providers of the downstream services until the problem is found. However, this approach can be costly!

Now let’s change our perspective. In fact, all we need to know is the entire call chain. Service A calls downstream service B, and service B calls its own downstream services. If the colleague who maintains service A could see the entire call chain and accurately identify in which part of it the problem occurred, that would be great. This is like tracking an express delivery: we can see the delivery route on the platform and know in real time when the package arrives at each station, so when it does not arrive on time, we immediately know where the delay occurred.

In a distributed environment, if we want to know the entire chain of service calls, we can use “distributed tracing”.

First, let’s introduce the distributed tracing system. As the name suggests, distributed tracing reconstructs a distributed request into a complete call chain. We can then follow every step of the request along that chain: whether each call succeeded, what exception was returned, which service node was called, and how long the request took.

This way, if we find a problem with a service call, we can quickly locate the problem, even when multiple departments are involved.

Next, let’s take a look at how distributed tracing is integrated into RPC frameworks.

Distributed tracing involves the concepts of Trace and Span. Let me explain each concept.

Trace represents the entire chain, and each distributed request generates a Trace. Each Trace has a unique identifier called TraceId, which is used to distinguish each Trace in the distributed tracing system.

Span represents a segment of the entire chain, which means a Trace consists of multiple Spans. Within a Trace, each Span also has a unique identifier called SpanId, and Spans are organized in a parent-child relationship. Continuing with the previous example, in the case of A->B->C->D, in the entire call chain, under normal circumstances, there would be three Spans: Span1 (A->B), Span2 (B->C), and Span3 (C->D). In this case, Span3’s parent Span is Span2, and Span2’s parent Span is Span1.
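The parent-child relationship described above can be sketched as a small data structure. This is a minimal illustration, not the data model of any specific tracing system: the `Span` fields, the `new_id` helper, and the `ancestry` function are all hypothetical names introduced here.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str                  # shared by every Span in the same Trace
    span_id: str                   # unique within the Trace
    parent_span_id: Optional[str]  # None for the first Span of the chain
    name: str


def new_id() -> str:
    """Generate a random identifier (hypothetical helper for this sketch)."""
    return uuid.uuid4().hex[:16]


def ancestry(span: Span, spans_by_id: Dict[str, Span]) -> List[str]:
    """Follow parent links from a Span up to the root; return names, root first."""
    chain = []
    current = span
    while current is not None:
        chain.append(current.name)
        current = spans_by_id.get(current.parent_span_id)
    return list(reversed(chain))


# One distributed request A->B->C->D yields one Trace with three Spans.
trace_id = new_id()
span1 = Span(trace_id, new_id(), None, "A->B")
span2 = Span(trace_id, new_id(), span1.span_id, "B->C")
span3 = Span(trace_id, new_id(), span2.span_id, "C->D")

spans_by_id = {s.span_id: s for s in (span1, span2, span3)}
print(ancestry(span3, spans_by_id))  # ['A->B', 'B->C', 'C->D']
```

Given only Span3, the parent links are enough to rebuild the whole chain, which is exactly what lets a tracing system show the full route of a request.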

The relationship between Trace and Span is shown in the following diagram:

There are many ways to implement distributed tracing systems, but they all revolve around the concepts of Trace and Span. Mastering these two concepts means you have grasped the principles of most implementation methods. Now let’s take a look at how these two concepts are used in integrating distributed tracing in RPC frameworks.

The most important things for integrating distributed tracing in RPC are “instrumentation” and “propagation”.

“Instrumentation” means that in order for the distributed tracing system to obtain complete chain information for a distributed call, the data for this call must be collected. This data collection is done through instrumentation of the RPC framework for distributed tracing.

When an RPC client calls a server, the tracing instrumentation is triggered once before the request message is sent and again when the server’s response is received. Similar instrumentation is also present on the server side. Together, these instrumentation points record a complete Span, and the origin of the chain records a complete Trace, which is then reported to the distributed tracing system.
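The client-side half of this instrumentation might look roughly like the sketch below. It assumes a hypothetical `traced_call` wrapper that the framework would place around every outgoing request, and an `InMemoryReporter` that stands in for the real tracing backend; neither name comes from an actual framework.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SpanRecord:
    name: str
    start: float
    end: Optional[float] = None
    success: Optional[bool] = None
    error: Optional[str] = None

    @property
    def duration_ms(self) -> float:
        """Elapsed time in milliseconds; only meaningful once the span has ended."""
        return (self.end - self.start) * 1000.0


class InMemoryReporter:
    """Stand-in for the tracing backend: it only keeps finished spans in a list."""

    def __init__(self) -> None:
        self.finished: List[SpanRecord] = []

    def report(self, span: SpanRecord) -> None:
        self.finished.append(span)


def traced_call(name: str, send_request: Callable[[], object], reporter: InMemoryReporter):
    """Wrap one outgoing RPC call with the two client-side instrumentation points."""
    span = SpanRecord(name=name, start=time.monotonic())  # point 1: before sending
    try:
        response = send_request()
        span.success = True
        return response
    except Exception as exc:
        span.success = False
        span.error = f"{type(exc).__name__}: {exc}"
        raise  # the caller still sees the original exception
    finally:
        span.end = time.monotonic()  # point 2: response received or call failed
        reporter.report(span)


def failing_call():
    raise TimeoutError("no response within 1s")


reporter = InMemoryReporter()
traced_call("A->B", lambda: "ok", reporter)
try:
    traced_call("B->C", failing_call, reporter)
except TimeoutError:
    pass

for record in reporter.finished:
    print(record.name, record.success, record.error)
```

Because the span is reported in the `finally` branch, a failed call is recorded with its exception and duration just like a successful one, so the failing step remains visible in the chain.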

“Propagation” means that the upstream caller passes the Trace information and the parent Span information to the downstream service along with the request. When the downstream service triggers its instrumentation, it reads this information, so every child Span recorded in the distributed tracing system carries the identity of its parent Span and of the Trace it belongs to.
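A minimal sketch of propagation follows, assuming the RPC request carries a key-value "attachments" map alongside the payload. The key names `x-trace-id` and `x-span-id`, and the `inject`, `extract`, and `start_span` helpers, are assumptions made for this illustration rather than part of any real protocol or framework.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional

TRACE_ID_KEY = "x-trace-id"  # hypothetical attachment keys
SPAN_ID_KEY = "x-span-id"


@dataclass(frozen=True)
class TraceContext:
    """What travels with the request: the Trace and the caller's Span."""
    trace_id: str
    span_id: str


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    name: str


def new_id() -> str:
    return uuid.uuid4().hex[:16]


def start_span(name: str, upstream: Optional[TraceContext] = None) -> Span:
    """Open a Span for an outgoing call; start a new Trace if nothing was propagated."""
    if upstream is None:
        return Span(new_id(), new_id(), None, name)
    return Span(upstream.trace_id, new_id(), upstream.span_id, name)


def inject(span: Span, attachments: Dict[str, str]) -> Dict[str, str]:
    """Caller side: copy the Trace and Span identifiers into the request."""
    attachments[TRACE_ID_KEY] = span.trace_id
    attachments[SPAN_ID_KEY] = span.span_id
    return attachments


def extract(attachments: Dict[str, str]) -> Optional[TraceContext]:
    """Callee side: read the identifiers back, or None if the caller sent none."""
    if TRACE_ID_KEY not in attachments or SPAN_ID_KEY not in attachments:
        return None
    return TraceContext(attachments[TRACE_ID_KEY], attachments[SPAN_ID_KEY])


# Service A is the origin of the chain, so it starts a new Trace.
span1 = start_span("A->B")
request_to_b = inject(span1, {})

# Service B reads what A propagated before calling C.
span2 = start_span("B->C", extract(request_to_b))
request_to_c = inject(span2, {})

# Service C does the same before calling D.
span3 = start_span("C->D", extract(request_to_c))

print(span2.parent_span_id == span1.span_id, span3.trace_id == span1.trace_id)
```

Each service only needs the two identifiers it received to link its own Span into the chain; it never has to know which services sit further upstream.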

Summary #

Today we explained how to quickly locate problems in a distributed environment. The difficulty lies in the complex dependencies of distributed systems, making it difficult to determine the exact point of failure. Moreover, in large distributed systems, there are often cross-department and cross-team collaborations, resulting in high communication costs when troubleshooting.

To quickly locate problems in a distributed environment, the RPC framework should comprehensively encapsulate framework-specific exceptions. Each type of exception should have a clear identification code, and these codes should be organized into a concise document. Exception messages should include the service interface name, service group, IP addresses of the client and server, and the cause of the exception. With this information, users of the framework can tell from the exception alone which call failed and why.

Furthermore, service providers should also encapsulate exceptions when providing services, making it easier for upstream consumers to troubleshoot problems.

In a distributed environment, we can use distributed tracing to quickly locate problems, especially when multiple departments are involved. This approach saves time and reduces communication costs, so that real-world problems get solved efficiently.

After-class Reflection #

In a distributed environment, what other methods do you know for quickly identifying problems?

I look forward to your sharing in the comments section. Feel free to share this article with your friends and invite them to join in the discussion. See you in the next class!