11 Link Tracing: How to Build a Custom Distributed Tracing System #

Hello, I’m Xu Changlong. In this lesson, we will talk about how to implement distributed tracing.

Distributed tracing services are write-heavy and read-light, and they are crucial for troubleshooting issues online. I once worked on a system that supported multiple business lines and ran on around 200 servers; troubleshooting issues in a system of that scale is not easy.

Therefore, I have designed a simple full log-based distributed tracing system by combining the features of the ELK (Elasticsearch, Logstash, and Kibana) stack. By linking the logs together, it greatly reduces the difficulty of system troubleshooting.

Currently, the open-source distributed tracing solutions on the market are fairly generic. Once your business reaches a certain level of complexity, it becomes necessary to customize a tracing system to fit your specific needs.

In fact, implementing a distributed tracing system is not difficult. The real challenges lie in instrumentation, data transmission, storage, and analysis. If your team possesses these capabilities, you can quickly develop a tracing system. So let’s take a look at how to implement a simple customized distributed tracing system together.

Current Status of the Monitoring Industry #

Before learning how to create a simple distributed tracing system, let’s first understand the current status of the monitoring industry in order to better grasp the design characteristics of distributed tracing.

Recently, there has been a major shift in the monitoring industry, and modern tracing standards are no longer limited to tracing requests. Integration has begun: the new standards resemble the customized design approach we take for distributed tracing here, combining traces, metrics, and logs into a single system.

Image

Before this, there were mainly three types of monitoring systems: metrics, tracing, and logging.

Image

Common open-source metrics tools include Zabbix, Nagios, Prometheus, InfluxDB, and Open-Falcon. They are mainly used for aggregating and summarizing quantitative indicators, such as remaining capacity, requests per second, average response time, and the number of requests within a given time window.

Common open-source distributed tracing tools include Jaeger, Zipkin, Pinpoint, and SkyWalking. These tools monitor and analyze the trace of each request: using a trace ID, we can follow a request's dependencies and invocation chain, locate fault points, and analyze how long each hop took and how failures propagated.

Image

Image

Common open-source logging tools include ELK, Loki, and Loggly. They primarily collect, classify, and organize text logs, aggregate and alert on error logs, and help analyze system errors and exceptions.

These three types of monitoring systems can be said to be the main pillars of large service cluster monitoring. Each has its advantages, but they have always been built separately. This has caused some fragmentation and functional redundancy in our system monitoring. Moreover, each standard requires the independent construction of a system, and different interfaces are used to analyze the same faults, making problem diagnosis inconvenient.

With industry development, a three-in-one standard has emerged: OpenTelemetry, which merges the OpenCensus and OpenTracing standards. It combines metrics, tracing, and logging, allowing us to observe the operation of the monitored system from all three perspectives.

Common open-source projects such as Prometheus and Jaeger are gradually aligning their implementations with this standard. The architecture implemented by OpenTelemetry is shown in the following diagram:

Image

In fact, distributed tracing systems and monitoring mainly provide the following supporting services:

  • Monitoring log standards
  • SDK for instrumentation (AOP or invasive)
  • Log collection
  • Distributed log transmission
  • Distributed log storage
  • Distributed retrieval and computation
  • Distributed real-time analysis
  • Customized dashboard for metrics
  • System warnings

I suggest using ELK to implement a distributed tracing system because it already provides the following functions:

  • Log collection (Filebeat)
  • Log transmission (Kafka + Logstash)
  • Log storage (Elasticsearch)
  • Retrieval and computation (Elasticsearch + Kibana)
  • Real-time analysis (Kibana)
  • Customized queries with tables (Kibana)

By defining the log format and using the SDK for instrumentation, we can achieve a distributed tracing, metrics, and log analysis system.

In fact, the biggest difference between the three kinds of monitoring (logging, metrics, and tracing) lies in their log format standards; the underlying implementations are very similar. Since ELK already provides the distributed services we need, I will briefly discuss the log format and SDK instrumentation. Through these two points, we can get a glimpse of the overall picture of distributed tracing.

TraceID Request Identifier #

To build a simple Trace system, the first thing we need to do is generate and pass the TraceID.

Image

The principle of distributed tracing is actually quite simple. When a request is initiated or a service receives a request, a UUID is generated and passed along. Any logs produced while serving the request (Warning, Info, Debug, Error), any dependent resource requests (MySQL, Kafka, Redis), and any internal interface calls (RESTful, HTTP, RPC) all carry this UUID.

In this way, when we collect all logs with the same UUID, we can link them together in the order of dependency requests based on time (with some error), RPCID (which will be introduced later), or SpanID.
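
As a minimal sketch of this idea (I use Go purely for illustration; names such as TraceContext and FromRequest are my own, not from any particular framework), the entry point reuses a trace ID passed in by the caller or generates a new one:

```go
package trace

import (
	"crypto/rand"
	"encoding/hex"
)

// TraceContext carries the identifiers that every log line must include.
type TraceContext struct {
	TraceID string // shared by all logs produced while serving one request
	RPCID   string // hierarchical counter prefix, e.g. "1.2" (explained below)
}

// newTraceID returns a random 16-byte hex string; a standard UUID works equally well.
func newTraceID() string {
	b := make([]byte, 16)
	_, _ = rand.Read(b)
	return hex.EncodeToString(b)
}

// FromRequest reuses the caller's trace ID when one was passed in,
// otherwise it starts a new trace at this entry point.
func FromRequest(incomingTraceID string) TraceContext {
	if incomingTraceID == "" {
		incomingTraceID = newTraceID()
	}
	return TraceContext{TraceID: incomingTraceID}
}
```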

As long as the logs are detailed enough, we can observe most of the system's behavior: how many interfaces a user request touched, which SQL each data query executed, how long each internal network request took along with its specific parameters and response, what the called internal interface did, and any exceptions that occurred.

At the same time, by analyzing and classifying these logs, we can analyze the calling relationships between projects, the overall health of the projects, and automatically identify fault points in the link. This helps us proactively and quickly find problems.

“RPCID” VS “SpanID” #

So how do we link the aggregated logs together? There are two approaches: SpanID and RPCID (a hierarchical counter). In both cases we record the UUID plus the SpanID or RPCID in each log entry, which lets us reconstruct the relationships between logs. So what are the differences between the two methods?

Let’s first look at the span implementation, as shown in the following diagram:

Image

Based on the diagram, let's analyze the span approach to recording chain dependencies. In code, many functions are encapsulated into functional modules (Service, Model); we combine different modules to implement business features, and record the dependencies between modules, between services, and on resources.

A span records the SpanID of the upstream service it depends on as its Parent ID, which establishes the upstream-downstream relationship. By organizing the dependencies between spans, we can assemble a call-chain tree.

So what is an RPCID? The RPCID, also known as a hierarchical counter, is a scheme I used at Weibo and TAL Education. To help you understand it better, look at the following diagram:

Image

As you can see, the implementation of the hierarchical counter for RPCID is straightforward. The first interface generates an RPCID as 1.1, where the prefix is 1 and the counter is 1 (log record is 1.1).

When the current interface requests other interfaces or data services (MySQL, Redis, API, Kafka), we increment the counter by 1 and include the value 1.2 in the request (because the prefix + “.” + counter value = 1.2). After receiving the response, when requesting the next resource, we continue incrementing the counter. Any logs generated in this process will record the prefix + “.” + counter value.

Each layer, after receiving the prefix, adds an incremented counter behind it. The actual effect is shown in the following diagram:

Image

When the requested interface receives a request, if the request contains a TraceID, the requested service will continue to use the passed TraceID. If the request does not contain a TraceID, the service generates one on its own. Similarly, if an RPCID is passed, the requested service will use the passed RPCID as the prefix, and the counter starts from 1.
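
Continuing the Go sketch from above (again, the names are illustrative only), the hierarchical counter that follows these rules can be implemented in a few lines:

```go
package trace

import "fmt"

// RPCCounter implements the hierarchical counter: the prefix is the RPCID
// received from the caller, and the counter advances once per dependency call.
type RPCCounter struct {
	prefix  string
	counter int
}

// NewRPCCounter starts counting for the current request. A service that
// receives no RPCID is the root of the chain and uses prefix "1".
func NewRPCCounter(incomingRPCID string) *RPCCounter {
	if incomingRPCID == "" {
		incomingRPCID = "1"
	}
	return &RPCCounter{prefix: incomingRPCID, counter: 1}
}

// Current is what ordinary log lines record, e.g. "1.5.1".
func (c *RPCCounter) Current() string {
	return fmt.Sprintf("%s.%d", c.prefix, c.counter)
}

// Next increments the counter before an outgoing dependency call and returns
// the value to pass along, e.g. "1.5.2"; the callee uses it as its own prefix.
func (c *RPCCounter) Next() string {
	c.counter++
	return c.Current()
}
```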

Compared to span, the RPCID created using this hierarchical counter has two advantages.

The first advantage is that the logs live on the requester's side: even if the called party (MySQL, Redis) produces no logs of its own, we can still analyze its performance from the requester's logs.

Another advantage is that even if we don't collect all the logs and some are lost, we can still estimate a log's position in the chain by counting the separators in its prefix. For example, suppose we don't know who called 1.5.1 in the diagram above; based on its UUID and the information in the 1.5.1 layer, we can still render its approximate position in the chain.

In addition, we can utilize AOP to perform performance statistics and analysis on various modules, analyzing the time consumption and number of calls for each module periodically.

At the same time, this kind of sampled statistical data helps us analyze each module's performance and error rate. Since metrics generate very little log volume (some statistics emit only one metrics log every 10 seconds), the values are easy to compare and highly informative.

However, note that when a module has multiple branch logics, metrics usually report an average, and occasional timeouts are hard to spot in an average. We therefore record the maximum and minimum latencies separately to present the picture better. Moreover, these statistics only tell us that a module has a performance problem; they cannot tell us the specific cause.
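
A rough sketch of such a per-module accumulator (the struct and field names are my own; the 10-second flush mentioned above would be driven by whatever scheduler you already use):

```go
package trace

import "sync"

// ModuleStats is a sketch of the per-module sampling described above: call
// count, error count, and total/max/min duration, meant to be flushed as a
// single metrics log entry every 10 seconds or so.
type ModuleStats struct {
	mu       sync.Mutex
	Count    int64
	Errors   int64
	Total    float64 // total duration in seconds; Total/Count gives the average
	Max, Min float64 // record extremes so occasional timeouts are not hidden by the average
}

// Observe records one call to the module.
func (s *ModuleStats) Observe(duration float64, failed bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.Count++
	if failed {
		s.Errors++
	}
	s.Total += duration
	if duration > s.Max {
		s.Max = duration
	}
	if s.Min == 0 || duration < s.Min {
		s.Min = duration
	}
}
```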

Going back to the earlier topic: the requester and the requested party use the TraceID and RPCID (or SpanID) to achieve traceability. Here are a few common ways to pass them:

  • For the HTTP protocol, it can be included in the header (see the sketch after this list).
  • For RPC protocols, it can be passed in the metadata.
  • For message queues, it can be included in the header of the message body or directly in the message body.
  • For other special cases, it can be passed through URL request parameters.
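
For the HTTP case, here is a minimal sketch reusing the TraceContext and RPCCounter types from above (the header names X-Trace-Id and X-Rpc-Id are just the convention I assume here, not a standard):

```go
package trace

import "net/http"

// CallWithTrace shows the HTTP case: the identifiers travel as request headers.
// The header names X-Trace-Id and X-Rpc-Id are only a naming convention here.
func CallWithTrace(tc TraceContext, rpc *RPCCounter, url string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Trace-Id", tc.TraceID)
	req.Header.Set("X-Rpc-Id", rpc.Next()) // advance the counter for this outgoing call
	return http.DefaultClient.Do(req)
}
```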

Now, how do we pass the TraceID between threads and coroutines within an application? Generally, we make a copy of the context and pass it into the thread or coroutine. If the branches run in parallel, we increment the RPCID counter before handing it down and use the prefix plus the new counter value as the branch's prefix, so the parallel chains can be told apart.
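
A minimal sketch of that hand-off, again reusing the types from the earlier sketches:

```go
package trace

// Fork hands a copy of the trace information to a parallel goroutine:
// the RPCID counter is advanced first, and the returned value (say "1.2")
// becomes the new prefix for everything the branch logs, so parallel chains
// remain distinguishable.
func Fork(parent TraceContext, rpc *RPCCounter, work func(TraceContext)) {
	branch := parent          // copy the trace context for the new goroutine
	branch.RPCID = rpc.Next() // the new prefix that separates this parallel chain
	go work(branch)
}
```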

In addition, we made a special design: when a request carries a special passphrase and sets a header such as X-DEBUG: 1, online debugging mode is enabled. The called interface and all of its dependent services then output debug-level logs, which makes it more convenient to troubleshoot issues temporarily in production.
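
A sketch of that switch (the X-DEBUG and X-DEBUG-TOKEN header names and the passphrase check are assumptions for illustration; use whatever convention fits your gateway):

```go
package trace

import "net/http"

// debugEnabled is a sketch of the online-debugging switch: debug-level logging
// is turned on for a single request only when the X-DEBUG header is set and the
// shared passphrase matches. Both header names are assumptions, not a standard.
func debugEnabled(r *http.Request, passphrase string) bool {
	return r.Header.Get("X-DEBUG") == "1" &&
		r.Header.Get("X-DEBUG-TOKEN") == passphrase
}
```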

Definition of Log Types #

It can be said that as long as every log outputs the current TraceID and RPCID (or SpanID), and the counter value is passed along whenever a dependent resource is requested, most of the distributed trace chain is already in place. Here are some customized log types and formats for your reference:

Log Types #

  • request.info: Information related to the currently requested interface, such as the requested interface, duration, parameters, return values, and client information.
  • mysql.connect: Duration of mysql connection.
  • mysql.connect.error: Error information for mysql connection.
  • mysql.request: Duration and related information when executing mysql query commands.
  • mysql.request.error: Error information when performing mysql operations.
  • redis.connect: Duration of redis connection.
  • redis.connect.error: Error information for redis connection.
  • redis.request: Executing redis command.
  • redis.request.error: Errors during redis operations.
  • memcache.connect
  • memcache.connect.error
  • memcache.request.error
  • http.get: Supporting RESTful operations such as get, put, delete
  • http.post
  • http.*.error

Metric Log Types #

  • metric.counter …omitted

Log Level Types #

  • log.debug: Debug log.
  • log.trace: Trace log.
  • log.notice: Notice log.
  • log.info: Info log.
  • log.error: Application error log.
  • log.alarm: Alarm log.
  • log.exception: Exception log.

You will find that there are corresponding logs for every request to a dependent resource, which helps us analyze the time consumption and the returned content of all dependencies. In addition, the log level types are also within the scope of tracing, so the log information itself helps us analyze issues. Moreover, for static languages we can also, as mentioned earlier, collect metrics for certain modules and emit them as logs periodically.

Log Format Example #

It is recommended to use the JSON format for logs. All fields should be saved as strings, except those explicitly marked with another type (int, float, or JSON) in the example below. Each field must keep a fixed data type, and optional fields should not be output when they have no content.

This design is meant to fit Elasticsearch + Kibana. Kibana provides log aggregation, retrieval, conditional filtering, and numeric aggregation, but it is sensitive to field types: a field that is not numeric cannot be aggregated or compared numerically.

Below is an example for log tracing and monitoring. You should mainly focus on its type and the purpose of each field.

{
    "name": "string: Introduction to all fields, required. Used to distinguish log types, write the content of the log list above here.",
    "trace_id": "string: Trace ID, required.",
    "rpc_id": "string: RPC ID, required for server-side trace, not required for client side.",
    "department": "string: Department abbreviation, such as client_frontend, required.",
    "version": "string: Current service version, recommended to be filled, e.g. cpp-client-1.1, php-baseserver-1.4, java-rti-1.9.",
    "timestamp": "int: Timestamp of the log, in seconds, required.",
    
    "duration": "float: Duration, floating point number, in seconds, if available, fill it.",
    "module": "string: Module path, recommended format: application_name_module_name_function_name_action, required.",
    "source": "string: Request source, for web pages, record ref page, optional.",
    "uid": "string: Current user UID, if not available, fill it with a zero-length string, optional. Can help analyze user behavior over a period of time.",
    "pid": "string: Process PID, if not available, fill it with a zero-length string. If there are threads, it can be in the format of pid-tid, optional.",
    "server_ip": "string: Current server IP, required.",
    "client_ip": "string: Client IP, optional.",
    "user_agent": "string: User Agent, curl/7.29.0, optional.",
    "host": "string: IP and port number of the target link, used to distinguish environment, e.g. 12.123.23.1:3306, optional.",
    "instance_name": "string: Identifier for database connection configuration, e.g. database connection for rti, optional.",
    "db": "string: Database name, e.g. peiyou_stastic, optional.",
    "code": "string: Various driver, error, or service error codes, optional. Required when reporting errors.",
    "msg": "string: Error message or other prompt information, optional. Required when reporting errors.",
    "backtrace": "string: Backtrace information of the error, optional. Required when reporting errors.",
    "action": "string: Can be URL, SQL, Redis command, any command executed remotely, required.",
    "param": "string: Common parameter template, used with script, record all request parameters, required.",
    "file": "string: userinfo.php, optional.",
    "line": "string: 232, optional.",
    "response": "string: Result returned by the request, can be data returned by this interface or other resources. If the data is too long, it may affect performance. Optional.",
    "response_length": "int: Length of the response content result, optional.",
    "dns_duration": "float: DNS resolution time, usually appears when requesting the domain in HTTP or MySQL. Optional.",
    "extra": "json: Can put anything you want, all user additional data can be put here."
}

Example #

Request Log #

{
    "x_name": "request.info",
    "x_trace_id": "123jiojfdsao",
    "x_rpc_id": "0.1",
    "x_version": "php-baseserver-4.0",
    "x_department": "tal_client_frontend",
    "x_timestamp": 1506480162,
    "x_duration": 0.021,
    "x_uid": "9527",
    "x_pid": "123",
    "x_module": "js_game1_start",
    "x_user_agent": "string curl/7.29.0",
    "x_action": "http://testapi.speiyou.com/v3/user/getinfo?id=9527",
    "x_server_ip": "192.168.1.1:80",
    "x_client_ip": "192.168.1.123",
    "x_param": "json string",
    "x_source": "www.baidu.com",
    "x_code": "200",
    "x_response": "json:api result",
    "x_response_len": 12324
}

MySQL Connection Performance Log #

{
    "x_name": "mysql.connect",
    "x_trace_id": "123jiojfdsao",
    "x_rpc_id": "0.2",
    "x_version": "php-baseserver-4",
    "x_department": "tal_client_frontend",
    "x_timestamp": 1506480162,
    "x_duration": 0.024,
    "x_uid": "9527",
    "x_pid": "123",
    "x_module": "js_mysql_connect",
    "x_instance_name": "default",
    "x_host": "12.123.23.1:3306",
    "x_db": "tal_game_round",
    "x_msg": "ok",
    "x_code": "1",
    "x_response": "json:****"
}

MySQL Request Log #

{
    "x_name": "mysql.request",
    "x_trace_id": "123jiojfdsao",
    "x_rpc_id": "0.2",
    "x_version": "php-4",
    "x_department": "tal_client_frontend",
    "x_timestamp": 1506480162,
    "x_duration": 0.024,
    "x_uid": "9527",
    "x_pid": "123",
    "x_module": "js_game1_round_sigup",
    "x_instance_name": "default",
    "x_host": "12.123.23.1:3306",
    "x_db": "tal_game_round",
    "x_action": "select * from xxx where xxxx",
    "x_param": "json string",
    "x_code": "1",
    "x_msg": "ok",
    "x_response": "json:****"
}

HTTP Request Log #

{
    "x_name": "http.post",
    "x_trace_id": "123jiojfdsao",
    "x_department": "tal_client_frontend",
    "x_rpc_id": "0.3",
    "x_version": "php-4",
    "x_timestamp": 1506480162,
    "x_duration": 0.214,
    "x_uid": "9527",
    "x_pid": "123",
    "x_module": "js_game1_round_win_report",
    "x_action": "http://testapi.speiyou.com/v3/game/report",
    "x_param": "json:",
    "x_server_ip": "192.168.1.1",
    "x_msg": "ok",
    "x_code": "200",
    "x_response_len": 12324,
    "x_response": "json:responsexxxx",
    "x_dns_duration": 0.001
}

Info Level Log #

{
    "x_name": "log.info",
    "x_trace_id": "123jiojfdsao",
    "x_department": "tal_client_frontend",
    "x_rpc_id": "0.3",
    "x_version": "php-4",
    "x_timestamp": 1506480162,
    "x_duration": 0.214,
    "x_uid": "9527",
    "x_pid": "123",
    "x_module": "game1_round_win_round_end",
    "x_file": "userinfo.php",
    "x_line": "232",
    "x_msg": "ok",
    "x_code": "201",
    "extra": "json game_id lesson_num  xxxxx"
}

Exception Log #

{
    "x_name": "log.exception",
    "x_trace_id": "123jiojfdsao",
    "x_department": "tal_client_frontend",
    "x_rpc_id": "0.3",
    "x_version": "php-4",
    "x_timestamp": 1506480162,
    "x_duration": 0.214,
    "x_uid": "9527",
    "x_pid": "123",
    "x_module": "game1_round_win",
    "x_file": "userinfo.php",
    "x_line": "232",
    "x_msg": "exception:xxxxx call stack",
    "x_code": "hy20001",
    "x_backtrace": "xxxxx.php(123) gotError:..."
}

Business-Initiated Alarm Log #

{
    "x_name": "log.alarm",
    "x_trace_id": "123jiojfdsao",
    "x_department": "tal_client_frontend",
    "x_rpc_id": "0.3",
    "x_version": "php-4",
    "x_timestamp": 1506480162,
    "x_duration": 0.214,
    "x_uid": "9527",
    "x_pid": "123",
    "x_module": "game1_round_win_round_report",
    "x_file": "game_win_notify.php",
    "x_line": "123",
    "x_msg": "game report request fail! retryed three time..",
    "x_code": "201",
    "x_extra": "json game_id lesson_num  xxxxx"
}

Metrics Counter Log #

{
    "x_name": "metrix.count",
    "x_trace_id": "123jiojfdsao",
    "x_department": "tal_client_frontend",
    "x_rpc_id": "0.3",
    "x_version": "php-4",
    "x_timestamp": 1506480162,
    "x_uid": "9527",
    "x_pid": "123",
    "x_module": "game1_round_win_click",
    "x_extra": "json curl invoke count"
}

These logs are useful not only on the server side but also on the client. When a user clicks or triggers an action on the client, a new TraceID can be generated and sent along with the request to the server. With these logs we can analyze service access performance from different regions, and by simply adding our own log types they can double as user-behavior logs.

The log examples above describe our resource dependencies very clearly. One more tip: the requested side can also record the incoming interface call as a performance log, with its own time consumption and other information, which makes it easy to compare against the requester's log later and determine whether there is network delay or other issues between the two.

Beyond this, there is a core point in the design: developers do not have to strictly follow the field rules above when generating logs. As long as the business guarantees that the required fields (TraceID, RPCID/SpanID, timestamp) appear in every log, and that the meaning and type of each numeric field stay stable, tracing can be achieved.

After summarizing the logs, we can interpret the different log fields ourselves and customize the statistical analysis required by different businesses. This is the most powerful feature of ELK.

Why does this design mostly log resource dependencies? Because in the absence of I/O, most program behavior is controllable (except for computation-heavy services). Only I/O operations introduce unstable factors, and logging the operations on data sources helps us debug business logic errors.

We mentioned earlier that excessive logging hurts interface performance. So how can we improve log write throughput? Here are a few tips and tricks:

  1. Increase the number of write threads, one thread per log file. You can also place each log file on a separate disk, but be careful not to exceed the system's IOPS capacity (roughly 100 IOPS for a mechanical disk).

  2. When log lines exceed 1 KB, do not let multiple threads write to the same file under high concurrency. The reason is explained in "append is not Atomic": a file append is not atomic for writes longer than the buffer length, so long lines written concurrently by multiple threads to the same file will interleave and corrupt the log.

  3. Logs can be buffered in memory and written to disk only when a certain amount has accumulated or the cache is more than 2 seconds old. This reduces the number of system calls made for small writes, at the cost of possibly losing logs if the process terminates abruptly.

  4. The log cache should be pre-allocated with a fixed length; allocating and recycling it frequently can slow the whole system down.

  5. When the service is killed, remember to intercept the signal and quickly fsync the in-memory logs to disk to reduce the chance of losing them. A minimal sketch combining points 3 to 5 follows this list.
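
Below is a rough Go sketch of such a buffered writer under the assumptions above (the 1 MB buffer, 4096-entry channel, and 2-second ticker are arbitrary choices, not recommendations):

```go
package logbuf

import (
	"bufio"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// Writer is a sketch of a buffered log writer: lines accumulate in a fixed,
// pre-allocated buffer (point 4) and are flushed when the buffer fills or
// at least every two seconds (point 3).
type Writer struct {
	file *os.File
	buf  *bufio.Writer
	in   chan []byte
	quit chan os.Signal
}

func New(path string) (*Writer, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	w := &Writer{
		file: f,
		buf:  bufio.NewWriterSize(f, 1<<20), // fixed 1 MB buffer, allocated once
		in:   make(chan []byte, 4096),
		quit: make(chan os.Signal, 1),
	}
	signal.Notify(w.quit, syscall.SIGTERM, syscall.SIGINT) // point 5: intercept kill signals
	go w.loop()
	return w, nil
}

// Write queues one log line; the caller does not block on disk I/O directly.
func (w *Writer) Write(line []byte) { w.in <- line }

func (w *Writer) loop() {
	tick := time.NewTicker(2 * time.Second)
	defer tick.Stop()
	for {
		select {
		case line := <-w.in:
			w.buf.Write(line)
			w.buf.WriteByte('\n')
		case <-tick.C:
			w.buf.Flush() // push accumulated logs out at least every 2 seconds
		case <-w.quit:
			w.buf.Flush() // the service is being killed: flush and fsync what is in memory
			w.file.Sync()
			os.Exit(0)
		}
	}
}
```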

“Invasive Instrumentation SDK” VS “AOP Instrumentation” #

Finally, let’s talk about SDKs. In fact, using “ELK+custom standards” can already achieve most distributed tracing systems. Using Kibana, various logs can be quickly aggregated, analyzed, and statistically summarized.

Although there have been many companies in the industry that provide distributed tracing system services and have developed many APM and similar products, only a few of them have been widely adopted. I believe the following factors contribute to this:

  • Distributed tracing logs have high throughput and consume a large amount of resources, resulting in high costs.
  • It is difficult for general distributed tracing services to be tailored to specific business needs. Third-party services that cannot be customized are not as good as open-source solutions.
  • The instrumentation libraries for distributed tracing have a large impact on the code: they need to be manually embedded into the business code, which is cumbersome and not flexible enough.
  • This approach also has language limitations. Currently, only Java can achieve AOP injection for a static language by dynamically attaching an agent at startup. When I promoted tracing before, I unified the open-source frameworks used by intranet projects in order to achieve unified instrumentation.

So, how can we implement distributed tracing in a simple and straightforward way if the underlying code cannot be updated?

In that case, we can modify the logging layer so that every log written automatically includes the TraceID and RPCID (or SpanID); this alone achieves good results. If the request-sending code is well encapsulated, we can emit standard-compliant performance logs at the points where requests are initiated, and we can also inject our standardized tracing into the framework's unified exception handler, which gives us monitoring at the key points.
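
As a sketch of such a logging-layer wrapper (reusing the earlier types; the field names follow the log format defined above, everything else is illustrative):

```go
package trace

import (
	"encoding/json"
	"os"
	"time"
)

// Logger is a sketch of wrapping the logging layer so that every entry
// automatically carries trace_id and rpc_id in the format described above.
type Logger struct {
	tc  TraceContext
	rpc *RPCCounter
	out *os.File
}

// Log writes one JSON log line. name is one of the log types listed earlier
// (request.info, mysql.request, log.error, ...); extra holds type-specific fields.
func (l *Logger) Log(name string, extra map[string]interface{}) error {
	entry := map[string]interface{}{
		"name":      name,
		"trace_id":  l.tc.TraceID,
		"rpc_id":    l.rpc.Current(),
		"timestamp": time.Now().Unix(),
	}
	for k, v := range extra {
		entry[k] = v
	}
	line, err := json.Marshal(entry)
	if err != nil {
		return err
	}
	_, err = l.out.Write(append(line, '\n'))
	return err
}
```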

Of course, if conditions permit, it is best to provide a standardized SDK that business developers can call as needed; this helps us unify the log structure. After all, hand-written logs are easy to format incorrectly and need manual cleanup. Even messy logs still follow a pattern, though, and this is the power of the ELK architecture: its full-text search does not care about the input format. Statistical analysis, however, does require each field to be used consistently.

Finally, let’s discuss some other considerations for logging. You may have noticed that this design logs all events. Many distributed tracing systems actually use sampling methods. For example, Jaeger deploys an agent locally in the application, which temporarily stores and aggregates data, and then calculates the average response time for each interface. Requests with the same characteristics are classified and summarized, which considerably reduces the server-side load.

However, there are also disadvantages to this approach. If we have some rare business logic errors, they may be missed during sampling. Therefore, many core systems record full logs, while peripheral businesses record sample logs.

Since our log structure is very simple, if needed, you can implement a functionality similar to an agent to reduce storage and computing pressure. We can even save the original logs locally on the server for 7 days. When we need to find a specific trace log, we can directly request all servers to search locally. In fact, in situations where there are many writes and few reads, it is acceptable to request data from 200 servers just to trace a detailed process, even if it takes ten seconds.

Finally, to facilitate understanding, I provide you with an example of an AOP trace SDK for the Laravel framework that I previously wrote. You can refer to it here: laravel-aop-trace.

Summary #

System monitoring has always been a key focus for server-side development. We often analyze and troubleshoot online issues based on traceability and process logs. In other words, the closer monitoring is to the business and the more customized it is, the more intuitive our understanding of the operation of the online business will be.

However, implementing a monitoring system that is more in line with the business is not easy. Basic operations and monitoring only cover fundamental metrics such as online request traffic, response speed, system errors, and system resources. When we need to monitor the business itself, we still need to embed a large amount of code manually into the business system. Furthermore, because these services are open source, we are required to have a deep understanding of monitoring and invest a lot of effort.

Fortunately, as technology matures, general-purpose simple log transmission, indexing, and statistics services are becoming popular. The strongest combination among them is ELK. Through such distributed log technology, we can easily achieve personalized monitoring requirements. The log format can be messy, but as long as we pass on the TraceID and RPCID (or SpanID) when requesting dependent resources, and record the corresponding fields along the way, it doesn’t matter. It is precisely because of this that ELK has become popular, and many companies rely on ELK to customize their own monitoring systems for their core businesses.

However, doing so only allows us to establish a rough tracking system. The difficulty and cost of subsequent analysis are still high because ELK requires a large amount of hardware resources to process massive data. We will explore related knowledge in subsequent chapters.

Discussion Questions #

Please take a moment to think about why it was so difficult to implement trace in the past, when it is so simple now with ELK.

Feel free to leave your thoughts and discuss with me in the comments section. See you in the next class!