
13 Real-Time Statistical Tracing: Practical Algorithms for Real-Time Computation in Tracing #

Hello, I’m Xu Changlong.

In the previous lessons, we learned about the ELK architecture and how to quickly implement a custom distributed tracing system using it. However, ELK is a large system, and the prerequisite for using it is that we have at least three high-performance servers.

If our data volume is large, we would need even more server resources. The largest scale we had before was around 2000 servers for ELK. But what if our server resources are scarce? How can we implement performance analysis, statistics, and monitoring in such a scenario?

At that time, I only had two servers with 4 cores and 8 GB of RAM each. So I used some clever algorithms to achieve functionality that would normally require a large number of servers for parallel computing. In this lesson, I will share these algorithms with you.

First, I will show you the overall structure diagram of real-time computing to help you get a general impression.

[Image: overall architecture of the real-time computation pipeline]

From the diagram above, you can see that the real-time computation process pulls data from Kafka as part of a consumer group and performs real-time computation and statistics on the logs it consumes; a minimal sketch of that consumption loop follows. After that, let's take a closer look at the ideas and functions behind each of these algorithms.
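The sketch below assumes the segmentio/kafka-go client and hypothetical broker, topic, and group names; the lesson does not specify which Kafka client is used, so treat this only as an outline of the consumer side.

package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	// Join a consumer group so several statistics processes can share partitions.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"127.0.0.1:9092"}, // assumption: a local broker
		GroupID: "trace-stats",              // hypothetical consumer-group name
		Topic:   "trace-logs",               // hypothetical topic name
	})
	defer r.Close()

	for {
		msg, err := r.ReadMessage(context.Background())
		if err != nil {
			log.Printf("read error: %v", err)
			break
		}
		// Each message is one log entry; feed it into the in-memory
		// aggregation steps described in the rest of this lesson.
		process(msg.Value)
	}
}

// process is a placeholder for URL cleanup, time-chunk statistics,
// and error-log clustering.
func process(logLine []byte) {
	_ = logLine
}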

Aggregating URLs without Parameters #

Anyone who works on distributed tracing often finds it difficult to strip parameters out of URLs. The main reason is that many teams design internal interfaces in a RESTful style. However, when performing tracing or statistical analysis at the API level, these parameterized URLs cannot be fed directly into the statistical analysis system without being normalized first.

Because the same API cannot be grouped together when its parameters differ, the URLs end up not being unique, and aggregating thousands of these "different" URLs exhausts resources and crashes the statistical system. In addition, in a RESTful design, different methods on the same URL are actually different operations, so the same URL does not necessarily represent the same interface, which complicates classification and statistics even further.

To help you understand, here are a few examples of RESTful implementations:

  • GET geekbang.com/user/1002312/info (get user information)
  • PUT geekbang.com/user/1002312/info (modify user information)
  • DELETE geekbang.com/user/1002312/friend/123455 (delete a user's friend)

As you can see, these URLs contain parameters, and even when the URL path is the same, GET and PUT represent different operations. The same problem arises when using tools such as Prometheus or a tracing system.

Generally, when encountering this kind of problem, we would first organize the data and then input it into the statistical analysis system. There are two common ways to remove parameters from URLs.

The first method is manually configuring replacement templates. This means manually configuring a URL rule to filter out logs that match the rule and replace the key parts.

I usually store the list of URL replacement rules in a trie-like (prefix tree) structure, which speeds up lookups. However, this method has a drawback: it requires manual maintenance. If the development team exceeds 200 people, the rule list needs frequent updates, which becomes cumbersome to maintain.

Radix-tree-like structure:
/user
 - /*
 -  - /info
 -  -  - :GET
 -  -  - :PUT
 -  - /friend
 -  -  - /*
 -  -  -  - :DELETE

The specific implementation is to split the URL using “/”, and then search in the prefix tree level by level.

Let me give you an example. Suppose we request GET /user/1002312/info. When searching the tree, we first find the root node "/user". Then, among the child nodes of "/user", we find the element "/*" (which means the segment at this position can be replaced), and nothing else at that level matches, so we record it as "replace here". Next, we continue searching among the child nodes of "/*" and find "/info". At this point, the URL path is fully matched.

One level deeper in the tree is the specific request method. We find the GET operation, which completes the match for this URL. Then we simply replace the "1002312" segment matched by "/*" with a fixed string. The replacement result looks like this:

GET /user/1002312/info is replaced with /user/replaced/info
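To make the first method concrete, here is a minimal Go sketch of the radix-tree-style matching. The node type, the rule strings, and the "replaced" placeholder are my own illustration, and the per-method leaves shown in the tree above are omitted for brevity.

package main

import (
	"fmt"
	"strings"
)

// node is one level of the URL template tree; "*" matches any single segment.
type node struct {
	children map[string]*node
}

func newNode() *node { return &node{children: map[string]*node{}} }

// insert registers a template such as "/user/*/info".
func (n *node) insert(tpl string) {
	cur := n
	for _, seg := range strings.Split(strings.Trim(tpl, "/"), "/") {
		child, ok := cur.children[seg]
		if !ok {
			child = newNode()
			cur.children[seg] = child
		}
		cur = child
	}
}

// rewrite walks the tree level by level and replaces segments matched by "*".
func (n *node) rewrite(url string) string {
	segs := strings.Split(strings.Trim(url, "/"), "/")
	cur := n
	for i, seg := range segs {
		if child, ok := cur.children[seg]; ok { // exact match wins
			cur = child
			continue
		}
		if child, ok := cur.children["*"]; ok { // wildcard level: replace this segment
			segs[i] = "replaced"
			cur = child
			continue
		}
		return url // no template matches; keep the URL as is
	}
	return "/" + strings.Join(segs, "/")
}

func main() {
	root := newNode()
	root.insert("/user/*/info")
	root.insert("/user/*/friend/*")

	fmt.Println(root.rewrite("/user/1002312/info"))          // /user/replaced/info
	fmt.Println(root.rewrite("/user/1002312/friend/123455")) // /user/replaced/friend/replaced
}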

The other method is data-feature filtering. Although it introduces some error, it is simple to implement and requires no manual maintenance. This is the method I prefer: it may occasionally misclassify a segment, but it is far more convenient than the first approach.

Please refer to the demonstration code below:

// Filter URL parameters based on data features
function filterUrl($url)
{
    $urlArr = explode("/", $url);

    foreach ($urlArr as $urlIndex => $urlItem) {
        $totalChar = 0; // Number of letters
        $totalNum = 0; // Number of digits
        $totalLen = strlen($urlItem); // Total length

        for ($index = 0; $index < $totalLen; $index++) {
            if (is_numeric($urlItem[$index])) {
                $totalNum++;
            } else {
                $totalChar++;
            }
        }

        // Filter out hash-like values (e.g. a 32-character MD5 or 64-character SHA-256) that mix digits and letters
        if (($totalLen == 32 || $totalLen == 64) && $totalChar > 0 && $totalNum > 0) {
            $urlArr[$urlIndex] = "*md*";
            continue;
        }

        // String data parameter is a combination of numbers and English letters, longer than 3 characters (to avoid things like v1/v2 versions)
        if ($totalLen > 3 && $totalChar > 0 && $totalNum > 0) {
            $urlArr[$urlIndex] = "*data*";
            continue;
        }

        // If all digits, it is considered an ID in the URL and can be directly replaced
        if ($totalChar == 0 && $totalNum > 0) {
            $urlArr[$urlIndex] = "*num*";
            continue;
        }
    }
    return implode("/", $urlArr);
}

With these two methods, we can easily replace our URLs like this:

  • GET geekbang.com/user/1002312/info => geekbang.com/user/*num*/info_GET
  • PUT geekbang.com/user/1002312/info => geekbang.com/user/*num*/info_PUT
  • DELETE geekbang.com/user/1002312/friend/123455 => geekbang.com/user/*num*/friend/*num*_DEL

After filtering, doesn’t our API list look much cleaner? This makes it more convenient to perform API aggregation and statistical analysis.

Time Chunk Statistics #

After removing the URL parameters, we can collect performance statistics for each interface. I implemented this with time chunks. I designed it this way because my log-consumption service has very limited memory (only 8 GB), and saving too much data to the database would make real-time updates very slow.

After careful consideration, I chose to divide time into chunks and aggregate each period's statistics in memory.

To display the data nicely, I divide each day into 24 hours and each hour into four 15-minute time chunks. Each time chunk holds the aggregated interface data for that period, forming one data statistics block.

As a result, there are 96 data statistics blocks per day (86,400 seconds / (15 × 60 seconds) = 96). If there are 200 APIs, one day's statistics amount to 19,200 entries in memory (96 × 200 = 19,200).

[Image: a day divided into 96 fifteen-minute statistics blocks]

Assuming we are monitoring 200 interfaces in the system, one year of statistics comes to roughly 7 million entries (19,200 × 365). If necessary, we can make the granularity even finer.
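As a rough illustration of how a request maps to one of those 96 chunks, here is a small Go sketch; the key format and the chunkKey name are my own, not from the lesson.

package main

import (
	"fmt"
	"time"
)

// chunkKey maps a cleaned-up URL+method and a timestamp to one of the
// 96 fifteen-minute statistics blocks of that day.
func chunkKey(urlWithMethod string, ts time.Time) string {
	chunk := ts.Hour()*4 + ts.Minute()/15 // 0..95
	return fmt.Sprintf("%s|%s|%02d", urlWithMethod, ts.Format("2006-01-02"), chunk)
}

func main() {
	ts := time.Date(2023, time.May, 1, 14, 37, 0, 0, time.UTC)
	fmt.Println(chunkKey("/user/*num*/info_GET", ts)) // /user/*num*/info_GET|2023-05-01|58
}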

In fact, many metrics monitoring tools on the market have a time chunk granularity of 3 to 5 seconds. It was not until recent years that OLAP and time-series databases were introduced, which allowed for performance statistics at a second-level granularity. The smaller the granularity, the more detailed the monitoring, while a larger granularity can only show the average performance for a period of time.

As an aside, in the past couple of years InfluxDB and Prometheus have become popular options for storing this kind of data. However, they come with hardware and maintenance costs, so weigh them against your own business situation.

Let’s take a look at what content is being aggregated for the URL in a time chunk of 15 minutes.

[Image: metrics aggregated inside one 15-minute statistics block]

As shown in the above figure, each data statistical block aggregates the following metrics:

  • Total number of requests
  • Slowest response time
  • Fastest response time
  • Average response time
  • Response-time distribution, using quartile analysis such as ELK provides (if quartiles cannot be computed over the complete data, you can instead count requests with response times under 200 ms, under 500 ms, under 1000 ms, and over 1 second)
  • HTTP codes of API responses and their corresponding numbers (e.g., {“200”: 1343, “500”: 23, “404”: 12, “301”: 14})
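In code, such a statistics block might look roughly like the sketch below. The StatBlock type, field names, and Record function are my own illustration; only the metrics themselves and the fallback latency buckets come from the list above. The average is derived at display time as TotalMs / Count.

package main

import (
	"fmt"
	"sync"
)

// StatBlock aggregates one API's metrics inside a single 15-minute chunk.
type StatBlock struct {
	Count       int64            // total number of requests
	SlowestMs   int64            // slowest response time
	FastestMs   int64            // fastest response time
	TotalMs     int64            // sum of response times; average = TotalMs / Count
	LatencyBins map[string]int64 // e.g. "<200ms", "<500ms", "<1000ms", ">=1000ms"
	HTTPCodes   map[string]int64 // e.g. {"200": 1343, "500": 23}
}

var (
	mu     sync.Mutex
	blocks = map[string]*StatBlock{} // keyed by the chunkKey from the previous sketch
)

// Record folds one request into its statistics block.
func Record(key string, durMs int64, httpCode string) {
	mu.Lock()
	defer mu.Unlock()

	b, ok := blocks[key]
	if !ok {
		b = &StatBlock{FastestMs: durMs, LatencyBins: map[string]int64{}, HTTPCodes: map[string]int64{}}
		blocks[key] = b
	}
	b.Count++
	b.TotalMs += durMs
	if durMs > b.SlowestMs {
		b.SlowestMs = durMs
	}
	if durMs < b.FastestMs {
		b.FastestMs = durMs
	}
	switch {
	case durMs < 200:
		b.LatencyBins["<200ms"]++
	case durMs < 500:
		b.LatencyBins["<500ms"]++
	case durMs < 1000:
		b.LatencyBins["<1000ms"]++
	default:
		b.LatencyBins[">=1000ms"]++
	}
	b.HTTPCodes[httpCode]++
}

func main() {
	key := "/user/*num*/info_GET|2023-05-01|58"
	Record(key, 180, "200")
	Record(key, 950, "500")
	b := blocks[key]
	fmt.Printf("count=%d avg=%dms codes=%v\n", b.Count, b.TotalMs/b.Count, b.HTTPCodes)
}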

Displaying these metrics is mainly for analyzing interface performance. At this point you may wonder whether it is really worth the trouble to monitor in such detail.

Indeed, in most cases our APIs perform well, and only in special situations do individual interfaces respond slowly. However, besides watching for widespread failures, we must not ignore small, sporadic ones. On high-throughput servers these minor failures are even harder to spot.

Monitoring is the only way to surface these rare failures early enough to troubleshoot them. In extreme cases, such failures can bring down an entire cluster, so discovering and handling them in advance keeps the online system from suddenly collapsing under high traffic and concurrency.

Error Log Clustering #

After monitoring and collecting statistical information, we also need to pay attention to error logs. When it comes to troubleshooting, we must mention the method of error log clustering.

We all know that common online failures are usually accompanied by a flood of error logs. Faced with that many warnings, we need the latest error messages, but we also cannot afford to miss important failures that occur only rarely.

Due to limited resources, we cannot store too many error logs in memory. Therefore, the solution of log clustering is a good choice, which classifies errors through log aggregation and assists users in troubleshooting. By doing this, we can not only discover errors but also provide error templates to speed up the troubleshooting process.

Here is how I implemented the log error clustering feature: I directly calculate the similarity of logs by performing approximate matching, supplemented with auxiliary fields for adjustment. This feature can group together logs that belong to the same error category but have different individual parameters, making it convenient for us to quickly discover low-frequency failures.

Implementing error monitoring through this method has additional benefits. With it, there is no need for a unified log format standard for the entire platform. Instead, it can easily adapt to various log formats, greatly facilitating our monitoring of different systems.

Curious about the implementation details? Below is an example of text-similarity comparison using simhash, based on github.com/mfonda/simhash:

package main

import (
	"fmt"

	"github.com/mfonda/simhash"
)

func main() {
	var docs = [][]byte{
		[]byte("this is a test phrase"), // test string 1
		[]byte("this is a test phrass"), // test string 2: similar but not identical
		[]byte("foo bar"),               // test string 3: unrelated text
	}

	hashes := make([]uint64, len(docs))
	for i, d := range docs {
		// Compute the 64-bit simhash for each test string.
		hashes[i] = simhash.Simhash(simhash.NewWordFeatureSet(d))
		fmt.Printf("Simhash of %s: %x\n", d, hashes[i])
	}

	// Compare test string 1 with test string 2 (small Hamming distance)
	fmt.Printf("Comparison of 0 1: %d\n", simhash.Compare(hashes[0], hashes[1]))
	// Compare test string 1 with test string 3 (large Hamming distance)
	fmt.Printf("Comparison of 0 2: %d\n", simhash.Compare(hashes[0], hashes[2]))
}

After reading the code, let me explain the approach.

We can run a resident process that, as part of a consumer group, continuously consumes log messages from Kafka. Whenever it encounters an error log, it converts the log into a 64-bit simhash. It then walks the existing list of error types; if the log length is similar and the Hamming distance returned by simhash.Compare is no more than 12 bits, the log is placed in that class.

Please note that, due to limitations of the algorithm, simhash has a significant error rate for texts shorter than 100 characters, so you need to test the actual behavior and tune accordingly. For extremely short texts, some other auxiliary deduplication method is needed. Also note that a match rate above 80% is required for texts shorter than 100 characters, and above 90% for texts longer than 100 characters.

Lastly, since this is fuzzy matching, code file names, line numbers, and text lengths can be used as additional auxiliary factors to reduce misclassification.
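Putting these pieces together, the classification step could look roughly like the sketch below. The ErrorClass type, the 20% length tolerance, and the helper names are my own illustration; the 12-bit Hamming-distance threshold and the first-occurrence plus recent-occurrence bookkeeping mirror what the lesson describes.

package main

import (
	"fmt"

	"github.com/mfonda/simhash"
)

// ErrorClass is one cluster of similar error logs.
type ErrorClass struct {
	Hash   uint64   // simhash of the first log that opened this class
	Length int      // length of that first log, used as an auxiliary check
	First  string   // first occurrence, kept as the error "template"
	Last3  []string // the three most recent occurrences
	Count  int64
}

var classes []*ErrorClass

// classify assigns an error log line to an existing class or opens a new one.
func classify(line string) *ErrorClass {
	h := simhash.Simhash(simhash.NewWordFeatureSet([]byte(line)))
	for _, c := range classes {
		// Auxiliary check: lengths should be roughly similar (here within 20%).
		if abs(len(line)-c.Length)*5 > c.Length {
			continue
		}
		// A Hamming distance of at most 12 bits counts as the same error type.
		if simhash.Compare(h, c.Hash) <= 12 {
			c.Count++
			c.Last3 = append(c.Last3, line)
			if len(c.Last3) > 3 {
				c.Last3 = c.Last3[1:]
			}
			return c
		}
	}
	c := &ErrorClass{Hash: h, Length: len(line), First: line, Last3: []string{line}, Count: 1}
	classes = append(classes, c)
	return c
}

func abs(x int) int {
	if x < 0 {
		return -x
	}
	return x
}

func main() {
	classify("db timeout while querying user 1002312")
	classify("db timeout while querying user 8899001")
	fmt.Println("error classes in memory:", len(classes))
}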

Next, we need to display the errors that have been classified.

Here are the specific steps: If the current log matches any existing error type, we save the contents of the first occurrence as well as the last three occurrences of the error log.

On the error-classification page, we can view each error type's most recent occurrence time, its count, when it first appeared, and the content of its first occurrence. We can also jump directly to the trace rendering page via the trace ID (this is very helpful for troubleshooting; you can try it in my standalone Java implementation).

In fact, there is still plenty of room to optimize error deduplication. For example, once we have accumulated more than a thousand error types in memory, every new error log's hash has to be compared against each of those thousand-plus types, which quietly consumes a large amount of CPU.

For this situation, there are some simple tricks available online, such as splitting the 64-bit hash into two halves and comparing the first half first; only if it is similar enough do we go on to compare the rest.
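A sketch of that trick follows. It keeps 12 bits as the overall threshold; the 6-bit cutoff for the upper half is my own illustrative choice and trades a little accuracy for speed.

package main

import (
	"fmt"
	"math/bits"
)

// similar first compares the upper 32 bits of two simhashes; only when that
// half is already close does it pay for the full 64-bit comparison.
func similar(a, b uint64) bool {
	if bits.OnesCount32(uint32(a>>32)^uint32(b>>32)) > 6 {
		return false // upper halves already differ too much; skip the rest
	}
	return bits.OnesCount64(a^b) <= 12
}

func main() {
	fmt.Println(similar(0xdeadbeefcafef00d, 0xdeadbeefcafef00f)) // true: 1 bit apart
	fmt.Println(similar(0xdeadbeefcafef00d, 0x0123456789abcdef)) // false: far apart
}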

This technique is called log aggregation, but it is not widely used in the industry.

Cloud providers also offer similar functionality, but it is rarely used in the field of error deduplication. I believe there is still potential to be explored here. When there is sufficient computational power, the industry commonly uses K-Means or DBSCAN algorithms for log aggregation. If you are interested, you can further explore this area.

Bitmap Implementation for Frequency Counting #

Although we have now classified the errors, we still have not answered how long each error has been occurring and whether it is still happening online.

Normally, we would record these logs one by one in an OLAP-based statistical analysis system and aggregate the statistics by time partition. However, this method requires a significant amount of computational power, which we lack. Are there other ways to represent this?

Here I used a small trick: once an error type has appeared for the first time, I use one bit in a bitmap to represent each second of the day.

If the same type of error occurs again within a given second, the corresponding bit is set to 1. This lets us use the bitmap to trace the time periods in which errors occurred, so we can track their frequency and cycles with very little memory.

However, this approach introduces a new problem: significant memory waste. The error statistics are kept in memory grouped by error category, and a newly launched business can easily produce on the order of a thousand distinct error types per day, which means keeping 1350 × 1000 int64 values in memory (one day of per-second bits is 86,400 / 64 = 1,350 int64s per error type).

To reduce memory usage, I replaced the plain bitmap with a Roaring bitmap. It compresses the bitmap and is especially effective on runs of continuous or similar data. In fact, bitmaps have many more uses: they can serve as compact flags for all sorts of annotations, and compared with conventional data structures they save far more memory and storage.
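For reference, here is a minimal sketch using the RoaringBitmap Go library (github.com/RoaringBitmap/roaring). The one-bit-per-second-of-day encoding follows the scheme above, while the variable and function names are my own.

package main

import (
	"fmt"
	"time"

	"github.com/RoaringBitmap/roaring"
)

// occurrences records, for one error class, every second of the day
// in which the error was seen at least once.
var occurrences = roaring.NewBitmap()

// markError sets the bit for the second of the day in which the error happened.
func markError(ts time.Time) {
	secondOfDay := uint32(ts.Hour()*3600 + ts.Minute()*60 + ts.Second())
	occurrences.Add(secondOfDay)
}

func main() {
	now := time.Now()
	markError(now)
	markError(now.Add(time.Second))

	fmt.Println("seconds with errors:", occurrences.GetCardinality())
	fmt.Println("compressed size in bytes:", occurrences.GetSerializedSizeInBytes())
}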

Summary #

In this lesson, I shared with you four practical algorithms, all of which I have verified through practice. You can review and memorize them by combining them with the image below.

[Image: summary of the four algorithms in this lesson]

To solve the problem of URLs that cannot be grouped because of their differing parameters, you can normalize them with configured templates or with data-feature filtering. You can also reduce the volume of statistical results by using time chunks.

To analyze a large number of error logs, the simhash algorithm is a good choice. It can be used in conjunction with a bitmap to record the frequency of error log occurrences. With the help of these algorithms, you can achieve fault monitoring and aggregation analysis of online services with minimal system resources, and visually display the working status of the services.

After learning this lesson, have you realized that it is very interesting to use simple algorithms to achieve services that previously required dozens of distributed servers, especially in situations where resources are scarce?

Even today, there are still many scenarios that call for special designs to reduce resource consumption: for example, using a Bloom filter to cut down on scans, using Redis's HyperLogLog for approximate counting over large data sets, and using GeoHash for map grid partitioning and statistics. If you are interested, take a look at Redis Modules after class.

Thought Questions #

Based on the algorithms and approaches discussed in this lesson, how can SQL perform aggregation, grouping, and deduplication?

Feel free to engage in discussion and exchange ideas with me in the comment section. See you in the next class!