30 Data Reporting Part 1: How to Implement Highly Available Reporting Components #

Whether it is real-time monitoring in “Efficient Testing” or a data validation platform in “Version Release,” I have mentioned the importance of data many times.

For data assessment, our expectation is to be “both fast and accurate.” “Fast” refers to timeliness: we want to be able to evaluate the data within an hour, or even within a minute, instead of waiting a day or several days. “Accurate” refers to accuracy: the data must reflect the real state of the business, so that we do not make wrong product decisions based on inaccurate data.

However, “you can’t make bricks without straw”: the accuracy and timeliness of the data platform depend on the capabilities of the client’s data collection and reporting component. So how can we ensure the timeliness and accuracy of the client reporting component? How do we build a “highly available” reporting component?

Unified and Highly Available Reporting Component #

You may wonder what a “highly available” reporting component is. In my opinion, it should achieve at least three goals:

  • Data loss prevention: Data should not be lost due to application crashes or abnormal system shutdowns.

  • High real-time performance: Whether the application is in the foreground or the background, all data should be reported within a short period of time.

  • High performance: This mainly involves two dimensions: lag and traffic. The reporting component should not cause lag due to excessive CPU and I/O usage, nor should it result in excessive traffic consumption due to poor design.

However, data integrity, real-time performance, and performance pull against one another like weights on a scale: we cannot maximize all three at once. Therefore, while keeping performance in check, we can only do our best to ensure that data is not lost and to minimize the reporting delay.

In the “Network Optimization” section I mentioned the need for a unified network library more than once: as an important foundational component, it should be shared by all business modules within an application, and ideally across the Android and iOS platforms as well.

Similarly, the reporting component is also an important foundation component of an application. We hope to build a unified and highly available reporting component.

The data tracking pipeline mainly consists of sampling, storage, reporting, and disaster recovery. Now let’s break down each module and look at the difficulties involved.

1. Sampling Module

For some client data, the volume may be very large, but we do not need to report all of it to the backend. For example, for performance data such as lagging and memory usage, we only need to collect statistics from a small number of users.

The sampling module is often overlooked by developers during the design phase, yet it is the most complex module of them all: it requires weighing various strategy choices, some of which are discussed below.

Most components use PV (Page View) sampling, which is indeed the simplest approach. However, for performance data tracking, to minimize the impact on users, I lean towards UV (Unique Visitor) sampling, and I also want a different batch of users to report each day.

In the end, the solution I chose is “UV sampling + random user identifier + daily user rotation”. However, sampling still needs to meet three criteria.

  • Accuracy: If the sampling ratio is set to 1%, it is necessary to ensure that only 1% of users will report this data at any given moment.

  • Uniformity: If the sampling ratio is set to 1%, a different 1% of users should be rotated daily to report this data.

  • Smooth switching: User switching should be smooth, without simultaneous switching at a specific time (e.g., 12:00), which would lead to inconsistent background data.

Implementing these three criteria is not easy. In WeChat, we adopted the following algorithm:

// id: user identifier, such as a WeChat ID or QQ ID; any stable hash works, String.hashCode() is used here for brevity
// samplingReciprocal: reciprocal of the sampling ratio, e.g. 100 for 1% sampling
static boolean shouldReport(String id, long unixTimestampSeconds, int samplingReciprocal) {
    int idIndex = Math.floorMod(id.hashCode(), samplingReciprocal);
    long timeIndex = (unixTimestampSeconds / (24 * 60 * 60)) % samplingReciprocal;
    return idIndex == timeIndex;
}

Each sampling window lasts 24 hours, which gives a smooth rotation instead of every user switching strategy at midnight: some users switch at 10:00 in the morning, others at 11:00, spreading the rotation across the whole 24-hour period. Viewed over any hour or day, the sampling ratio also remains accurate.
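
For example, with 1% sampling (a reciprocal of 100), a user’s hash bucket stays fixed while the time index advances by one each day, so each user is selected roughly one day out of every hundred. A minimal usage sketch of the function above (the method name shouldReport and the identifier below are purely illustrative):

int samplingReciprocal = 100;                        // 1% sampling
long nowSeconds = System.currentTimeMillis() / 1000; // current unix timestamp in seconds
if (shouldReport("wxid_example", nowSeconds, samplingReciprocal)) {
    // this user is in today's sample: record and report performance data
}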

Different tracking points can have different sampling rates, and they are independent and do not affect each other. In addition to the sampling rate, we can also add other control parameters to the sampling strategy, such as:

  • Reporting interval: Configuring the reporting interval for each tracking point, such as 1 second, 1 minute, 10 minutes, 60 minutes, etc.

  • Reporting network: Controlling certain tracking points so that they are only uploaded over WiFi.

2. Storage Module

For the storage module, our goal is to ensure data integrity while balancing performance. So how can we achieve this? First, we need to consider the choice of processes and storage modes.

The most common reporting component in the industry uses “single-process write + file storage + memory cache”. Although this approach is the simplest to implement, both the backlog of cross-process IPC calls (IPC is always slow) and the memory cache can lead to data loss.

Let’s review the comparison of mmap, memory, and file writes that I listed in the “I/O Optimization” section.

As you can see, mmap performs very well, so we ultimately chose the “multi-process write + mmap” solution and abandoned the memory cache entirely. However, mmap’s performance is not perfect either: it can still fall back to synchronous disk writes at certain moments, so each process’s mmap operations need to be handled on a separate thread.
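
As an illustration of the “multi-process write + mmap” idea, here is a minimal Java sketch of an mmap-backed record buffer built on MappedByteBuffer; the class name, file layout, and buffer size are my own assumptions for illustration, not the article’s actual implementation.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// A per-process, mmap-backed record buffer. Writes land in the kernel page cache,
// so they survive an application crash (though not a sudden power loss).
final class MmapRecordBuffer {
    private final MappedByteBuffer buffer;

    MmapRecordBuffer(String path, int sizeBytes) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "rw")) {
            // The mapping stays valid even after the channel is closed.
            buffer = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, sizeBytes);
        }
    }

    // Append one serialized record. Call this on a dedicated storage thread,
    // because a page fault can still stall the writer for a moment.
    synchronized boolean append(byte[] record) {
        if (buffer.remaining() < record.length) {
            return false; // buffer full: rotate to a new file (not shown)
        }
        buffer.put(record);
        return true;
    }
}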

The “multi-process write + mmap” solution gives us lock-free, IPC-free writes with minimal data loss. It seems perfect, but is it really that simple to implement? Definitely not, because we still need to consider data aggregation and the priority of reported data.

  • Data Aggregation: To reduce the amount of data reported, especially for performance monitoring, we need to support data aggregation. Most components aggregate only at reporting time, which does not reduce the data volume during storage. Since we are using mmap, we can manipulate the data in the file as if it were in memory, which allows much more efficient aggregation of performance data (see the sketch after this list).

  • Reporting Data Priorities: Many reporting components use a parameter to determine the importance of the data being recorded and choose to write the important data directly. For our solution, all data is considered important by default. As for the priority of reporting data, I suggest using reporting intervals, such as 1 minute, 10 minutes, or 1 hour.
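
To illustrate the aggregation point above, here is a rough sketch that treats the mapped file as memory and updates a fixed (key, count, sum) slot in place instead of appending a new record for every sample; the slot layout is an assumption for illustration only.

import java.nio.MappedByteBuffer;

// Illustrative in-place aggregation on top of the mapped buffer: each metric
// occupies a fixed slot of (key, count, sum), so repeated samples update the
// slot instead of appending a new record.
final class AggregatedMetricSlot {
    private static final int SLOT_BYTES = 8 + 8 + 8; // key, count, sum

    // slotIndex would come from a small in-memory index of metric keys (not shown).
    static void addSample(MappedByteBuffer buffer, int slotIndex, long key, long value) {
        int base = slotIndex * SLOT_BYTES;
        buffer.putLong(base, key);                                     // metric key
        buffer.putLong(base + 8, buffer.getLong(base + 8) + 1);        // count++
        buffer.putLong(base + 16, buffer.getLong(base + 16) + value);  // sum += value
    }
}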

For some sensitive data, encryption support may also be needed. For encrypted data, I recommend using a separate mmap file for storage.

Why do I say the data is basically not lost, rather than never lost? Because if a data point has not yet been written to mmap and is still in the sampling logic or the internal storage pipeline, it can still be lost if the application crashes at that moment. To minimize this situation, we made two optimizations.

  • Simplified Processing Logic: Try to reduce the processing time for each data point to within 0.1 milliseconds.

  • Wait for KillProcess: Before actively calling KillProcess, a separate function needs to be invoked to wait until all data points in the queue have been processed (a sketch follows this list).
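
A minimal sketch of the second optimization, assuming a hypothetical ReportQueue.flushAll() that stands in for the component’s real drain-the-queue call (the article does not name it):

import android.os.Process;
import java.util.concurrent.TimeUnit;

// Hypothetical flush-before-kill helper.
final class SafeExit {
    static void killProcessSafely(ReportQueue queue) {
        try {
            // Block (with a timeout) until every queued tracking point has been
            // written to the mmap file, so an explicit process kill loses nothing.
            queue.flushAll(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ignored) {
            Thread.currentThread().interrupt();
        }
        Process.killProcess(Process.myPid());
    }

    interface ReportQueue {
        void flushAll(long timeout, TimeUnit unit) throws InterruptedException;
    }
}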

3. Reporting Module

For the reporting module, we not only need to meet real-time reporting requirements but also optimize traffic usage. The main strategies to consider are outlined below.

To address real-time reporting from background processes, we adopted a single-process reporting strategy: I recommend using the process with the strongest keep-alive capability as the sole reporting process. To control the reporting interval more precisely, we adopted a more elaborate “shuttle bus” scheduling system, sketched below.
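
Below is a bare-bones sketch of what such a “shuttle bus” scheduler might look like, reusing the 1-minute, 10-minute, and 60-minute intervals mentioned earlier; the class and callback names are illustrative, not the actual implementation.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Each priority has its own departure interval; a "departure" hands the
// accumulated file over to the reporting process (dispatch callbacks not shown).
final class ShuttleScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    void start(Runnable dispatchHigh, Runnable dispatchNormal, Runnable dispatchLow) {
        executor.scheduleAtFixedRate(dispatchHigh, 1, 1, TimeUnit.MINUTES);      // high-priority shuttle
        executor.scheduleAtFixedRate(dispatchNormal, 10, 10, TimeUnit.MINUTES);  // normal-priority shuttle
        executor.scheduleAtFixedRate(dispatchLow, 60, 60, TimeUnit.MINUTES);     // low-priority shuttle
    }
}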

After careful consideration, we finally adopted the “multi-process write + single-process reporting” solution for the reporting module. One difficulty here is how to collect, in a timely manner, all the shuttles that have “arrived at their stop”: will there be synchronization problems between multiple processes? We solve this with the atomicity of Linux file renaming plus the FileObserver mechanism, which gives a completely lock-free, high-performance file synchronization model.

When a shuttle of a given priority “arrives” in a process, that process renames the corresponding file into the directory that holds data ready for reporting. Since renaming is an atomic operation, there is no need to worry about two processes operating on the same file simultaneously. The reporting process only needs to monitor changes to that directory to keep its view of the files in sync. This avoids multiple processes having to coordinate writes to the same file, and the whole flow requires no cross-process locks.
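
Here is a minimal Android-flavored sketch of this handover, combining an atomic rename() with a FileObserver on the reporting directory; the class name, directory layout, and upload hook are assumptions for illustration.

import android.os.FileObserver;
import java.io.File;

// Writer processes atomically rename a full mmap file into the report directory;
// the single reporting process watches that directory for new arrivals.
final class ReportDirWatcher extends FileObserver {
    private final File reportDir;

    ReportDirWatcher(File reportDir) {
        super(reportDir.getAbsolutePath(), FileObserver.MOVED_TO);
        this.reportDir = reportDir;
        startWatching();
    }

    // Called in any writer process when a shuttle "arrives": rename() is atomic
    // on the same file system, so no cross-process lock is needed.
    static boolean handOver(File mmapFile, File reportDir) {
        return mmapFile.renameTo(new File(reportDir, mmapFile.getName()));
    }

    // Runs in the reporting process whenever a new file is moved in.
    @Override
    public void onEvent(int event, String path) {
        if (path != null) {
            File readyFile = new File(reportDir, path);
            // enqueue readyFile for upload according to its priority (not shown)
        }
    }
}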

Of course, there are still many other challenges in the reporting module, for example:

  • When merging files for upload, higher-priority files should be taken first.

  • The size of each reported batch should be smaller on cellular networks than on WiFi, and files of different priorities should be combined so that the available bandwidth is used fully.

  • On weak networks, smaller data packets should be used and the highest-priority data should be reported first.
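
As a rough illustration of the network-related points above, the following sketch picks a batch size according to network quality; the thresholds are made-up values, not the component’s real configuration.

// Illustrative batch-size policy only.
enum NetworkQuality { WIFI, CELLULAR, WEAK }

final class BatchPolicy {
    // Pick how many bytes to merge into one report, based on the network.
    static int maxBatchBytes(NetworkQuality quality) {
        switch (quality) {
            case WIFI:     return 512 * 1024; // larger batches on WiFi
            case CELLULAR: return 128 * 1024; // save user traffic on cellular
            default:       return 16 * 1024;  // weak network: small packets,
                                              // highest-priority data first
        }
    }
}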

4. Disaster Recovery Module

Although we have designed a powerful reporting component, improper use can still lead to serious performance issues. I have encountered a case where a developer fired a tracking point one million times inside a for loop, and another where a user who had no network connection for a long time accumulated a huge amount of data locally.

A powerful component also needs the ability to withstand such disasters, so we build several safeguard strategies into the client itself.

The disaster recovery module mainly ensures that even in cases of developer errors, component internal exceptions, and so on, there won’t be serious problems with user storage space and traffic usage.
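
A minimal sketch of what such local safeguards could look like, assuming illustrative caps on pending local storage and daily cellular traffic (the real limits are not given in the article):

// Illustrative safeguards; the 50 MB storage cap and 10 MB daily cellular
// traffic cap are assumptions for this sketch.
final class DisasterGuard {
    private static final long MAX_PENDING_BYTES = 50L * 1024 * 1024;
    private static final long MAX_DAILY_CELLULAR_BYTES = 10L * 1024 * 1024;

    private long pendingBytes;        // updated by the storage module (not shown)
    private long cellularBytesToday;  // updated by the reporting module (not shown)

    // Drop (or heavily sample) new records once local storage exceeds the cap,
    // e.g. when a tracking point is fired inside a tight loop by mistake.
    boolean allowWrite(int recordBytes) {
        return pendingBytes + recordBytes <= MAX_PENDING_BYTES;
    }

    // Stop cellular uploads for the day once the traffic budget is used up.
    boolean allowCellularUpload(int batchBytes) {
        return cellularBytesToday + batchBytes <= MAX_DAILY_CELLULAR_BYTES;
    }
}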

Data Self-Monitoring #

Through “multi-process writing + mmap + background process reporting + shuttle mode”, we have implemented a high-performance reporting component that is completely lock-free, with a negligible data loss rate, and no inter-process IPC calls. Moreover, through a fault tolerance mechanism, it can also automatically recover from exceptional situations.

But is the online performance really that perfect? How do we ensure the reliability and timeliness of the reporting component? The answer is still monitoring. We need to establish a comprehensive self-monitoring system to provide reliable data support for further optimization.

1. Quality Monitoring

The core data indicators of the reporting component revolve mainly around the data arrival rate, which we normally calculate on a daily basis.

Of course, if we pursue higher real-time performance, we can choose to calculate the hourly arrival rate, or even the minute arrival rate.

2. Fault Tolerance Monitoring

When fault-tolerance handling is triggered on a client, we also report that event separately to the backend for monitoring.

In addition to monitoring exceptional situations, we also want to monitor, at a finer granularity, the distribution of users’ daily mobile-data and WiFi traffic consumption, for example the percentage of users in the 0 to 1MB range, the 1 to 5MB range, and so on.

Summary #

The network and data are both extremely important foundational components. Today, we have built a cross-platform, highly available reporting component together. This is currently one of the more advanced solutions, with quality indicators significantly better than those of traditional solutions.


Of course, many details still need to be considered in the actual implementation, and there are plenty of hidden pitfalls, regardless of whether we use C++; the platforms also differ, for example iOS does not need to consider multi-process issues at all.

From my own practice I have come to realize that building a network library or a reporting component ourselves is not the hard part; polishing it to perfection is, and that inevitably requires meticulous refinement and long periods of iteration and optimization.

Homework #

Which data reporting component is currently being used in your company? What problems does it have? Feel free to leave a comment to discuss with me and other students.

Today’s homework is to answer two hidden questions in the implementation. Please write your answers in the comment section.

1. Updating the sampling strategy. When the server updates the sampling strategy, how do we ensure that the new strategy takes effect on the client side as quickly as possible, without using push?

2. Tracking-point process crash. If Process A crashes suddenly, which process should be responsible for renaming Process A’s pending tracking data into the reporting data directory in a timely manner, and when and how should it do so?

Feel free to click “请朋友读” (Invite Friends to Read) to share today’s content with your friends and invite them to learn together. Finally, don’t forget to submit today’s homework in the comment section. I have also prepared a generous “learning encouragement package” for students who complete the homework diligently. Looking forward to progressing together with you.