
31 Data Reporting Part 2: What Is a Big Data Platform? #

Data is the bridge connecting products and users. It reflects how users actually use the product and is an important basis for business decisions. Although a “highly available reporting component” can guarantee the accuracy and timeliness of data collection at the source, as an App’s business iterations grow more complex, problems such as missing tracking points, incorrect tracking points, and tracking points that are inconsistent across platforms occur frequently, undermining the stability of business data.

I have seen many teams manage their tracking documentation in a very ad hoc way; some still use Excel, which often makes it impossible to find the definition of a given tracking point. As tracking technology and processes mature, we need a complete set of solutions to ensure the stability of our data.

So, what specifications should tracking points follow? How can we provide guidance and monitoring throughout the entire tracking process? How can we build a one-stop tracking platform covering tracking management, tracking development, tracking testing and validation, and tracking data monitoring? And what does a big data platform look like on top of that tracking platform?

Basics of Data Collection #

We know that launching a business data collection effort involves multiple stages: requirement gathering, development, and testing. It requires collaboration among the product, development, and testing teams, and in larger organizations a dedicated data team may be involved as well.

With traditional data collection, issues like erroneous or missed tracking are bound to occur again and again, and every team involved has to spend a lot of effort diagnosing and addressing accuracy problems. Worse, fixing a tracking problem usually requires releasing a new version of the App, so the fix cycle is long and the cost is huge.

So how do we solve this problem? Let’s first think about how we can achieve accurate data collection.

To achieve accurate data collection, the following two conditions must be met, which requires strict management of the collection process. Specifically, you need:

  • Unified data collection specifications. From log formats to parameter semantics, there must be one set of rules within the application, and ideally across the entire company.

  • Unified data collection process. Throughout the collection process, the product, development, testing, and data teams each shoulder their own responsibilities and collaborate closely under a single standardized workflow.

By enforcing unified data collection specifications and processes, we hope to reduce the development cost of data collection and ensure data accuracy. Now let’s take a look at how to implement this in practice.

1. Unified data collection specifications

If you open the Taobao homepage, you will notice a parameter called SPM in the URL.

https://www.taobao.com/?spm=a21bo.2017.201857.3.5af911d9ycCIDq

What does this SPM parameter mean? SPM stands for Super Position Model, which is a unified data collection specification protocol within Alibaba. Whether it is H5 or Native (Android and iOS), it must comply with this specification.

Just like the link above, SPM is composed of four sections: A.B.C.D. Each section represents the following meanings:

  • A: website/business

  • B: page

  • C: page section

  • D: position within the section

Note: a21bo.2017.201857.3.5af911d9ycCIDq actually has five parts; the fifth is a randomly generated feature code that guarantees the uniqueness of each SPM value.
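To make the format concrete, here is a minimal Java sketch that splits an SPM value into its sections. It is purely illustrative; the printed labels are my own wording, not part of the SPM specification.

```java
public class SpmParser {
    public static void main(String[] args) {
        String spm = "a21bo.2017.201857.3.5af911d9ycCIDq";
        String[] parts = spm.split("\\.");
        // A.B.C.D, plus an optional fifth randomly generated feature code
        System.out.println("website/business (A) = " + parts[0]);
        System.out.println("page (B)             = " + parts[1]);
        System.out.println("page section (C)     = " + parts[2]);
        System.out.println("position (D)         = " + parts[3]);
        if (parts.length > 4) {
            System.out.println("feature code         = " + parts[4]);
        }
    }
}
```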

SPM mainly covers three types of events: page visits, control clicks, and exposures. It records exactly what a user clicked or viewed on the current page, and it also allows us to infer which page, and which position on it, the user came from. Based on the SPM specification, Taobao can derive metrics for every page, such as PV, click-through rate, dwell time, conversion rate, and user paths.

For data collection specifications, from common parameters to various business-specific parameters, we need to define a complete log format. Currently, this SPM specification has been widely adopted within Alibaba and its external partners. By having unified rules across departments and clients, it not only reduces internal learning and communication costs but also brings great convenience to subsequent data storage, verification, and analysis.

“Everyone has their own understanding of Hamlet,” and every company’s situation is different, so we cannot assume that Alibaba’s data collection specification suits every enterprise. But whichever specification you finally settle on, there should at least be one unified data collection specification within the company.

If you want to know more about the SPM specification, you can refer to What is the purpose of SPM parameters and Alibaba’s Log Collection Sharing.

2. Unified data collection process

The entire tracking process involves multiple teams: product, development, testing, and data. Without a well-defined process, it is easy to end up in a “four-way melee” where the teams blame each other whenever a data issue arises.

“Nothing can be accomplished without norms and standards.” We need to establish a unified tracking process and strictly define the steps and responsibilities of every participant in it.

(Figure: the unified tracking process)

  • Requirement Phase. During requirement review, the product team specifies the tracking requirements. If there is a data team, it reviews those requirements; the testing team also drafts a corresponding test plan based on them. The product team owns this phase, making sure both the requirements and the test plan are in place.

  • Development Phase. Developers implement the tracking code according to the tracking requirement document, then self-test it locally. The development team owns this phase.

  • Testing Phase. Testers perform local acceptance testing against the tracking requirements and rules. The testing team owns this phase.

  • Gradual Deployment Phase. During the gray release, testers monitor the online tracking data and check that it satisfies the requirements and rules, while the product team verifies that the data matches expectations. The testing team owns this phase, with the product team participating.

By establishing a unified tracking process, we clearly define the tasks and responsibilities for each phase, which reduces the cost of tracking and decreases the chances of errors.

3. Tracking Methods

There are many different approaches to tracking, such as code-based tracking, visual tracking, declarative tracking, and invisible tracking. “Meituan Dianping’s Practical Guide to Front-end Invisible Tracking” and “NetEase HubbleData’s Android Invisible Tracking Practice” both categorize tracking methods into three types:

  • Code-based Tracking. The tracking API is called explicitly at specific points to send the data. This is the approach used by most third-party analytics providers, such as Umeng and Baidu Analytics.

  • Visual Tracking. Tracking points are configured through a visual tool; the front-end framework then parses the configuration automatically and sends the data, so no tracking code has to be written by hand. The most representative example is the open-source Mixpanel.

  • Invisible Tracking. This does not mean that no tracking code is needed; rather, the SDK automatically collects and reports all events, and the useful data is filtered out during back-end processing. The most representative case in China is GrowingIO.

The most commonly used method is code-based tracking; the sketch below shows what it looks like in practice. Visual tracking and invisible tracking, by contrast, both require automatically reporting tracking data and automatically intercepting events.
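As a baseline, here is a minimal, hypothetical sketch of code-based tracking. The Tracker class and its track() signature are stand-ins of my own, not a real SDK API:

```java
import java.util.Map;

// A hypothetical minimal tracking API, to make code-based tracking concrete.
final class Tracker {
    static void track(String event, Map<String, String> props) {
        // A real SDK would assemble a log entry per the unified specification
        // and hand it to the highly available reporting component.
        System.out.println(event + " " + props);
    }
}

class CheckoutButton {
    void onCheckoutClicked() {
        // The developer explicitly calls the tracking API at the point of interest.
        Tracker.track("checkout_click",
                Map.of("spm", "a21bo.2017.201857.3", "itemId", "12345"));
    }
}
```

The cost of this approach is obvious: every new tracking point means another manual code change, which is exactly what the automatic interception techniques below try to avoid.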

To further understand, let’s take the example of tracking button clicks. We can achieve this through different methods:

  • Instrumentation Replacement. For button clicks, we can use ASM at build time to globally instrument the onClick method of View.OnClickListener implementations, routing calls through our own Proxy that adds the tracking code.

  • Hook Replacement. Using Java reflection, we start from the root view, recursively traverse all View objects, and hook each one’s OnClickListener, replacing it with our own Proxy implementation; see the sketch after this list.

  • AccessibilityDelegate Mechanism. AccessibilityDelegate lets us observe states such as clicks, selections, scrolling, and text changes. When a control triggers a click event, the corresponding AccessibilityEvent callback fires, and we can add the tracking code there.

  • dispatchTouchEvent Mechanism. The dispatchTouchEvent method is how the system dispatches touch events. By overriding it, for example in a base Activity or View, we can listen to all touch events.
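Here is a minimal Java sketch of the hook-replacement approach. It relies on Android’s hidden View#getListenerInfo() internals, whose shape can change across OS versions, and the Tracker class is a hypothetical stand-in for a real SDK call, so treat this as illustrative only:

```java
import android.util.Log;
import android.view.View;
import android.view.ViewGroup;

import java.lang.reflect.Field;
import java.lang.reflect.Method;

// Hypothetical tracking entry point (a stand-in for a real SDK call).
final class Tracker {
    static void trackClick(View v) {
        Log.d("Tracker", "click on " + v.getClass().getSimpleName());
    }
}

public final class ClickHooker {

    // Wraps the original listener and adds tracking before delegating to it.
    static class ProxyOnClickListener implements View.OnClickListener {
        private final View.OnClickListener origin;

        ProxyOnClickListener(View.OnClickListener origin) {
            this.origin = origin;
        }

        @Override
        public void onClick(View v) {
            Tracker.trackClick(v);            // tracking code added by the proxy
            if (origin != null) origin.onClick(v);
        }
    }

    // Recursively traverse the view tree from the root view, hooking every View.
    public static void hookAllClicks(View root) {
        replaceListener(root);
        if (root instanceof ViewGroup) {
            ViewGroup group = (ViewGroup) root;
            for (int i = 0; i < group.getChildCount(); i++) {
                hookAllClicks(group.getChildAt(i));
            }
        }
    }

    private static void replaceListener(View view) {
        try {
            // getListenerInfo() and ListenerInfo are hidden Android internals.
            Method getListenerInfo = View.class.getDeclaredMethod("getListenerInfo");
            getListenerInfo.setAccessible(true);
            Object listenerInfo = getListenerInfo.invoke(view);

            Field clickField = Class.forName("android.view.View$ListenerInfo")
                    .getDeclaredField("mOnClickListener");
            clickField.setAccessible(true);

            View.OnClickListener origin =
                    (View.OnClickListener) clickField.get(listenerInfo);
            if (origin != null && !(origin instanceof ProxyOnClickListener)) {
                clickField.set(listenerInfo, new ProxyOnClickListener(origin));
            }
        } catch (Exception ignored) {
            // Reflection on hidden APIs can legitimately fail; skip this view.
        }
    }
}
```

In practice you would call hookAllClicks() on the Activity’s decor view after layout and re-run it whenever the view tree changes, which is one reason production SDKs combine this with the other mechanisms above.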

Big Data Platform #

Although we now have unified tracking specifications and processes, the whole thing still relies on manual work. Take tracking requirement management as an example: many teams still use Excel for it, and with continuous modification the document grows ever more chaotic and its history becomes hard to trace.

So how can we build a one-stop tracking platform that includes tracking management, tracking development, tracking testing and validation, and tracking data monitoring?

1. One-stop tracking platform

The one-stop tracking platform visualizes the management of tracking definitions and helps locate tracking issues during development and testing. It can also automatically validate local and online tracking data, analyze the results, and send notifications, reducing the cost of tracking development and validation and improving data quality.

The platform mainly consists of the following four sub-platforms.

  • Tracking Management Platform. Manages the entire tracking solution for the application, including the definition and rules of every field in the tracking data. For example, the QQ number field may be required to be numeric and non-empty. Under the SPM specification, the management platform also records the name corresponding to each page; the Taobao homepage, for instance, might be represented as a123.

  • Tracking Development Assistant Platform. The development assistant platform exists to improve tracking development efficiency, for example through the visual tracking mentioned earlier, or by automatically generating code from the fields and rules defined in the tracking management platform. Developers can import the generated tracking definition class with one click and only need to add the corresponding calls in their code.

  • Tracking Validation Platform. The validation platform is essential both for developers’ tracking tests and for testers’ local acceptance testing. We can switch a device into real-time data-upload mode, for example by scanning a QR code. The validation platform then pulls the tracking definitions and rules from the configuration platform and displays and validates the data reported by the client in real time, flagging tracking points that miss a field, carry an extra field, or violate a preset rule, such as letters appearing in the QQ number field or an empty value for a required numeric field. A minimal sketch of such a rule check follows this list.

Since manual testing cannot cover every scenario, we also rely on automation and gray-release verification. The overall approach is the same, except that these use the online, non-real-time channel and output data verification reports every hour or every day.

  • Tracking Monitoring Platform. The goal of monitoring is to ensure the robustness of the entire data chain. This includes monitoring the client-side “highly available data reporting component,” such as the quality and disaster-recovery monitoring discussed in the previous article, as well as monitoring back-end parsing, storage, and analysis: total log volume, abnormal log volume, lost log volume, and so on.
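Going back to the validation platform, here is a minimal Java sketch of what a field-rule check might look like. The FieldRule model and all names are hypothetical; a real platform would pull these rules from the tracking management platform:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// A hypothetical field rule, as defined on the tracking management platform.
class FieldRule {
    final String field;      // e.g. "qq"
    final boolean required;  // must be present and non-empty
    final Pattern pattern;   // e.g. digits only

    FieldRule(String field, boolean required, String regex) {
        this.field = field;
        this.required = required;
        this.pattern = Pattern.compile(regex);
    }
}

public class TrackingValidator {
    // Returns one human-readable violation per broken rule.
    static List<String> validate(Map<String, String> event, List<FieldRule> rules) {
        List<String> violations = new ArrayList<>();
        for (FieldRule rule : rules) {
            String value = event.get(rule.field);
            if (value == null || value.isEmpty()) {
                if (rule.required) violations.add(rule.field + ": missing or empty");
            } else if (!rule.pattern.matcher(value).matches()) {
                violations.add(rule.field + ": value \"" + value + "\" violates rule");
            }
        }
        return violations;
    }

    public static void main(String[] args) {
        List<FieldRule> rules = List.of(new FieldRule("qq", true, "\\d+"));
        // "abc123" contains letters, so the numeric-only rule is violated.
        System.out.println(validate(Map.of("qq", "abc123"), rules));
    }
}
```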

I’m not sure if you noticed, but the tracking management platform also manages sampling strategies. Going back to the homework question I left for you in the last article, when our server sampling strategies are updated, how can we ensure that the new sampling strategy takes effect on the client side as quickly as possible without using push?

Actually, it’s quite simple. Whenever someone changes a tracking point’s sampling configuration, the tracking configuration platform increments the sampling strategy version number and pushes the latest strategy and version number to the data collection service. The tracking SDK carries its local strategy version number with every report; if that version is lower than the server’s, the data collection service returns the latest strategy directly in the response. This guarantees that as long as a client successfully reports any tracking data, it obtains the latest sampling strategy. In fact, many other configurations are updated the same way.
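Here is a minimal Java sketch of that version-number handshake on the client side. All class and field names are hypothetical:

```java
// Client side of the version-number-based strategy sync described above.
final class SamplingSync {

    static final class Strategy {
        final int sampleRatePercent;      // e.g. report 10% of events
        Strategy(int rate) { this.sampleRatePercent = rate; }
    }

    static final class ReportResponse {
        final int strategyVersion;        // server's current version number
        final Strategy latestStrategy;    // non-null only if the client was behind
        ReportResponse(int version, Strategy strategy) {
            this.strategyVersion = version;
            this.latestStrategy = strategy;
        }
    }

    private volatile int localVersion = 0;
    private volatile Strategy strategy = new Strategy(100);

    // Every upload carries the local strategy version number.
    int versionForUpload() {
        return localVersion;
    }

    // If the server saw an outdated version, it piggybacks the latest strategy
    // in the response, and the client applies it here.
    void onUploadResponse(ReportResponse resp) {
        if (resp.latestStrategy != null && resp.strategyVersion > localVersion) {
            strategy = resp.latestStrategy;
            localVersion = resp.strategyVersion;
        }
    }
}
```

No push channel is needed: the strategy rides along with ordinary report responses, so any client that reports at all converges to the newest version.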

2. Data Product

The one-stop tracking platform is only a small part of the whole data platform; its job is to ensure that the reported data is accurate. As I understand it, the simplified architecture of the entire big data platform looks like this:

  • Collection Tool Layer. The application’s reporting component is responsible for tracking and for assembling and reporting logs. It must guarantee the accuracy and timeliness of the data.

  • Data Collection Layer. The data collection layer cleans and processes the logs and may also interact with our one-stop tracking platform. Then, according to the data subscriptions, it distributes the data to the different computation modules.

  • Data Computation Layer. Computation is mainly divided into offline and real-time. Offline computation produces results from the received data with a delay of at least an hour, often more. Real-time computation delivers results at second or minute granularity and is generally reserved for monitoring core business. Also, because of computational cost, real-time computation usually calculates only PV; UV requires de-duplicating users, which is far more expensive, so it is normally left to offline computation.

  • Data Service Layer. Whether it is offline computation or real-time computation, we store the results in the data service layer, usually using a DB. The data service layer’s main purpose is to shield the underlying complex implementation. We only need to query the final computation results from here.

  • Data Product Layer. Data products are generally divided into two types: business-oriented and monitoring-oriented. Business-oriented products are used to view and analyze business data, such as page visits, funnel models, page flows, user behavior path analysis, etc. Monitoring-oriented products are mainly used to monitor business data, such as real-time traffic monitoring or non-real-time business data monitoring.

The data service layer is a very good design. It allows everyone in the company to easily implement different types of data products. We don’t need to worry about the complex implementation of data collection and computation in lower layers, we just need to take the data out and create a reporting and display system that meets our own needs.

For real-time monitoring, WeChat’s IDKey and Alibaba’s Sunfire are very powerful systems. They can achieve real-time PV monitoring at the minute or even second level.

Small and medium-sized companies may not have the capacity to build a complete big data platform and may need to rely on third-party services. Alibaba Cloud, for example, provides a set of OneData services.

Of course, we can also build our own data platform, but at massive data scale a stable, high-performance data computation layer is very complex. One option is to use pre-packaged computation and service layers, such as Alibaba Cloud’s MaxCompute big data computing service, and implement data products that meet our own needs on top of them.

That is a simplified view of the overall architecture of a data platform; for a concrete implementation, you can also refer to Dianping’s UAS: Dianping User Behavior System.

Summary #

In the past few years, big data has been a frequently mentioned concept. Regarding big data, or the accompanying big data platforms, I have two main reflections:

1. Technological change is driven by demand. If Taobao did not face hundreds of millions of user visits every day, and had not been repeatedly overwhelmed by floods of data, it would never have made such arduous efforts in the big data field. Technology exists to solve pain points in business scenarios. On the other hand, big data does have real barriers to entry, and smaller enterprises may never get the chance to be tempered by problems of that scale.

2. There are no shortcuts for infrastructure construction. High-availability reporting components, all-in-one tracking platforms, and various data products - the construction of these infrastructure elements requires sufficient patience, as well as human and material resources. Why adopt such norms and processes? Why is the architecture designed this way? Although these solutions may not be optimal, they are the result of trial and error, and extensive practical experience.

Homework #

Does your company have unified specifications and processes for data tracking? How complete is your supporting data infrastructure? What problems have you encountered around data security? Please leave a comment and discuss with me and the other students.

When it comes to building a data platform, Facebook internationally and Alibaba domestically have both done a great job. I recommend reading “The Road to Big Data: Alibaba’s Big Data Practice,” written by Alibaba’s data experts.

Feel free to click “Invite a friend to read” to share today’s content with a friend and study together. And don’t forget to submit today’s homework in the comments section; I have prepared a generous “learning encouragement package” for students who complete it seriously. I look forward to making progress together with you.