17 Network Optimization (Part 3): How to Monitor Networks with Big Data #

In the previous article we learned how to build a high-quality network, but is that enough? Let’s think about it. When a network request travels from a mobile phone to the backend server, it passes through hardware such as base stations, fiber-optic cables, and routers, and it also involves network operators and server rooms.

Whether it is a base station failure, a severed fiber-optic cable, operator hijacking, or a problem in our own server rooms or at our CDN providers, any of these can cause network problems for users. Have you run into all kinds of unexpected network issues in production? The operations and maintenance staff at many companies live in constant anxiety and are exhausted every day.

“Kind” failures mysteriously resolve themselves after a while, while “stubborn” ones are hard to locate and fix. How do these failures occur? Why do they suddenly recover? How many users are affected, and which ones? Answering these questions requires a high-quality network, and a high-quality network requires strong monitoring capabilities. Today, let’s look at how to monitor networks.

Mobile Monitoring #

On the mobile side, an application can issue network requests in many ways. Even if we standardize on the OkHttp network library, some developers or third-party components may still use the system’s network library. So how can we monitor all of the client’s network requests in a unified way?

1. How to monitor the network

First method: Instrumentation.

For compatibility reasons, the first thing that comes to mind is instrumentation. ArgusAPM, the open-source performance monitoring tool from 360, uses AspectJ compile-time instrumentation to monitor both the system’s network library and OkHttp requests.

For the system network library, you can refer to TraceNetTrafficMonitor, which mainly relies on AspectJ’s pointcut feature. For OkHttp interception, you can refer to OkHttp3Aspect, which is a bit simpler because OkHttp itself provides an interception mechanism:

@Pointcut("call(public okhttp3.OkHttpClient build())")
public void build() {
}

@Around("build()")
public Object aroundBuild(ProceedingJoinPoint joinPoint) throws Throwable {
    Object target = joinPoint.getTarget();
    if (target instanceof OkHttpClient.Builder && Client.isTaskRunning(ApmTask.TASK_NET)) {
        OkHttpClient.Builder builder = (OkHttpClient.Builder) target;
        builder.addInterceptor(new NetWorkInterceptor());
    }
    return joinPoint.proceed();
}

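NetWorkInterceptor above is ArgusAPM’s own monitoring interceptor. To give a rough idea of what such an interceptor does, here is a minimal sketch assuming OkHttp; the class name and the report() helper are illustrative, not ArgusAPM’s actual code.

import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.Request;
import okhttp3.Response;

public class MonitorInterceptor implements Interceptor {
    @Override
    public Response intercept(Chain chain) throws IOException {
        Request request = chain.request();
        long start = System.nanoTime();
        try {
            Response response = chain.proceed(request);
            long costMs = (System.nanoTime() - start) / 1_000_000;
            // Record URL, status code and total time; report() is a placeholder
            report(request.url().toString(), response.code(), costMs);
            return response;
        } catch (IOException e) {
            report(request.url().toString(), -1, (System.nanoTime() - start) / 1_000_000);
            throw e;
        }
    }

    private void report(String url, int code, long costMs) {
        // Hand the record to whatever reporting pipeline the app uses
    }
}
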
Instrumentation seems good, but it is not comprehensive: it cannot monitor requests made through other network libraries, nor requests issued directly from native code.

Second method: Native Hook.

Just like I/O monitoring, at this point, we think of the powerful Native Hook. We generally hook several methods related to networking:

  • Connection-related: connect.

  • Sending data-related: send and sendto.

  • Receiving data-related: recv and recvfrom.

Android’s Socket implementation differs across versions. Taking Android 7.0 as an example, the call stack for a Socket connection looks like this:

java.net.PlainSocketImpl.socketConnect(Native Method)
java.net.AbstractPlainSocketImpl.doConnect
java.net.AbstractPlainSocketImpl.connectToAddress
java.net.AbstractPlainSocketImpl.connect
java.net.SocksSocketImpl.connect
java.net.Socket.connect
com.android.okhttp.internal.Platform.connectSocket
com.android.okhttp.Connection.connectSocket
com.android.okhttp.Connection.connect

The “socketConnect” method corresponds to a native method defined in PlainSocketImpl.c, and from the makefile we can see that it is compiled into libopenjdk.so. In Android 8.0, however, the whole calling process changed completely. For compatibility, we therefore PLT hook all the .so files loaded in memory, excluding libc.so, where the Socket functions themselves are defined.

hook_plt_method_all_lib("libc.so", "connect", (hook_func) &create_hook);
hook_plt_method_all_lib("libc.so, "send", (hook_func) &send_hook);
hook_plt_method_all_lib("libc.so", "recvfrom", (hook_func) &recvfrom_hook);
...

One downside of this approach is that it also captures the system’s local sockets, so we need to add filtering conditions in the code. In today’s Sample I provide a simple implementation for you. Whichever hook method you choose, once you master it you will find it isn’t difficult; it just takes patience to trace and understand the whole call chain.

Third method: Unified network library.

Even if we can capture every network call, let’s think about what the use cases really are: simulating network data, measuring the application’s traffic, or separately proxying WebView’s network requests.

Usually we don’t care much about third-party network requests, and for our own application’s requests the best monitoring approach is a unified network library. That said, instrumentation and hooking are still useful for finding the parts of the application that bypass the unified library and use other network libraries.

In the previous article, I mentioned that “network quality monitoring” should be a very important module in a client’s network library, and it will also work together with the large network platform’s access service. With a unified network library, it’s indeed impossible to monitor third-party network requests. However, we can obtain the overall traffic usage of the application through other means. Let’s take a look together.

2. How to monitor traffic

Monitoring application traffic is very simple and is usually done with the TrafficStats class. TrafficStats was introduced in API level 8 and reports the network traffic since device boot, either for the whole phone or for a specific UID. For usage examples, you can refer to network-connection-class, an open-source library Facebook released a while ago.

getMobileRxBytes()        // Total bytes received over mobile network since device boot, not including WiFi
getTotalRxBytes()         // Total bytes received over all networks since device boot, including WiFi
getMobileTxBytes()        // Total bytes transmitted over mobile network since device boot, not including WiFi
getTotalTxBytes()         // Total bytes transmitted over all networks since device boot, including WiFi

Its implementation is actually very simple: it reads the Linux kernel’s traffic-accounting interfaces, specifically the following two proc files.

// The stats interface provides flow information of various UIDs on various network interfaces (wlan0, ppp0, etc.)
/proc/net/xt_qtaguid/stats
// The iface_stat_fmt interface provides summary flow information of each interface
/proc/net/xt_qtaguid/iface_stat_fmt

TrafficStats works by reading these proc files and summing the traffic of all network interfaces for the target UID. But what if we skip the TrafficStats API and parse the proc files ourselves? Then we can obtain the traffic per network interface and compute it separately for WiFi, 2G/3G/4G, VPN, tethering hotspot, WiFi P2P, and other network states.

Unfortunately, starting from Android 7.0 the system no longer allows us to read the stats file directly, in order to prevent applications from accessing other applications’ traffic information. So through TrafficStats we can only obtain our own application’s traffic.
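
As a minimal sketch of sampling our own application’s traffic through TrafficStats (the class and the sampling scheme are my own illustration, not from the column):

import android.net.TrafficStats;
import android.os.Process;

public class AppTraffic {
    // Note: on some devices these calls may return TrafficStats.UNSUPPORTED (-1)
    private long lastRx = TrafficStats.getUidRxBytes(Process.myUid());
    private long lastTx = TrafficStats.getUidTxBytes(Process.myUid());

    // Call periodically; returns bytes received/sent by this app since the last call
    public long[] sample() {
        long rx = TrafficStats.getUidRxBytes(Process.myUid());
        long tx = TrafficStats.getUidTxBytes(Process.myUid());
        long[] delta = {rx - lastRx, tx - lastTx};
        lastRx = rx;
        lastTx = tx;
        return delta;
    }
}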

Besides traffic, we can obtain a lot of other network-related information through /proc/net, such as the network signal strength and signal level. Both Android and the iPhone have a network testing mode; interested students can give it a try.

  • iPhone: Open the dialing interface, enter *3001#12345#* and press the dial button.

  • Android phone: Open the dialing interface, enter *#*#4636#*#* and press the dial button (this enters the engineering test mode; some builds may not support it).

How can the system determine that the WiFi is “connected, but unable to access the internet”? Let’s think back to the homework I assigned in the 15th column:

iPhone’s Wi-Fi Assist, and the Adaptive WLAN feature on Xiaomi and OnePlus phones, automatically switch to the mobile network when they detect that WiFi is unstable. Think about how they implement this detection, and how they distinguish a problem with the application’s backend server from a problem with the WiFi itself.

Looking at the students’ replies, most believe this requires probing a public IP. In fact, phone manufacturers don’t need to do that; they have access to a lot of information at the lower layers.

  • Network card driver-layer information, such as RF parameters, which indicate the WiFi signal strength, and the packet queue length, which indicates whether the network is congested.

  • Protocol stack information, mainly statistics on packets sent and received, latency, and packet loss.

If packets are being sent over the WiFi but no ACKs are coming back, the system can preliminarily conclude that the current WiFi has a problem. In this way it knows the WiFi itself is most likely at fault, without having to care whether our backend server is also involved.
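
Applications cannot see this driver-level data, but they can read the conclusion the system has already reached. As a small aside, here is a minimal sketch using the public ConnectivityManager API (API 23+; not something from the column): a network that lacks NET_CAPABILITY_VALIDATED roughly corresponds to the “connected, but unable to access the internet” case.

import android.content.Context;
import android.net.ConnectivityManager;
import android.net.Network;
import android.net.NetworkCapabilities;

public class WifiValidation {
    // Returns true if the system has validated that the active network can reach the internet.
    // Requires the ACCESS_NETWORK_STATE permission.
    public static boolean isValidated(Context context) {
        ConnectivityManager cm =
                (ConnectivityManager) context.getSystemService(Context.CONNECTIVITY_SERVICE);
        Network network = cm.getActiveNetwork();
        if (network == null) {
            return false;
        }
        NetworkCapabilities caps = cm.getNetworkCapabilities(network);
        return caps != null
                && caps.hasCapability(NetworkCapabilities.NET_CAPABILITY_VALIDATED);
    }
}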

Monitoring of Large-scale Network Platforms #

In the previous section, I discussed some methods for monitoring network requests and traffic. However, I have not yet answered how to build a powerful network monitoring system. Like network optimization, network monitoring cannot be accomplished by the client alone; it is also an important part of the entire large-scale network platform.

However, first, we need to objectively acknowledge that this task is not easy because network problems have the following characteristics:

  • Real-time. Some network problems are urgent and transient; if we don’t catch them quickly, the evidence is gone.
  • Complexity. Network problems can be related to country, region, operator, app version, OS, device model, CDN, and more; they span many dimensions and huge volumes of data.
  • Long links. The whole request chain is very long, and a fault may sit on the client, on the network link, or in the service itself.

Therefore, so-called network monitoring cannot guarantee that the cause of the problem can be clearly identified. Our goal is to quickly identify problems and obtain as much auxiliary information as possible to assist us in troubleshooting more easily.

Now let’s take a look at what information can help us better detect and resolve problems from the perspectives of the client and the access layer.

1. Client Monitoring

Client-side monitoring relies on the unified network library. The aspects we usually focus on are:

  • Latency. For each request we generally care about the DNS time, connect time, time to first packet, total time, and so on, and derive metrics such as the 1-second and 2-second fast-open rates (see the sketch after this list).
  • Dimensions. Network type, country, province, city, operator, OS, client version, device model, requested domain, etc. These dimensions are mainly used when analyzing problems.
  • Errors. DNS failures, connection failures, timeouts, error codes, etc., and metrics such as the DNS failure rate, connection failure rate, and overall network failure rate.
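
Assuming the unified network library is built on OkHttp, these per-request phases can be collected with an EventListener. A minimal sketch follows; the field names and the report() helper are illustrative.

import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.List;
import okhttp3.Call;
import okhttp3.EventListener;
import okhttp3.Protocol;

public class MetricsEventListener extends EventListener {
    private long callStartMs, dnsStartMs, connectStartMs;
    private long dnsMs, connectMs, firstPacketMs;

    @Override public void callStart(Call call) { callStartMs = now(); }

    @Override public void dnsStart(Call call, String domainName) { dnsStartMs = now(); }

    @Override public void dnsEnd(Call call, String domainName, List<InetAddress> addresses) {
        dnsMs = now() - dnsStartMs;
    }

    @Override public void connectStart(Call call, InetSocketAddress address, Proxy proxy) {
        connectStartMs = now();
    }

    @Override public void connectEnd(Call call, InetSocketAddress address, Proxy proxy, Protocol protocol) {
        connectMs = now() - connectStartMs;
    }

    // First response bytes arriving marks the "first packet" time
    @Override public void responseHeadersStart(Call call) { firstPacketMs = now() - callStartMs; }

    @Override public void callEnd(Call call) { report(dnsMs, connectMs, firstPacketMs, now() - callStartMs, 0); }

    @Override public void callFailed(Call call, IOException ioe) {
        report(dnsMs, connectMs, firstPacketMs, now() - callStartMs, -1);
    }

    private static long now() { return System.nanoTime() / 1_000_000; }

    private void report(long dns, long connect, long firstPacket, long total, int error) {
        // Attach dimensions (network type, version, model, domain, ...) and hand off to the reporting pipeline
    }
}

In a real library you would register it per call via OkHttpClient.Builder.eventListenerFactory() so that concurrent requests don’t share one listener instance; this sketch keeps a single client-wide listener for brevity.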

Based on this data we can also build an overall picture of the application’s network access. For example, in China people ask for WiFi wherever they go, and WiFi can account for more than 50% of access; this is far higher than in some overseas markets, for example in India, where WiFi accounts for only about 15%.

Similarly, we can monitor latency and error metrics by various dimensions such as client versions, countries, operators, and domain names.

Because there are many dimensions and each dimension can take many values, computing everything in real time would involve an enormous amount of data. For client-reported data, WeChat can achieve minute-level monitoring and alerting. For simplicity, though, UV is ignored and only PV is computed for selected dimensions each minute, roughly as sketched below.
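
For illustration only, minute-level PV aggregation keyed by a handful of dimensions might look like the following sketch (the dimension key format and flush mechanism are my own):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class MinutePvAggregator {
    private final ConcurrentHashMap<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Count one request under a dimension tuple, e.g. "7.0.12|CN|ChinaMobile|api.example.com"
    public void count(String version, String country, String operator, String domain) {
        String key = version + "|" + country + "|" + operator + "|" + domain;
        counters.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    // Called by a timer once a minute: emit the counts and start a new window
    // (a production implementation would swap windows atomically instead of copy-then-clear)
    public Map<String, LongAdder> flush() {
        Map<String, LongAdder> window = new ConcurrentHashMap<>(counters);
        counters.clear();
        return window;
    }
}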

2. Access Layer Monitoring

Client-side monitoring data is richer than access-layer data, because some requests may be lost before they ever reach the access layer, for example when they are hijacked by the operator.

However, monitoring the data at the access layer is still very necessary. The main reasons are:

  • Real-time. Second-level real-time reporting from the client would noticeably hurt user performance, whereas the server can easily achieve second-level monitoring.
  • Reliability. When certain network problems occur, the client’s own reporting channel may be affected as well, so client data is not always reliable.

What data should we focus on at the access layer? Generally, we pay attention to the ingress and egress traffic of the service, the server’s processing latency, error rates, etc.

3. Monitoring and Alerting

Monitoring at both the client and the access layer is tiered:

  • Real-time monitoring. The information obtained from second-level or minute-level real-time monitoring is relatively limited. For example, it may only include metrics like page views (PV) and error rates, without breaking them down into hundreds or thousands of dimensions or tracking independent visitor numbers (UV). The purpose of real-time monitoring is to quickly detect problems.
  • Offline monitoring. Hourly or daily monitoring allows us to expand the range of dimensions to monitor. Its purpose is to better identify the scope of the problem while monitoring.

Take, for example, a breakdown by dimensions such as client version, country, and operator. More often, though, the problem lies with a specific service, in which case it can easily be pinpointed using dimensions like domain name or error code.

To achieve accurate automated alerting on top of monitoring, we also face some challenges. The difficulty is that if the rules are too strict we will miss real problems (false negatives), while if they are too loose we will get too many false alarms (false positives).

In the industry there are generally two approaches to alerting. One is rule-based, for example alerting when the failure rate rises significantly compared with history, or when traffic drops sharply; a tiny rule of this kind is sketched below. The other is intelligent alerting based on time-series analysis or neural networks: users don’t need to write any rules, and enough historical data is sufficient for automatic alerting. The accuracy of intelligent alerting still has some issues today, so adding a small number of rules on top of the intelligent algorithm may be a better option.
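
To make the rule-based idea concrete, here is a toy sketch that flags a failure rate well above its historical distribution; the 3-sigma threshold and the 1% floor are arbitrary choices for illustration.

public class FailureRateRule {
    // Alert when the current failure rate is far above the historical mean
    public static boolean shouldAlert(double currentRate, double[] historicalRates) {
        if (historicalRates.length == 0) {
            return false;
        }
        double mean = 0;
        for (double r : historicalRates) mean += r;
        mean /= historicalRates.length;

        double variance = 0;
        for (double r : historicalRates) variance += (r - mean) * (r - mean);
        double stdDev = Math.sqrt(variance / historicalRates.length);

        // Require both a statistical deviation and an absolute minimum failure rate
        return currentRate > mean + 3 * stdDev && currentRate > 0.01;
    }
}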

If we receive a network alert in production, we can make a rough judgment with the help of the access-layer and client monitoring reports. But how do we pin down the root cause? Can we obtain the user’s complete network logs, or even diagnose the user’s network conditions remotely? I will cover “how to quickly locate network problems through network logs and remote diagnosis” separately in the second module of this column.

Summary #

Monitoring, monitoring, and more monitoring. Many performance optimization tasks actually rely more on monitoring than actual optimization.

Why is monitoring so important? For large companies, each project may involve hundreds or even thousands of people, and what they want is not just to get something done today or in this version, but to keep the application at high quality every day and in every version. On the other hand, with a well-established analysis and monitoring platform, we can simplify complex work and turn optimization that once seemed out of reach into something everyone can do.

Finally, a personal thought: at work, I hope you can look further ahead and think from a higher vantage point. Beyond doing something well yourself, think about how to keep others from making the same mistakes, or how to help everyone do it better.

Homework #

For network issues, what monitoring methods have you tried? Have you encountered any network failures that left a deep impression on you? How were they ultimately resolved? Please leave a comment to discuss with me and other classmates.

Today’s Sample uses PLT Hook to proxy several important Socket-related functions. This time it adds an extra method that hooks all already-loaded libraries at once.

int hook_plt_method_all_lib(const char* exclueLibname, const char* name, hook_func hook) {
  if (refresh_shared_libs()) {
    // Could not properly refresh the cache of shared library data
    return -1;
  }
  int failures = 0;
  // Hook `name` in every loaded library except the one that defines it
  for (auto const& lib : allSharedLibs()) {
    if (strcmp(lib.first.c_str(), exclueLibname) != 0) {
      failures += hook_plt_method(lib.first.c_str(), name, hook);
    }
  }
  return failures;
}

I hope that through these exercises, you can learn to apply Hook technology in practice.

Feel free to click “Ask a friend to read” to share today’s content with your friends and invite them to learn together. Don’t forget to submit today’s homework in the comment section. I have also prepared a generous “study encouragement package” for students who complete their homework diligently. I look forward to learning and progressing together with you.