31 Application Performance Management: How to Monitor the User Experience

Hello, I’m Tang Yang.

In the previous lesson, I introduced the process of setting up server-side monitoring. With monitoring reports in place, your team can identify problems earlier and have intuitive tools to help analyze and troubleshoot issues when maintaining the vertical e-commerce system.

However, you soon find that some problems cannot be identified, or even noticed, from server-side monitoring reports. For example, a user may report that creating an order failed, yet the server-side reports show no obvious performance fluctuation, and the raw logs stored in Elasticsearch contain no record of the order creation request at all. A client bug or network jitter may have prevented the request from ever reaching the server.

Another example: some users on Great Wall Broadband report that the product details page takes a long time to open, and some even experience DNS resolution failures. So, how do we troubleshoot and optimize when we encounter these types of issues?

This is where Application Performance Management (APM) comes in: comprehensive monitoring of every aspect of an application, with the aim of identifying potential issues and resolving them promptly, thereby improving system performance and availability.

Does this sound similar to the server-side monitoring mentioned earlier? The difference is one of focus: server-side monitoring concentrates on the performance and availability of backend services, whereas application performance management concentrates on the end user’s experience. This means you need to measure the performance of the entire end-to-end chain, from the moment the client sends a request to the moment the response data arrives back at the client.

If you can achieve this, both the failed order creation and the slow page loads for Great Wall Broadband users can be detected and investigated with this monitoring system. So, how do you set up such an end-to-end monitoring system?

How to build an APM system

As with the server-side monitoring system, building an end-to-end application performance management (APM) system comes down to three parts: data collection, storage, and display.

Firstly, in terms of data collection, we can use an agent-like approach to embed an SDK on the client side. The SDK is responsible for collecting information and, after sampling, sending it to the server periodically through a fixed interface. This fixed interface and server are referred to as the APM channel service.
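The sample-then-report-periodically behavior described above can be sketched roughly as follows. This is a minimal illustration, not the SDK from the lesson: the class name `ApmReporter`, the `sampleRate` parameter, and the `sender` callback (a stand-in for the HTTP call to the APM channel service) are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Consumer;

/** Hypothetical SDK-side buffer: samples events, then hands them to a sender in batches. */
final class ApmReporter {
    private final double sampleRate;              // 0.0 drops every event, 1.0 keeps every event
    private final Consumer<List<String>> sender;  // stand-in for the call to the channel service
    private final List<String> buffer = new ArrayList<>();

    ApmReporter(double sampleRate, Consumer<List<String>> sender) {
        this.sampleRate = sampleRate;
        this.sender = sender;
    }

    /** Keeps the event with probability sampleRate and reports whether it was kept. */
    synchronized boolean record(String eventJson) {
        if (ThreadLocalRandom.current().nextDouble() >= sampleRate) {
            return false;
        }
        buffer.add(eventJson);
        return true;
    }

    /** Sends everything buffered so far as one batch; intended to be called on a timer. */
    synchronized int flush() {
        if (buffer.isEmpty()) {
            return 0;
        }
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        sender.accept(batch);
        return batch.size();
    }
}
```

On a real client, `flush` could be driven by a `ScheduledExecutorService` at a fixed interval, and the sample rate could be delivered by the server so that it can be tuned without releasing a new client version.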

Although there are many metrics that need to be monitored on the client side, such as network conditions, client stuttering, garbage collection data, etc., we can define a common data collection format.

For example, at my previous company, the collected data consisted of the following four parts. The SDK converts these parts into JSON and sends them to the APM channel service. When building your own APM system, you can use this format as a reference.

System section: This includes the version number of the data protocol, as well as the lengths of the message header, client message body, and business message body mentioned below.

Message header: This mainly includes the application identifier (appkey), timestamp of message generation, message signature, and encryption key for the message body.

Client message body: This mainly stores relevant information about the client, including client version number, SDK version number, IDFA, IDFV, IMEI, device model, channel number, carrier, network type, operating system type, country, region, latitude, and longitude. Since some of this information is sensitive, we generally encrypt it.

Business message body: This refers to the actual data to be collected, which also needs to be encrypted.
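As a rough illustration of the four-part layout above, the sketch below assembles such a message as a JSON string. The field names (`appkey`, `timestamp`, `signature`, `bodyKey`, `headerLen`, and so on) are hypothetical, since the lesson does not give the actual names, and the two bodies are assumed to be already encrypted and Base64-encoded so that they need no JSON escaping.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical envelope builder; field names are illustrative only. */
final class ApmMessage {

    /** Message header: app identifier, creation timestamp, signature, encrypted body key. */
    static String header(String appKey, long timestampMillis, String signature, String wrappedBodyKey) {
        return "{\"appkey\":\"" + appKey + "\",\"timestamp\":" + timestampMillis
                + ",\"signature\":\"" + signature + "\",\"bodyKey\":\"" + wrappedBodyKey + "\"}";
    }

    /** Full message: the system section carries the protocol version and the length of each part. */
    static String assemble(int protocolVersion, String headerJson, String clientBody, String bizBody) {
        String system = "{\"version\":" + protocolVersion
                + ",\"headerLen\":" + utf8Length(headerJson)
                + ",\"clientLen\":" + utf8Length(clientBody)
                + ",\"bizLen\":" + utf8Length(bizBody) + "}";
        return "{\"system\":" + system + ",\"header\":" + headerJson
                + ",\"client\":\"" + clientBody + "\",\"biz\":\"" + bizBody + "\"}";
    }

    private static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

In production code a JSON library would normally handle serialization and escaping; plain string concatenation is used here only to keep the sketch free of dependencies.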

The encryption works as follows. First, we allocate an RSA public/private key pair to the application. When the SDK starts, it requests the RSA public key from a policy service. To encrypt a message, the client randomly generates a symmetric encryption key and uses it to encrypt the client message body and the business message body. So, how does the APM channel service decrypt the data once it arrives?

The client encrypts the symmetric key with the RSA public key and stores the result in the message header (this is the message body encryption key mentioned above). The APM channel service uses the RSA private key to recover the symmetric key, which it then uses to decrypt the client message body and the business message body.
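The key exchange described above can be sketched with the standard Java cryptography APIs. The lesson only specifies RSA plus a symmetric key, so the concrete choices here (AES-128 in GCM mode, RSA-OAEP padding, a 12-byte IV prepended to the ciphertext) and the class name `HybridCrypto` are assumptions made for illustration. `newRsaKeyPair` stands in for the key pair that the policy service would manage.

```java
import java.security.GeneralSecurityException;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PrivateKey;
import java.security.PublicKey;
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

final class HybridCrypto {
    // Algorithm choices are assumptions; the lesson only says "RSA" and "symmetric encryption".
    private static final String RSA = "RSA/ECB/OAEPWithSHA-256AndMGF1Padding";
    private static final String AES = "AES/GCM/NoPadding";
    private static final int IV_BYTES = 12;
    private static final int TAG_BITS = 128;

    /** Stand-in for the per-application key pair managed on the server side. */
    static KeyPair newRsaKeyPair() {
        try {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            return generator.generateKeyPair();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Client: randomly generate the symmetric key used for both message bodies. */
    static SecretKey newBodyKey() {
        try {
            KeyGenerator generator = KeyGenerator.getInstance("AES");
            generator.init(128);
            return generator.generateKey();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Client: encrypt one body with a fresh IV; the output is IV followed by ciphertext. */
    static byte[] encryptBody(SecretKey key, byte[] plain) {
        try {
            byte[] iv = new byte[IV_BYTES];
            new SecureRandom().nextBytes(iv);
            Cipher cipher = Cipher.getInstance(AES);
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] encrypted = cipher.doFinal(plain);
            byte[] out = new byte[IV_BYTES + encrypted.length];
            System.arraycopy(iv, 0, out, 0, IV_BYTES);
            System.arraycopy(encrypted, 0, out, IV_BYTES, encrypted.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Client: encrypt the symmetric key with the RSA public key for the message header. */
    static byte[] wrapKey(PublicKey rsaPublic, SecretKey key) {
        try {
            Cipher cipher = Cipher.getInstance(RSA);
            cipher.init(Cipher.ENCRYPT_MODE, rsaPublic);
            return cipher.doFinal(key.getEncoded());
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Channel service: recover the symmetric key with the RSA private key. */
    static SecretKey unwrapKey(PrivateKey rsaPrivate, byte[] wrappedKey) {
        try {
            Cipher cipher = Cipher.getInstance(RSA);
            cipher.init(Cipher.DECRYPT_MODE, rsaPrivate);
            return new SecretKeySpec(cipher.doFinal(wrappedKey), "AES");
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Channel service: split off the IV and decrypt the body. */
    static byte[] decryptBody(SecretKey key, byte[] ivAndCiphertext) {
        try {
            Cipher cipher = Cipher.getInstance(AES);
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, ivAndCiphertext, 0, IV_BYTES));
            return cipher.doFinal(ivAndCiphertext, IV_BYTES, ivAndCiphertext.length - IV_BYTES);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The point of the two-layer design is that RSA only ever encrypts the short symmetric key, while the much larger message bodies go through the faster symmetric cipher.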

Finally, we concatenate the message header, client message body, business message body, and the timestamp from the message header, and compute an MD5 digest over the result. This digest is stored in the message header as the message signature. After receiving the message, the APM channel service computes the digest with the same algorithm and compares it with the one that was sent, which reveals whether the message was altered in transit.
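The digest step can be sketched as below. The lesson does not specify the exact concatenation order or any separator, so the order used here and the newline separator between parts are assumptions, as is the class name `MessageSigner`. The `header` argument is assumed to be the header content without the signature field itself.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class MessageSigner {

    /** Returns the MD5 digest of the message parts as 32 lowercase hex characters. */
    static String sign(String header, String clientBody, String bizBody, long timestampMillis) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            String[] parts = {header, clientBody, bizBody, Long.toString(timestampMillis)};
            for (String part : parts) {
                md5.update(part.getBytes(StandardCharsets.UTF_8));
                md5.update((byte) '\n'); // assumed separator, so part boundaries affect the digest
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    /** Channel service side: recompute the digest and compare it with the one received. */
    static boolean verify(String header, String clientBody, String bizBody,
                          long timestampMillis, String receivedSignature) {
        byte[] expected = sign(header, clientBody, bizBody, timestampMillis)
                .getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, receivedSignature.getBytes(StandardCharsets.UTF_8));
    }
}
```

Both sides must agree on the part order and separator; any difference makes every message fail verification.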

Once the data is collected by the APM channel service, we first parse the JSON message to obtain the specific data, and then send it to a message queue. After consuming the data from the message queue, a copy of the data is written to Elasticsearch to be saved as raw data, and another copy is written to the analytics platform to generate client reports.
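The flow in the previous paragraph (enqueue, consume, write two copies) can be illustrated with an in-memory sketch. Everything here is a stand-in: a `LinkedBlockingQueue` plays the role of the message queue, and two plain lists play the roles of Elasticsearch and the analytics platform. The class name `ChannelPipeline` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

final class ChannelPipeline {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(); // stand-in for the message queue
    final List<String> rawStore = new ArrayList<>();        // stand-in for Elasticsearch (raw data)
    final List<String> analyticsStore = new ArrayList<>();  // stand-in for the analytics platform

    /** Channel service: after parsing a message, publish the extracted record. */
    boolean publish(String record) {
        return queue.offer(record);
    }

    /** Consumer: drain pending records and write one copy to each destination. */
    int consumeAll() {
        List<String> drained = new ArrayList<>();
        queue.drainTo(drained);
        for (String record : drained) {
            rawStore.add(record);
            analyticsStore.add(record);
        }
        return drained.size();
    }
}
```

Putting a queue between the channel service and the storage writers means a slow or unavailable downstream system delays consumption rather than blocking clients that are reporting data.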

With this APM channel service, we can report the information collected from the client side to the server for centralized processing in a unified manner. This way, you can collect performance and business data from the client side and promptly detect issues.

Now the question is: with the client monitoring system in place, should you monitor all of the network data, stuttering data, and so on in the e-commerce client, or focus on specific areas? Monitoring without a clear focus makes troubleshooting harder, so next we will decide which user information needs to be monitored.

Which user information needs to be monitored

In my opinion, the primary goal of building an end-to-end monitoring system is to address the issue of monitoring client networks, as most of the problems we encounter are related to client networks.

In the complex network environment in China, the major network operators run independent networks whose link quality varies from region to region, while smaller operators are a mixed bag whose service quality cannot be guaranteed. Let me give you a typical example.

When discussing DNS earlier, I mentioned that DNS resolution first queries the operator’s local DNS in order to shorten the lookup. However, local DNS can be unreliable. To save on cross-network traffic, some small operators resolve certain domain names to their own content caching servers, or even to advertising or phishing websites. This is called domain hijacking. Other operators do not perform resolution themselves and simply forward the queries to another operator. The authoritative DNS then sees the query arriving from the other operator’s IP address and returns an IP suited to that operator rather than to the user’s own. As a result, the resolved IP and the user belong to different operators, which leads to cross-network traffic and longer DNS resolution times. These issues need to be monitored in real time so that they can be identified quickly and reported to the operators for resolution.

So, how do we collect network data? Generally, we instrument the client to record when each step of a network request occurs and how long it takes. Let me explain how this is done using Android as an example.

On Android, API requests are usually made with OkHttp. OkHttp provides the EventListener abstract class, which lets the caller receive network request events such as the start and end of DNS resolution, so you can measure the duration of each stage of a request. Here is example code that measures the DNS resolution time of a request, which you can use as a reference.

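import java.net.InetAddress;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import okhttp3.Call;
import okhttp3.EventListener;
import okhttp3.HttpUrl;

public class HttpEventListener extends EventListener {
    private static final AtomicLong nextCallId = new AtomicLong(1L);

    private final long callId;   // Unique identifier for this request
    private final HttpUrl url;
    private long dnsStartTime;   // System.nanoTime() value when DNS resolution started

    public HttpEventListener(HttpUrl url) {
        this.callId = nextCallId.getAndIncrement(); // Initialize a unique identifier for this request
        this.url = url;
    }

    @Override
    public void dnsStart(Call call, String domainName) {
        super.dnsStart(call, domainName);
        this.dnsStartTime = System.nanoTime(); // Record the start time of DNS resolution
    }

    @Override
    public void dnsEnd(Call call, String domainName, List<InetAddress> inetAddressList) {
        super.dnsEnd(call, domainName, inetAddressList);
        // nanoTime() differences are in nanoseconds; convert to milliseconds before reporting
        long dnsMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - dnsStartTime);
        System.out.println("call: " + callId + ", url: " + url.host() + ", DNS time: " + dnsMillis + " ms");
    }
}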

With this EventListener in place, you can inject it when building the OkHttpClient using the following code:

OkHttpClient.Builder builder = new OkHttpClient.Builder()
        .eventListenerFactory(new EventListener.Factory() {
            @Override
            public EventListener create(Call call) {
                // One listener instance per call, so timings of concurrent requests do not mix
                return new HttpEventListener(call.request().url());
            }
        });

In this way, we can obtain the duration of each process during a request, including the following main items:

Waiting time: For asynchronous calls, the request is first placed in a local queue and handled by a dedicated I/O thread, so there is a waiting period before the I/O thread actually processes the request.
DNS time: The time taken for DNS resolution.
Handshake time: The time taken for the TCP handshake.
SSL time: For HTTPS services, the additional time taken for the SSL/TLS handshake.
Sending time: The time taken to send the request data.
First byte time: The time the client waits to receive the first response packet from the server.
Package receiving time: The time taken to receive the complete response.
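One way to derive these durations is to record a timestamp for each event and subtract. The sketch below is independent of OkHttp so that it can run on its own, but the mark names are borrowed from OkHttp `EventListener` callbacks (`callStart`, `dnsStart`, `secureConnectStart`, and so on). The mapping from marks to phases is an assumption that fits an HTTPS request with a body on a fresh connection; the class name `RequestTimeline` is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RequestTimeline {
    private final Map<String, Long> marks = new LinkedHashMap<>();

    /** Records the System.nanoTime() value at which an event fired. */
    void mark(String event, long nanos) {
        marks.put(event, nanos);
    }

    /** Milliseconds between two marks, or -1 if either did not fire (e.g. no TLS, reused connection). */
    long millisBetween(String from, String to) {
        Long start = marks.get(from);
        Long end = marks.get(to);
        if (start == null || end == null) {
            return -1;
        }
        return (end - start) / 1_000_000;
    }

    /** Phase durations in milliseconds, in the order listed in the text. */
    Map<String, Long> phases() {
        Map<String, Long> out = new LinkedHashMap<>();
        out.put("wait", millisBetween("callStart", "dnsStart"));
        out.put("dns", millisBetween("dnsStart", "dnsEnd"));
        out.put("handshake", millisBetween("connectStart", "secureConnectStart"));
        out.put("ssl", millisBetween("secureConnectStart", "secureConnectEnd"));
        out.put("send", millisBetween("requestHeadersStart", "requestBodyEnd"));
        out.put("firstByte", millisBetween("requestBodyEnd", "responseHeadersStart"));
        out.put("receive", millisBetween("responseHeadersStart", "responseBodyEnd"));
        return out;
    }
}
```

In a listener like the one shown earlier, each overridden callback would call `mark` with its own name and `System.nanoTime()`, and the resulting map would go into the business message body of the APM report.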

We can send this data to the server through the APM channel service described above. Both the server and client teams can then query the raw data in Elasticsearch, aggregate and analyze it, and generate client request monitoring reports, which lets us optimize specific stages of an HTTP request.

Monitoring user networks brings three main benefits:
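As an example of the kind of aggregation such a report needs, the sketch below groups DNS timings by carrier and computes a percentile for each group. It runs in memory purely for illustration; in practice the grouping would more likely be done by the storage or analytics layer. The `Sample` type, the choice of the 95th percentile, and the nearest-rank method are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class DnsReport {

    /** One reported measurement: which carrier the client was on and how long DNS took. */
    static final class Sample {
        final String carrier;
        final long dnsMillis;

        Sample(String carrier, long dnsMillis) {
            this.carrier = carrier;
            this.dnsMillis = dnsMillis;
        }
    }

    /** Nearest-rank percentile of a non-empty list; p is in the range (0, 100]. */
    static long percentile(List<Long> values, double p) {
        List<Long> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }

    /** 95th percentile DNS time per carrier, sorted by carrier name. */
    static Map<String, Long> p95ByCarrier(List<Sample> samples) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Sample s : samples) {
            grouped.computeIfAbsent(s.carrier, k -> new ArrayList<>()).add(s.dnsMillis);
        }
        Map<String, Long> result = new TreeMap<>();
        grouped.forEach((carrier, values) -> result.put(carrier, percentile(values, 95)));
        return result;
    }
}
```

Percentiles are often more useful than averages here, because a small group of users on a poor network can be hidden by a healthy mean.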

Firstly, all the monitoring data in this user network monitoring system is derived from the client, providing real-time and accurate feedback on user experience.

Secondly, it gives performance optimization a measurable target. Whenever you carry out business architecture refactoring, service performance optimization, or network optimization, the user-side performance data shows the effect, guiding improvements in interface performance, availability, and other metrics.

Lastly, it can help monitor the quality of CDN links. Previously, the monitoring of CDNs heavily relied on CDN vendors, which raised a problem: CDN vendors cannot obtain full-link monitoring data from clients. Sometimes, if there are issues with the client to CDN link, the CDN vendor is not aware of it. Client monitoring addresses this deficiency and can prompt timely optimization and adjustment of problematic routes through alert mechanisms.

In addition to network data, we can also report exceptional events, such as failures to log in, place orders, load product information, or rate and comment on products in your vertical e-commerce system. You can detect these exceptions in your business logic code and upload them to the server through the same APM channel service. This makes troubleshooting easier for both server and client teams and provides supporting data for gray releases of new versions.

Overall, if the system is the skeleton, then the monitored data is the soul: data is the substance of monitoring, and the system is only the medium that presents it. You therefore need to keep improving data collection as you operate and maintain the system, which is also how your monitoring system keeps evolving.

Course Summary

That wraps up this lesson, in which I walked you through setting up an end-to-end APM system. The key points you need to understand are:

The data collected from the client side can be uploaded to the APM server using a common message format. The server stores the data in Elasticsearch, which allows for querying of raw logs and generating monitoring reports for the client side.
User network data is important for troubleshooting the interaction between the client and server. You can obtain this data by embedding code in your application.
Whether it is network data, exception data, or information related to latency, crashes, traffic, or power consumption, you can package them into APM message format and upload them to the APM server. The traces left by users on the client side can help you optimize their user experience.

To sum up, monitoring and optimizing the user experience is the ultimate goal of application performance management. However, server-side developers often fall into the misconception that ensuring good performance and availability of their service interfaces is enough. In reality, interface response time is only a small part of the monitoring picture, and an end-to-end system that covers the entire application flow is what your monitoring should ultimately grow into.