16 Network Optimization (Part 2): How to Optimize Complex and Changing Mobile Networks #

In the era of the PC internet, network optimization was already a very complex task. For mobile networks, issues such as weak signals, network switching, and network hijacking are even more prominent, making network optimization an increasingly challenging task.

So as mobile developers, how should we optimize our apps in the face of this complex and ever-changing mobile network? Some may argue that with mature network libraries such as AFNetworking or OkHttp, no additional optimization is needed. But are you sure you are really getting the most out of these libraries? How are they implemented internally, what are the differences between them, and which one is better?

While we may only be client app developers, there is still a lot we can do when it comes to network optimization, and many large-scale applications have already accumulated plenty of practical experience here. Today, let's explore how to keep our applications one step ahead under different network conditions.

Mobile Optimization #

Recalling the network architecture diagram I provided last time, a data packet from a mobile phone needs to pass through the wireless network, core network, and external network (Internet) before reaching our server. What factors affect the speed of the entire network request?

From the above diagram, factors that affect network request speed include client network library implementation, server performance, and network link quality. Now let’s start with the client’s network library to see how we can optimize the network.

1. What is network optimization?

Before talking about how to optimize the network, I want to clarify what network optimization actually means. In my opinion, it has the following three core elements:

  • Speed: How to better utilize bandwidth and further improve network request speed when the network is normal or good.

  • Weak network: Mobile networks are complex and variable. How do we keep requests succeeding as much as possible when the connection is unstable or the signal is weak?

  • Security: Network security cannot be ignored. How to effectively prevent third-party hijacking, eavesdropping, and even tampering.

In addition to these three issues, we may also be concerned about power consumption and traffic issues caused by network requests. We will discuss these two aspects together later and will not go into detail today.

So how should we start optimizing for speed, weak network, and security? First, you need to understand the entire process of a network request.

From the diagram, we can see that a network request breaks down into several steps, and the total request time can be split across those steps (a sketch for measuring each phase on the client follows the list below).

  • DNS resolution: Obtain the IP address of the corresponding domain name through the DNS server. In this step, we pay attention to DNS resolution time, the hijacking of LocalDNS by the network operator, DNS scheduling, and other issues.

  • Establishing a connection: Establish a connection with the server, including the TCP three-way handshake and TLS key negotiation. The key optimization points are how to choose among multiple IPs/ports, whether to use HTTPS, and whether connection setup time can be reduced or even eliminated.

  • Sending/receiving data: After a successful connection is established, you can interact with the server, assemble data, send data, receive data, and parse data. We are concerned about how to utilize bandwidth according to the network conditions, how to quickly detect network latency, and how to adjust packet size in weak networks.

  • Closing the connection: Closing a connection seems simple, but there are complexities involved. Here, we mainly focus on active and passive connection closures. Generally, we want the client to be able to close the connection actively.
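
To see where the time actually goes, the client can instrument each of these phases. Below is a minimal sketch using OkHttp's EventListener (the URL is a placeholder); it prints how long DNS resolution, connection setup (TCP plus TLS), and the whole call take:

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.List;
import okhttp3.Call;
import okhttp3.EventListener;
import okhttp3.OkHttpClient;
import okhttp3.Protocol;
import okhttp3.Request;
import okhttp3.Response;

// Records per-phase timings for a single call. In production code, create one
// listener per call via eventListenerFactory instead of sharing mutable state.
class TimingEventListener extends EventListener {
    private long callStart, dnsStart, connectStart;

    @Override public void callStart(Call call) { callStart = System.nanoTime(); }

    @Override public void dnsStart(Call call, String domainName) { dnsStart = System.nanoTime(); }

    @Override public void dnsEnd(Call call, String domainName, List<InetAddress> addresses) {
        System.out.println("DNS resolution: " + (System.nanoTime() - dnsStart) / 1_000_000 + " ms");
    }

    @Override public void connectStart(Call call, InetSocketAddress address, Proxy proxy) {
        connectStart = System.nanoTime();
    }

    @Override public void connectEnd(Call call, InetSocketAddress address, Proxy proxy, Protocol protocol) {
        System.out.println("Connect (TCP + TLS): " + (System.nanoTime() - connectStart) / 1_000_000 + " ms");
    }

    @Override public void callEnd(Call call) {
        System.out.println("Total call: " + (System.nanoTime() - callStart) / 1_000_000 + " ms");
    }
}

public class RequestTimingDemo {
    public static void main(String[] args) throws IOException {
        OkHttpClient client = new OkHttpClient.Builder()
                .eventListener(new TimingEventListener())
                .build();
        Request request = new Request.Builder().url("https://example.com/").build();
        try (Response response = client.newCall(request).execute()) {
            System.out.println("HTTP status: " + response.code());
        }
    }
}
```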

Network optimization aims to reduce the time consumption of each step and create a fast, stable, and secure high-quality network around the core elements of speed, weak network, and security.

2. What is a network library?

In actual development work, we rarely directly handle low-level network interfaces like in “UNIX Network Programming”. Generally, we use network libraries. OkHttp, developed by Square, is currently the most popular Android networking library. It has also been included in the Android system by Google to provide networking services for developers.

So what role does a network library play? In my opinion, it abstracts the complexity of lower-level network interfaces, allowing us to use network requests more efficiently.

As shown in the above image, a network library has the following three main functions:

  • Unified programming interface. Whether a request is synchronous or asynchronous, the interface is simple and easy to use. We can also manage request strategies centrally and perform unified stream parsing (JSON, XML, Protocol Buffers), etc. (A sketch of such an interface follows this list.)

  • Global network control. Within the network library, we can perform unified network scheduling, traffic monitoring, and disaster recovery management.

  • High performance. Since we delegate all network requests to the network library, it is crucial for the library to achieve high performance. To achieve high performance, I pay close attention to speed, CPU, memory, I/O usage, as well as failure rate, crash rate, and protocol compatibility.
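
To make the "unified programming interface" point concrete, here is a small sketch using OkHttp (the URL is a placeholder): the same client and request objects serve both the synchronous and the asynchronous style, while the library hides threading and connection management behind them.

```java
import java.io.IOException;
import okhttp3.Call;
import okhttp3.Callback;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

public class UnifiedApiDemo {
    public static void main(String[] args) throws IOException {
        OkHttpClient client = new OkHttpClient();
        Request request = new Request.Builder().url("https://example.com/api").build();

        // Synchronous request: blocks the current thread until the response arrives.
        try (Response response = client.newCall(request).execute()) {
            System.out.println("sync: " + response.code());
        }

        // Asynchronous request: dispatched onto OkHttp's own thread pool.
        client.newCall(request).enqueue(new Callback() {
            @Override public void onFailure(Call call, IOException e) {
                e.printStackTrace();
            }
            @Override public void onResponse(Call call, Response response) throws IOException {
                try (Response r = response) {
                    System.out.println("async: " + r.code());
                }
            }
        });
    }
}
```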

Different network libraries have significant differences in their internal implementations, and there are a few key modules worth comparing.

Now, which network library is the best? Next, let’s compare the internal implementations of these three network libraries: OkHttp, Chromium’s Cronet, and WeChat’s Mars.

3. High-quality network libraries

From my understanding, Mogujie, Toutiao, and UC Browser have all done secondary development based on the Chromium network library. WeChat’s Mars has made extensive optimizations for weak networks, and applications such as Pinduoduo, Huya, Lianjia, and Meilishuo are all using Mars.

Let’s now compare the core implementations of several network libraries. As someone who has been involved in network library-related work, I have quite a bit of experience. I have been involved in the development of Mars during my time at WeChat, and I am currently doing secondary development based on the Chromium network library.

Why have I never used OkHttp? Mainly because it does not support cross-platform usage, which is very important for large applications. We do not want to implement optimizations separately for Android and iOS, as it not only wastes manpower but also tends to cause problems.

As for Mars, it is a cross-platform socket layer solution that does not support the complete HTTP protocol. So strictly speaking, Mars is not a complete network library. However, it has made extensive optimizations for weak networks and connections, and it supports long connections. For more details about Mars’ network optimizations, you can refer to the article list on the right side of the Wiki.

As a standard network library, the Chromium network library is almost flawless, and we can also benefit from Google's subsequent network optimizations, such as TLS 1.3 and QUIC support.

However, it hasn’t done much customized optimization for weak network scenarios, and it also doesn’t support long connections. In fact, my main work in secondary development of the Chromium network library is to fill these two gaps in weak network optimization and long connection support.

Large-scale Network Platforms #

For large companies, the focus should not be limited to unifying the client network library. Network optimization is never purely a client-side job, so they typically build a unified network middleware platform that provides a complete networking solution for both the client and the backend.

Ali’s ACCS, Ant’s mPaaS, and Ctrip’s network services are all company-level network middleware services. This way, every network optimization benefits all the applications across the group that connect through them.

The following diagram shows the network architecture of mPaaS. All network requests will first go through a unified access layer before being forwarded to the business servers. This allows us to perform various network optimizations at the access layer without affecting the business servers.

1. HTTPDNS

DNS resolution is the first task of our network request. By default, we use the LocalDNS service provided by the network operator. The time spent on DNS resolution can be around 200-300ms on a 3G network and 100ms on a 4G network.

Slow resolution is not even the biggest problem with the default LocalDNS; it has several other issues:

  • Stability: DNS queries run over stateless UDP and are prone to domain hijacking (which is hard to reproduce, locate, and resolve). At least millions of domain resolutions are hijacked every day, and there are at least ten large-scale incidents per year.
  • Accuracy: LocalDNS scheduling is often inaccurate. For example, users in Beijing may be routed to IP addresses in Guangdong, users on one operator may be routed to another operator’s IPs (for example, China Mobile users sent to China Telecom IPs), and such cross-operator scheduling makes access slow or even impossible.
  • Timeliness: Network operators may modify the TTL (Time to Live) of DNS records, resulting in delays in DNS updates. Additionally, different network operators implement DNS services inconsistently, making it difficult for us to guarantee the performance of DNS resolution.

To solve these issues, HTTPDNS was introduced. In simple terms, it means performing domain name resolution ourselves by making an HTTP request to the backend to obtain the IP address corresponding to the domain name, thereby directly solving all the aforementioned problems.

WeChat has deployed its own NEWDNS, and Alibaba Cloud and Tencent Cloud also provide HTTPDNS services. For large-scale network platforms, we will have a unified HTTPDNS service and integrate it with the operation and maintenance system. In addition to the traditional DNS functionality, we will also add features such as precise traffic scheduling, network testing/grayscale testing, and network fault tolerance.

For more information about HTTPDNS, you can refer to Baidu’s “DNS Optimization” (in Chinese). For the client-side, we can obtain a batch of IP addresses for domain names in advance using pre-requests. However, we need to pay attention to the choice between the IPv4 and IPv6 protocol stacks.
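
On the client side, one common way to plug HTTPDNS into the network library is OkHttp's Dns interface. The sketch below is illustrative only: the actual HTTP request to an HTTPDNS backend is omitted, and prefetch() is a hypothetical helper that such a request would feed.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import okhttp3.Dns;

// Plugs a (hypothetical) HTTPDNS service into OkHttp: the library asks us for IPs
// instead of going through the operator's LocalDNS.
class HttpDns implements Dns {
    // Cache of domain -> IPs, pre-fetched from the HTTPDNS backend (e.g. at app start).
    private final Map<String, List<InetAddress>> cache = new ConcurrentHashMap<>();

    void prefetch(String hostname, List<String> ips) throws UnknownHostException {
        List<InetAddress> addresses = new ArrayList<>();
        for (String ip : ips) {
            addresses.add(InetAddress.getByName(ip)); // literal IP, no DNS query issued
        }
        cache.put(hostname, addresses);
    }

    @Override
    public List<InetAddress> lookup(String hostname) throws UnknownHostException {
        List<InetAddress> cached = cache.get(hostname);
        if (cached != null && !cached.isEmpty()) {
            return cached; // use the HTTPDNS result, bypassing LocalDNS
        }
        return Dns.SYSTEM.lookup(hostname); // fall back to the system resolver
    }
}
```

The resolver is registered with `new OkHttpClient.Builder().dns(new HttpDns()).build()`. When a domain has both IPv4 and IPv6 addresses, the order in which we return them is exactly the protocol-stack choice mentioned above.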

2. Connection Reuse

After DNS resolution, we come to the step of establishing a connection. Establishing a connection involves the three-way handshake of TCP and TLS key negotiation, and the cost of connection establishment is very high. The main optimization strategy here is to reuse connections so that we do not need to establish a new connection for every request.

As I mentioned earlier, the network library does not immediately release connections but instead places them in a connection pool. If there is another request with the same domain name and port, the connection from the connection pool is used for sending and receiving data, reducing the time spent on connection establishment.

Here, we utilize the keep-alive mechanism in the HTTP protocol, and HTTP/2.0 multiplexing further improves connection reuse. It allows multiple requests to be processed simultaneously on a single connection.

Although H2 is very powerful, two problems remain. One is that an HTTP/2 connection can only carry requests for the same domain; the other is that backend services need extra changes to support HTTP/2.0. Here we can make the change at the unified access layer instead: it terminates HTTP/2, converts the traffic to HTTP/1.1, and forwards it to the server for that domain.

This way, all services enjoy the benefits of HTTP/2.0 without any changes of their own. Note, however, that H2 multiplexing still runs over a single TCP connection; if requests for all domains are funneled into one connection, TCP head-of-line blocking becomes severe under network congestion.

For client network libraries, whether OkHttp or the Chromium network library, only one HTTP/2.0 connection is kept per domain. For some third-party requests, especially file downloads and video playback, the server may throttle speed per connection. In such cases we can either modify the network library implementation or simply disable HTTP/2.0, as in the sketch below.
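
As a rough illustration with OkHttp (the pool sizes are arbitrary), connection reuse can be tuned through the connection pool, and HTTP/2.0 can be switched off per client for the download and streaming cases just described:

```java
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import okhttp3.ConnectionPool;
import okhttp3.OkHttpClient;
import okhttp3.Protocol;

public class ConnectionTuningDemo {
    public static void main(String[] args) {
        // Keep more idle connections alive, and for longer, so repeated requests to the
        // same host:port can skip the TCP/TLS handshake entirely.
        OkHttpClient pooledClient = new OkHttpClient.Builder()
                .connectionPool(new ConnectionPool(10, 5, TimeUnit.MINUTES))
                .build();

        // For download/streaming hosts that rate-limit per connection, one option is to
        // force HTTP/1.1 so the library opens several parallel connections instead of
        // multiplexing everything onto a single HTTP/2 connection.
        OkHttpClient http1Client = pooledClient.newBuilder()
                .protocols(Arrays.asList(Protocol.HTTP_1_1))
                .build();
    }
}
```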

3. Compression and Encryption

Compression

After discussing connections, let’s take a look at the optimization of data transmission. The first thing that comes to mind is reducing the amount of data to be transmitted, which is commonly known as data compression. For HTTP requests, the data mainly consists of three parts:

  • Request URL

  • Request header

  • Request body

Since HTTP/2.0's own header compression (HPACK) already takes care of the headers, the request URL and request body are the main parts left to compress.

The request URL usually carries many common parameters, most of which never change. The client only needs to upload these once; for subsequent requests, the access layer can expand them back before forwarding.

As for the request body, on one hand, we need to consider the choice of data communication protocol. Currently, the two most popular data serialization methods in network transmission are JSON and Protocol Buffers. As I mentioned before, Protocol Buffers is more complex to use, but it has significant advantages in data compression rate, serialization, and deserialization speed.

On the other hand, we need to choose a compression algorithm. Common choices include gzip, Google’s Brotli, and Facebook’s Zstandard (zstd). Among them, Zstandard achieves the best compression ratio if a suitable dictionary is trained from samples of the business data. However, maintaining a dictionary per business is costly, and this is exactly where the unified access layer of a large network platform can play a major role.

For example, we can sample 1% of the request data to train the dictionary. The distribution and update of the dictionary is taken care of by the unified access layer, and the business does not need to be concerned.
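
If the access layer or server accepts compressed request bodies — an assumption here — the client side can be handled entirely inside the network library. The sketch below follows OkHttp's well-known interceptor recipe, using plain gzip rather than a dictionary-based algorithm for simplicity:

```java
import java.io.IOException;
import okhttp3.Interceptor;
import okhttp3.MediaType;
import okhttp3.Request;
import okhttp3.RequestBody;
import okhttp3.Response;
import okio.BufferedSink;
import okio.GzipSink;
import okio.Okio;

// Compresses outgoing request bodies with gzip before they hit the wire.
class GzipRequestInterceptor implements Interceptor {
    @Override
    public Response intercept(Chain chain) throws IOException {
        Request original = chain.request();
        if (original.body() == null || original.header("Content-Encoding") != null) {
            return chain.proceed(original); // nothing to compress, or already encoded
        }
        Request compressed = original.newBuilder()
                .header("Content-Encoding", "gzip")
                .method(original.method(), gzip(original.body()))
                .build();
        return chain.proceed(compressed);
    }

    private RequestBody gzip(final RequestBody body) {
        return new RequestBody() {
            @Override public MediaType contentType() { return body.contentType(); }
            @Override public long contentLength() { return -1; } // length unknown after compression
            @Override public void writeTo(BufferedSink sink) throws IOException {
                BufferedSink gzipSink = Okio.buffer(new GzipSink(sink));
                body.writeTo(gzipSink);
                gzipSink.close(); // flushes the gzip trailer
            }
        };
    }
}
```

It is registered with `new OkHttpClient.Builder().addInterceptor(new GzipRequestInterceptor()).build()`; swapping in Brotli or Zstandard would only change the gzip() helper, provided the access layer understands the corresponding Content-Encoding.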

Of course, we have other compression methods for specific data. For images, for example, we can use formats with higher compression ratios such as WebP, HEVC, and SharpP. AI-based image super-resolution is also a powerful tool; Qzone (QQ Space) has saved a lot of bandwidth costs with this technology.

Security

Data security is also a vital aspect of the network. In a large network platform, we use HTTPS-based HTTP/2 channels, which already have TLS encryption. If you are not familiar with the basics of TLS, you can refer to a blog post written by a colleague of WeChat’s backend team: TLS Protocol Analysis.

However, HTTPS also comes at a cost. It adds up to 2 RTTs of negotiation, which is hard to accept in weak network conditions, and decryption is expensive on the backend, so large companies often need a dedicated cluster just for it.

There are several ways to optimize HTTPS:

  • Connection reuse: Increase connection reuse rate through methods such as multiple domains sharing the same HTTP/2 connection and using long connections.

  • Reduce handshake count: TLS 1.3 can achieve 0-RTT negotiation. In fact, even before TLS 1.3 was released, WeChat’s mmtls, Facebook’s Fizz, and Ali’s SlightSSL had already been deployed at scale inside those companies.

  • Performance improvement: Use ECC certificates instead of RSA. Server-side signing becomes 4 to 10 times faster, but client-side verification becomes roughly 20 times slower, dropping from the 10-microsecond level to the 100-microsecond level. In addition, Session Tickets can be used for session resumption, saving RTT time.

After switching to HTTPS, can we finally relax about the whole channel? Not quite: if the user installs a proxy and trusts its certificate, the TLS traffic can still be decrypted and exploited. To defend against this, the client can use certificate pinning to lock the expected certificate. For compatibility with older versions and flexibility when certificates are replaced, it is recommended to pin the root certificate.
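
With OkHttp, certificate pinning is just a few lines of configuration. In the sketch below, the domain is hypothetical and the pin is a placeholder that must be replaced with the SHA-256 hash of your own (root) certificate's public key:

```java
import okhttp3.CertificatePinner;
import okhttp3.OkHttpClient;

public class PinningDemo {
    public static void main(String[] args) {
        // Placeholder pin: replace with the real "sha256/..." hash of the pinned
        // certificate's public key. Connections to the host will fail unless the
        // presented certificate chain contains a matching key.
        CertificatePinner pinner = new CertificatePinner.Builder()
                .add("api.example.com", "sha256/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=")
                .build();

        OkHttpClient client = new OkHttpClient.Builder()
                .certificatePinner(pinner)
                .build();
    }
}
```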

We can also perform secondary encryption on the transmitted content. This is implemented in the unified access layer, and the business server does not need to worry about this process. It should be noted that secondary encryption will increase the processing time for both the client and the server, so we need to strike a balance between security and performance.

4. Other optimizations

There are many means of network optimization, some of which may require significant investment, such as deploying international dedicated lines, acceleration points, and accessing multiple IDCs nearby.

In addition, using CDN services and P2P technology are also commonly used methods, especially in live streaming scenarios. Overall, network optimization requires considering multiple factors, including user experience, bandwidth costs, and hardware costs.

Here is a panoramic image of a high-quality network for you.

QUIC and IPv6 #

Today, we have discussed many topics, and some of you may be interested in some of the cutting-edge technologies. Let me briefly talk about QUIC and IPv6.

1. QUIC

Google first implemented the QUIC protocol in 2013, and in 2018 HTTP over QUIC was adopted as HTTP/3. As I mentioned in the discussion of connection sharing, HTTP/2 over TCP suffers from head-of-line blocking; QUIC, which runs over UDP, solves this at the root.

As shown in the following diagram, you can think of QUIC as HTTP/2.0 + TLS 1.3 + UDP.

In fact, QUIC has many other advantages:

  • Flexible Congestion Control: Upgrading congestion control or other modules inside TCP requires operating-system support and usually takes a long time to roll out. Because QUIC sits on top of UDP in user space, we can change these anytime, for example plugging in Google’s BBR algorithm directly.

  • True Connection Sharing: Not only does it solve the problem of head-of-line blocking, it also avoids the need to reconnect when the client changes networks, resulting in a smoother user experience with apps.

Since QUIC is so good, why haven’t we switched to QUIC entirely in production environments? That’s because there are still many issues to be resolved. Currently, the main problems that have been identified include:

  • Connection establishment success rate: This is mainly due to the penetration problem of UDP. NAT routers, switches, firewalls, and other devices in local networks may block UDP port 443. Currently, the success rate of establishing a QUIC connection in China is about 95%.

  • Support from network operators: Operators’ support for UDP traffic is weak and unstable. There may be QoS throttling and packet loss, and some small operators may not forward UDP packets at all.

Despite these issues, QUIC is undoubtedly the future. And with the unified access layer of a large network platform, business code needs only minimal changes to adopt it. As far as I know, Tencent, Weibo, and Alibaba are gradually ramping up QUIC traffic on their internal networks. For more details, you can refer to the links I provided.
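
For teams that adopt QUIC through the Chromium/Cronet stack on Android, enabling it is mostly engine configuration. A minimal sketch (the domain is hypothetical) might look like this:

```java
import android.content.Context;
import org.chromium.net.CronetEngine;

// Builds a Cronet engine with QUIC enabled and hints that our API domain speaks QUIC
// on UDP port 443, so the first request can try QUIC without waiting for Alt-Svc.
public class QuicEngineFactory {
    public static CronetEngine create(Context context) {
        return new CronetEngine.Builder(context)
                .enableQuic(true)
                .enableHttp2(true)                      // keep H2 as a fallback transport
                .addQuicHint("api.example.com", 443, 443)
                .build();
    }
}
```

Requests are then issued through engine.newUrlRequestBuilder(...), or the engine can be installed as the transport behind HttpURLConnection via engine.openConnection(url).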

2. IPv6

Anyone who manages a network knows how scarce IP addresses are, yet IPv6, which was designed to solve exactly this problem, has seen very little adoption in China. According to the 2017 IPv6 Support Report, only 0.38% of users in China used IPv6.

IPv6 matters not just for IoT but for the whole era of ubiquitous connected devices, and it also benefits network performance. In our tests in India, IPv6 reduced connection time by 10% to 20% compared with IPv4. Once IPv6 is in place, the practically unlimited address space puts an end to all kinds of NAT problems, and connectivity for P2P, QUIC, and the like is no longer an issue.

In the past year, both Alibaba Cloud and Tencent Cloud have done a lot of work on IPv6. Of course, this primarily involved changes at the access layer to avoid excessive modifications to the business services.

Summary #

With mobile technology where it stands today, cross-terminal and cross-stack optimization is becoming more and more common. Sometimes we need to step out of the client-development mindset and look at the whole network platform from a higher vantage point. Of course, network optimization also goes very deep; sometimes we need to study the protocol layer itself and keep an eye on new research results from abroad.

In 2018, with the release of the Ministry of Industry and Information Technology’s “Action Plan to Promote the Large-scale Deployment of Internet Protocol Version 6 (IPv6)”, all cloud providers need to complete support for IPv6 by 2020. QUIC was designated as the HTTP/3 draft in 2018, and 3GPP also included QUIC in the 5G Core Network Protocol Phase 2 standard (3GPP Release 16).

With the future popularization of 5G, QUIC, and IPv6 in China, network optimization will never stop. They will drive us to continue to make more attempts and provide users with a better network experience.

Homework #

Which network library does your application use? What other practical experiences do you have in network optimization? Feel free to leave a message and discuss with me and other classmates.

Network optimization is a huge topic, and you will need to keep studying it after class; in addition to the links in today’s article, I have also prepared some further reference materials for you.

Feel free to click “Share with Friends” to share today’s content with your friends and invite them to learn together. Finally, don’t forget to submit today’s homework in the comments section. I have prepared a generous “study encouragement package” for students who complete the homework seriously. Looking forward to progressing together with you.