
23 RPC Framework Under 100,000 QPS - How to Achieve Millisecond-Level Service Invocation #

Hello, I’m Tang Yang.

In lectures 21 and 22, your team decided to split the vertical e-commerce system into microservices to address its poor scalability and high development costs. However, you have since discovered that splitting into services introduces new problems. As mentioned in the previous lecture, these can be summarized into two main points:

  1. Communication across networks introduced by separately deployed services.
  2. Governance of services after splitting into smaller microservices.

To solve these two problems, it is necessary to understand the basic principles and techniques of middleware required for microservices. In this lecture, I will guide you to master the core component for solving the first problem: the RPC framework.

Let’s consider the following scenario: the QPS of your vertical e-commerce system has reached 20,000. After splitting into microservices, if completing a single request requires calling 4 to 5 services, the RPC services behind it need to handle roughly 100,000 requests per second (20,000 × 5). So, how do you design an RPC framework that can sustain such a request volume? Here’s what you need to do:

  1. Choose the appropriate network model and adjust network parameters accordingly to optimize network transmission performance.
  2. Choose the appropriate serialization method to improve the performance of packaging and unpackaging.

Next, I will start from the principles to help you develop a rational understanding of RPC. This way, you will have a clear understanding of your design goals when designing an RPC framework.

RPC as You Know It #

When it comes to RPC (Remote Procedure Call), you are no stranger to it. It refers to the technology of calling services deployed on another computer through the network.

An RPC framework encapsulates the details of network calls, allowing you to call remotely deployed services just like calling local services. You may think that only emerging frameworks such as Dubbo, gRPC, and Thrift are considered RPC frameworks. In fact, strictly speaking, you have been exposed to RPC-related technologies a long time ago.

For example, Java has a native remote invocation framework called RMI (Remote Method Invocation), which allows Java programs to call methods of Java objects on another machine through the network. It is a method of remote invocation and the foundation of the well-known EJB during the J2EE era.

To this day, you can still use Spring’s “RmiServiceExporter” to expose a Spring-managed bean as an RMI service, thus continuing to use RMI for inter-process method calls. The reason RMI did not become as popular as Dubbo or gRPC is that it has some shortcomings:

  • RMI communicates over JRMP (Java Remote Method Protocol), a protocol designed specifically for Java remote objects, so it can only be used between Java programs and cannot make cross-language calls.
  • RMI uses Java’s native object serialization, which produces large byte streams and performs poorly.
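
For reference, here is a minimal sketch of exposing a bean through Spring’s RmiServiceExporter. Note that this class was deprecated in Spring Framework 5.3 and removed in Spring 6, so it only applies to older Spring versions, and the GoodsService interface here is a hypothetical placeholder:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.remoting.rmi.RmiServiceExporter;

// Hypothetical remote interface, used only for illustration.
interface GoodsService {
    String getGoodsName(long goodsId);
}

@Configuration
class RmiConfig {

    @Bean
    public RmiServiceExporter goodsServiceExporter(GoodsService goodsService) {
        RmiServiceExporter exporter = new RmiServiceExporter();
        exporter.setServiceName("GoodsService");          // clients look the service up by this name
        exporter.setService(goodsService);                // the Spring-managed bean to expose
        exporter.setServiceInterface(GoodsService.class); // remote interface seen by callers
        exporter.setRegistryPort(1099);                   // default RMI registry port
        return exporter;
    }
}
```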

Another technology you might have heard of is Web Service, which can also be considered an implementation of RPC. Its advantage is that it communicates using SOAP over HTTP, so calls are cross-language and cross-platform: anything that supports the HTTP protocol and can parse XML can use it. Its drawback, in my opinion, is that encapsulating data in XML makes the packets large and the performance relatively poor.

Through the examples above, what I mainly want to tell you is that RPC is not a product of the Internet age or a technology that emerged after service-oriented architecture. It is a specification, and any technology that encapsulates the details of network calls and enables remote invocation of other services can be considered an RPC technology.

So, what changes will occur for your vertical e-commerce project after using an RPC framework?

From my perspective, the most significant change is in performance. Let me give you an example. Suppose your e-commerce system’s product detail page requires product data, comment data, and store data. In a monolithic architecture, you only need to query the product database, comment database, and store database; ignoring caching, that is three network requests.

However, if you separate the product service, comment service, and store service, you need to call each of them separately, and each of these services will call their respective databases. This results in six network requests. If you further decompose the services into finer-grained components, the number of additional network calls will increase, resulting in longer request delays. This is the cost you pay in terms of performance to improve system scalability.

[Figure: after the split, one product detail request fans out into three service calls plus three database calls, six network requests in total]

Now, if we want to optimize RPC performance and minimize the impact of these extra network calls, we need to understand what happens during a single RPC invocation. Only then can we propose optimizations for the potential bottlenecks in each step. The steps are as follows:

  1. The client first serializes the invocation information, such as the class name, method name, parameter names, and parameter values, into binary form.
  2. The client sends the binary data to the server through the network.
  3. Upon receiving the binary data, the server deserializes it to obtain the class name, method name, parameter names, and parameter values to be invoked. It then dynamically invokes the corresponding method using proxy objects to obtain the return value.
  4. The server serializes the return value and sends it back to the client through the network.
  5. The client deserializes the result and obtains the invocation result.

The process is shown in the following diagram:

[Figure: the five steps of a single RPC invocation]

From this diagram, you can see that an RPC call involves network transmission as well as request serialization and deserialization. Therefore, to improve the performance of an RPC framework, we need to optimize two aspects: network transmission and serialization.
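
Before we look at those two aspects, here is a minimal, hypothetical sketch of the server-side dispatch in step 3. The lecture describes dynamic invocation via proxy objects; this sketch uses plain reflection for brevity, and all names are illustrative rather than taken from any particular framework:

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RpcDispatcher {
    // Maps a service (class) name to the instance registered at startup.
    private final Map<String, Object> registry = new ConcurrentHashMap<>();

    public void register(String className, Object service) {
        registry.put(className, service);
    }

    // Called after the request bytes have been deserialized.
    public Object dispatch(String className, String methodName,
                           Class<?>[] parameterTypes, Object[] arguments) throws Exception {
        Object service = registry.get(className);
        Method method = service.getClass().getMethod(methodName, parameterTypes);
        return method.invoke(service, arguments); // the return value is then serialized and sent back
    }
}
```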

How to Improve Network Transfer Performance #

In network transfer optimization, the first thing you need to do is choose a high-performance I/O model. An I/O model is simply the way we handle I/O. Generally, a single I/O request is divided into two stages, and each stage can be handled in different ways.

First, there is a stage of waiting for resources. For example, waiting for network data to be available. During this process, we have two ways to handle I/O:

  • Blocking: if the data is not ready, the I/O request blocks until the data becomes available.
  • Non-blocking: if the data is not ready, the I/O request returns immediately, and the caller keeps checking until the resource becomes available.

Next, there is a stage of using resources. For example, receiving data from the network and copying it to the application’s buffer. During this stage, we also have two ways to handle I/O:

  • Synchronous processing: the I/O request blocks while reading or writing data, until the read or write completes.
  • Asynchronous processing: the I/O request returns immediately. The operating system completes the read or write, copies the data into the buffer provided by the user, and then notifies the application that the I/O request is complete.

By combining the two stages with these four processing methods, and making some adjustments, we get the five common I/O models:

  • Synchronous blocking I/O
  • Synchronous non-blocking I/O
  • Synchronous multiplexing I/O
  • Signal-driven I/O
  • Asynchronous I/O

You need to understand the differences and characteristics of these five I/O models, but they can be hard to grasp in the abstract, so let me give you an analogy that makes them easier to picture.

Let’s compare the I/O process to the process of boiling and pouring water, with waiting for resources being boiling water, and using resources being pouring water:

  • If you stand by the stove and wait (waiting for resources) for the water to boil, and then pour the water (using resources), it is synchronous blocking I/O.
  • If you take a bit of a break and watch TV on the couch while the water is boiling (no longer waiting for resources all the time), but you still check from time to time if the water has boiled, and immediately pour the water once it has, it is synchronous non-blocking I/O.
  • If you want to take a shower and need to boil multiple kettles of water at the same time, you check which kettle has boiled during your TV break (waiting for multiple resources), and pour the water from the first kettle that has boiled. This speeds up the boiling process, and this is synchronous multiplexing I/O.
  • But you find that you’re constantly running to the kitchen to check if the water has boiled, and it’s tiring. So you consider adding an alarm (signal) to your kettle, so that you can pour the water as soon as it boils. This is signal-driven I/O.
  • The last one is advanced: you invent a smart kettle that automatically pours the water once it has boiled. This is asynchronous I/O.

Among these five I/O models, the most widely used one is multiplexing I/O. The select and epoll system calls in the Linux system support the multiplexing I/O model, and the high-performance network framework Netty in Java also defaults to using this model. So, we can choose it.
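
To make the multiplexing model concrete, here is a minimal sketch of an echo server using Java NIO’s Selector (backed by epoll on Linux): a single thread watches many connections at once, much like checking several kettles from the couch. The port and buffer size are arbitrary:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class MultiplexingEchoServer {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080)); // port is arbitrary
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select(); // block until at least one channel is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {
                    // A new connection: register it for read events.
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    // Data is ready: read it and echo it back.
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buffer = ByteBuffer.allocate(1024);
                    if (client.read(buffer) == -1) {
                        client.close();
                        continue;
                    }
                    buffer.flip();
                    client.write(buffer);
                }
            }
        }
    }
}
```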

So, does choosing a high-performance I/O model enable efficient data transmission over the network? The truth is, it’s not that simple. Network performance tuning involves many aspects, and one important aspect not to be ignored is the tuning of network parameters. Next, I’ll introduce you to a typical example. Of course, you can combine your knowledge of networking and the source code of mature RPC frameworks (such as Dubbo) to gain a deeper understanding of the various aspects of network parameter tuning.

In a previous project, my team developed a simple RPC communication framework. When testing, we found that the average response time for remotely calling an empty method was several tens of milliseconds, which clearly did not meet our expectations. We believed that running an empty method should return within 1 millisecond. So, during testing, I captured packets using tcpdump and found that it took 40ms for an Ack packet to return for a single request. After searching online for the reason, I found that it was related to a parameter called tcp_nodelay. What is the purpose of this parameter?

The TCP header is 20 bytes, and the IP header is another 20 bytes. If only 1 byte of payload is transmitted, a total of 20 + 20 + 1 = 41 bytes goes over the network, of which only 1 byte is useful data; this is a great waste of bandwidth. Therefore, in 1984, John Nagle proposed the algorithm that now bears his name, Nagle’s algorithm. It works as follows:

If there are consecutive small packets, each smaller than the Maximum Segment Size (MSS), and the ACK for previously sent data has not yet been received, these small packets are buffered at the sender until they accumulate to a full MSS or an ACK arrives.

This was originally intended to reduce unnecessary network transmissions. However, if the receiver has enabled Delayed ACK (delaying the ACK so that several ACKs can be combined, which improves transmission efficiency), the following can happen: the sender transmits the first packet, and before its ACK has returned, tries to send a second one. Because of Nagle’s algorithm, the second packet is held back until the ACK for the first arrives. The Delayed ACK timeout is typically 40ms, so only after 40ms does the receiver send the ACK, and only then can the sender transmit the second packet. This is where the extra latency comes from.

The solution is very simple: enable tcp_nodelay on the socket. This parameter disables Nagle’s algorithm, so the sender no longer waits for the previous packet’s ACK before sending new data. It is well suited to highly interactive network scenarios; basically, if you are implementing your own network framework, you should enable tcp_nodelay.
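
In Java this is a one-line socket option. Here is a minimal sketch, with a placeholder host and port; the Netty equivalent is shown in a comment:

```java
import java.io.IOException;
import java.net.Socket;

public class NoDelayExample {
    public static void main(String[] args) throws IOException {
        // Host and port are placeholders for your own service.
        try (Socket socket = new Socket("rpc-server.example", 8080)) {
            socket.setTcpNoDelay(true); // disables Nagle's algorithm on this connection
            socket.getOutputStream().write(new byte[]{1}); // small writes now go out immediately
        }
        // The Netty equivalent on a server bootstrap:
        //   serverBootstrap.childOption(ChannelOption.TCP_NODELAY, true);
    }
}
```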

Choosing the Right Serialization Method #

After optimizing network data transmission, another important aspect to consider is data serialization and deserialization. Serialization is the process of converting a transfer object into a binary byte sequence, and deserialization is the reverse process of converting that binary byte sequence back into an object.

As you can see from the above RPC call process, each RPC call involves two data serialization processes and two data deserialization processes. This indicates that they have a significant impact on the performance of RPC. So, what factors should be considered when choosing a serialization method?

Firstly, performance is undoubtedly an important consideration. Performance includes both time overhead and space overhead. Time overhead is the speed of serialization and deserialization, which is obviously a key concern. Space overhead is the size of the serialized output: a large payload consumes more network bandwidth and hurts transmission efficiency.

Apart from performance, we also need to consider whether the serialization method can be used across different languages and platforms. This is also very important because most companies have diverse technology ecosystems and use multiple programming languages. If the data transmitted by your RPC framework can only be interpreted by one language, it will limit the usability of the framework.

Moreover, extensibility is another significant aspect to consider. Imagine a scenario where adding a field to an object leads to an incompatible transmission protocol, resulting in failed service calls. This would be a disastrous situation.

Based on the above considerations, in my opinion, the main serialization options we have are as follows:

Firstly, there is JSON, which is well-known and widely used. It originated from JavaScript and is a simple and user-friendly serialization protocol. It is human-readable and has better performance compared to XML.
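
As a quick illustration, here is what JSON serialization and deserialization look like with the Jackson library (one common choice among many JSON libraries); the Goods class is a hypothetical transfer object:

```java
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonExample {
    // Hypothetical transfer object; public fields keep the example short.
    public static class Goods {
        public long id;
        public String name;
    }

    public static void main(String[] args) throws Exception {
        Goods goods = new Goods();
        goods.id = 1001L;
        goods.name = "keyboard";

        ObjectMapper mapper = new ObjectMapper();
        String json = mapper.writeValueAsString(goods);     // serialize: object -> JSON text
        Goods parsed = mapper.readValue(json, Goods.class); // deserialize: JSON text -> object
        System.out.println(json + " -> " + parsed.name);
    }
}
```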

Thrift and Protobuf, on the other hand, require the use of an Interface Definition Language (IDL). This involves writing an IDL file according to a specific syntax and then using a dedicated compiler to convert it into code corresponding to different programming languages, thereby achieving cross-language compatibility.

Thrift is a high-performance serialization protocol developed by Facebook and is also a lightweight RPC framework. Protobuf is a serialization protocol open-sourced by Google. They share the common characteristic of high performance in terms of both time and space, but the presence of IDL brings some inconvenience in terms of usage.
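
To give you a feel for IDL, here is a hypothetical goods.proto in Protobuf syntax. The tag number after each field is also what makes Protobuf extensible: a new field with an unused tag can be added without breaking readers compiled against the old definition:

```protobuf
// Hypothetical goods.proto; the protoc compiler generates matching
// classes for Java, Go, Python, etc. from this single definition.
syntax = "proto3";

message GoodsRequest {
  int64 goods_id = 1;
}

message GoodsResponse {
  string name  = 1;
  int64  price = 2; // adding a new field with a fresh tag number stays backward-compatible
}
```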

So, how do you choose among these serialization protocols? Here are a few suggestions:

If performance requirements are not high and the transmitted data does not occupy much bandwidth, you can use JSON as the serialization protocol.

If performance requirements are high, both Thrift and Protobuf are suitable options. Additionally, since Thrift provides a complete RPC framework, if you are looking for an integrated solution, you may give priority to Thrift.

In some storage scenarios, such as when the data stored in your cache takes up a large amount of space, you can consider using Protobuf instead of JSON as the serialization method for storing the data.

Course Summary #

In order to optimize the performance of an RPC framework, in this lesson, I introduced the selection of network I/O models and serialization methods, which are key elements in implementing a high-concurrency RPC framework. In summary, there are three key points:

  1. Choose a high-performance I/O model. I recommend using the synchronous multiplexing I/O model.

  2. Tune network parameters. There are some recommended starting values based on experience: for example, setting tcp_nodelay to true, and adjusting the sizes of the receive buffer and send buffer, as well as the size of the client connection request backlog queue (see the sketch after this list).

  3. Select a serialization protocol based on specific business needs. If performance requirements are not high, you can choose JSON; otherwise, you can choose either Thrift or Protobuf.
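
As a reference point, here is roughly how the parameters from point 2 map onto a Netty server bootstrap; this is a sketch with illustrative values, not recommended settings:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelOption;

public class TuningExample {
    public static void main(String[] args) {
        // Illustrative values only; tune them against your own workload and hardware.
        ServerBootstrap b = new ServerBootstrap();
        b.option(ChannelOption.SO_BACKLOG, 1024);          // pending-connection (backlog) queue size
        b.childOption(ChannelOption.TCP_NODELAY, true);    // disable Nagle's algorithm
        b.childOption(ChannelOption.SO_RCVBUF, 64 * 1024); // per-connection receive buffer
        b.childOption(ChannelOption.SO_SNDBUF, 64 * 1024); // per-connection send buffer
        // group(...), channel(...), childHandler(...) and bind(...) omitted for brevity.
    }
}
```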

While studying this lesson, I suggest that you take a look at the source code of mature RPC frameworks. For example, the open-source Dubbo by Alibaba, Motan by Weibo, and so on. By understanding their implementation principles and details, you will be more confident in maintaining your microservice system. At the same time, you can also learn coding techniques from excellent code, such as Dubbo’s abstraction of RPC and the design of SPI extension points. This will help improve your coding abilities.

Of course, in this lesson, I not only wanted to introduce some principles of RPC framework implementation, but also wanted to help you understand the key points to consider when doing network programming. This way, when designing systems of this type, you will have some directions and ideas to consider.