04 Network Communication - Which network IO model does the RPC framework lean more towards in network communication #

Hello, I’m He Xiaofeng. In the previous lesson, I explained serialization in the RPC framework. We learned that because data transmitted over the network is binary, objects must be serialized before they can be sent. When choosing a serialization protocol for an RPC framework, we care first about its security, universality, and compatibility, and only then about its performance, efficiency, and space overhead. Continuing from there, in this lesson I will focus on network communication in the RPC framework, an important topic I highlighted in the introduction.

So, what is the role of network communication in RPC invocation?

I mentioned in [Lesson 01] that RPC is a way to solve inter-process communication. A single RPC invocation is essentially an exchange of network messages between the service consumer and the service provider: the consumer sends a request message through network IO; the provider receives and parses it, runs the relevant business logic, and sends back a response message; the consumer then receives and parses the response and handles the response logic, completing one RPC invocation. It is fair to say that network communication is the foundation of the entire RPC invocation process.

Common network IO models #

When it comes to network communication, we have to mention network IO models. Why? Because the so-called network communication between two machines is, at bottom, network IO operations between those two machines.

The common network IO models can be divided into four types: synchronous blocking IO (BIO), synchronous non-blocking IO (NIO), IO multiplexing, and asynchronous non-blocking IO (AIO). Among these four IO models, only AIO is asynchronous, while the others are synchronous.

Among them, the most commonly used are synchronous blocking IO and IO multiplexing; once you understand how these two work, the overall picture becomes clear. The other two models are less common and will not be the focus of this lecture, but we can discuss them in the comments if you are interested.

Blocking IO #

Synchronous blocking IO is the simplest and most common IO model. By default, all sockets in Linux are blocking. Let’s take a look at the operation process.

First, when the application process initiates an IO system call, it is blocked and control passes to kernel space. The kernel then waits for data to arrive; once the data is received, the kernel copies it from kernel space to user space, and the system call returns. Finally, the application process is unblocked and can resume its business logic.

Here we can see that the system kernel processes the IO operations in two stages: waiting for data and copying data. During these two stages, the IO operation thread in the application process will always be in a blocked state. If you are using Java multithreading for development, each IO operation will occupy a thread until the IO operation is completed.
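To make the two blocking stages concrete, here is a minimal sketch using the JDK's classic blocking `Socket`/`ServerSocket` API (class and variable names are illustrative, not from any RPC framework): both `accept()` and `readLine()` park the calling thread until the kernel has finished waiting for data and copying it into user space.

```java
import java.io.*;
import java.net.*;

public class BlockingEchoDemo {
    public static void main(String[] args) throws Exception {
        // Bind to an ephemeral port on localhost.
        ServerSocket server = new ServerSocket(0);
        int port = server.getLocalPort();

        // Server thread: accept() and readLine() both block this thread
        // until the kernel has a connection/data ready and has copied it
        // into user space -- one thread is tied up per connection.
        Thread serverThread = new Thread(() -> {
            try (Socket client = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()));
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                String line = in.readLine();   // blocks: wait for data + copy to user space
                out.println("echo: " + line);
            } catch (IOException e) {
                e.printStackTrace();
            }
        });
        serverThread.start();

        // Client side is blocking as well.
        try (Socket socket = new Socket("127.0.0.1", port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println("hello");
            System.out.println(in.readLine()); // prints "echo: hello"
        }
        serverThread.join();
        server.close();
    }
}
```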

This process is similar to going to a restaurant for a meal. When we arrive at the restaurant and place an order with the waiter, we have to wait in the restaurant until the kitchen prepares the dishes. Then the waiter will serve the dishes to us, and we can enjoy the meal.

IO multiplexing #

IO multiplexing is the most widely used IO model in high-concurrency scenarios, such as Java’s NIO, Redis, and Nginx’s underlying implementations. The classic Reactor pattern is also based on this IO model.

So what is IO multiplexing? Taken literally, the “multi” refers to multiple channels, that is, the IO operations of multiple network connections, and the “plexing” means that these channels are multiplexed onto a single multiplexer.

The IO operations of multiple network connections can be registered with one multiplexer (select). When the user process calls select, the whole process blocks while the kernel “monitors” all the sockets that select is responsible for. As soon as the data in any of those sockets is ready, select returns, and the user process then calls read to copy the data from the kernel into user space.

Notice that the user process blocks on the select call until one of the monitored sockets has data ready, and only then issues a read call. The whole flow is more complex than blocking IO and may even look less efficient. Its great advantage, however, is that a single thread can handle the IO requests of many sockets: the user registers multiple sockets and repeatedly calls select to read whichever sockets have become active, thereby handling multiple IO requests in the same thread. Under the synchronous blocking model, this can only be achieved with multiple threads.
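The select-style flow described above maps directly onto the JDK's `java.nio` `Selector` (which is backed by epoll on Linux). The sketch below is illustrative: a single thread registers the listening socket and every accepted connection with one `Selector`, then blocks in `select()` and services only the channels that are ready.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

public class SelectDemo {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();

        // Register the listening socket with the multiplexer.
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(0));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
        int port = ((InetSocketAddress) server.getLocalAddress()).getPort();

        // A client sends one message from another thread.
        new Thread(() -> {
            try (SocketChannel c = SocketChannel.open(
                    new InetSocketAddress("127.0.0.1", port))) {
                c.write(ByteBuffer.wrap("ping".getBytes(StandardCharsets.UTF_8)));
            } catch (IOException ignored) { }
        }).start();

        // One thread services every socket: select() blocks until some
        // registered channel is ready, just like the select() syscall.
        boolean done = false;
        while (!done) {
            selector.select();
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel conn = server.accept();
                    if (conn != null) {
                        conn.configureBlocking(false);
                        conn.register(selector, SelectionKey.OP_READ);
                    }
                } else if (key.isReadable()) {
                    // Data is ready; read copies it from kernel to user space.
                    ByteBuffer buf = ByteBuffer.allocate(64);
                    SocketChannel conn = (SocketChannel) key.channel();
                    int n = conn.read(buf);
                    if (n > 0) {
                        System.out.println(new String(buf.array(), 0, n,
                                StandardCharsets.UTF_8)); // prints "ping"
                        done = true;
                    }
                    conn.close();
                }
            }
        }
        server.close();
        selector.close();
    }
}
```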

It’s like going to a restaurant to eat. This time we went together. We specifically left someone at the restaurant to wait in line, while the others went shopping. When the person waiting in line notifies us that we can eat, we can directly enjoy the meal.

Why are blocking I/O and I/O multiplexing the most commonly used? #

After understanding the mechanisms of both, we can return to the initial question: why do I say that blocking I/O and I/O multiplexing are the most commonly used? Let us compare the four network I/O models: blocking I/O, non-blocking I/O, I/O multiplexing, and asynchronous I/O. In real network I/O applications, what matters is support from both the system kernel and the programming language.

In terms of system kernel support, most modern systems support blocking I/O, non-blocking I/O, and I/O multiplexing. However, signal-driven I/O (a fifth model, not covered in this lecture) and asynchronous I/O are well supported only by newer versions of the Linux kernel.

In terms of programming languages, whether it is C++ or Java, in the development of high-performance network programming frameworks, most of them are based on the Reactor pattern, with Java’s Netty framework being the most typical. The Reactor pattern is based on I/O multiplexing. Of course, in non-high-concurrency scenarios, synchronous blocking I/O is the most common.

To sum it up, in these four commonly used I/O models, the most widely used and well-supported by the system kernel and programming language are blocking I/O and I/O multiplexing. These two I/O models can meet the requirements of the majority of network I/O applications.

Which network I/O model is preferred in RPC frameworks for network communication? #

Now that we have discussed these two most commonly used network I/O models, let’s see which scenarios they are more suitable for.

I/O multiplexing is better suited to high-concurrency scenarios, where a limited number of processes or threads can handle a large number of socket I/O requests. It is harder to use directly, however, so high-level languages usually provide good support for it: Java has many open-source frameworks that encapsulate the native API, of which Netty is the most notable and very easy to use, while the Go language itself offers a concise encapsulation of I/O multiplexing.

By comparison, blocking I/O ties up an entire process (thread) for each socket I/O request, but it is simpler to use. In low-concurrency scenarios where the business logic only requires synchronous I/O operations, blocking I/O is sufficient, and because it avoids the select call, its overhead is actually lower than that of I/O multiplexing.

In most cases, RPC calls are made in high-concurrency scenarios. Considering kernel support, programming-language support, and the characteristics of the I/O models themselves, RPC framework implementations choose I/O multiplexing for network communication. When picking a network communication framework in a given language, the best choice is one based on the Reactor pattern, such as Netty in Java (there are many other NIO frameworks for Java, but Netty is currently the most widely used). On Linux, we should enable epoll to improve system performance (epoll cannot be enabled on Windows because that kernel does not support it).

After understanding the above, we can move on to another critical topic: zero-copy, which matters a great deal in practice.

What is zero-copy? #

I mentioned before when talking about blocking I/O that the system kernel processes I/O operations in two stages - waiting for data and copying the data. Waiting for data means that the system kernel waits for the network card to receive data and writes the data into the kernel. Copying the data means that after the system kernel obtains the data, it copies the data into the user process’s space. Here is the specific process:

[Figure: data being copied between the user-space buffer, the kernel buffer, and the network card]

Every write operation by the application process writes the data into the user space buffer, and then the CPU copies the data into the kernel buffer. After that, DMA copies this data to the network card, and finally the network card sends it out. Here we can see that data from one write operation needs to be copied twice before it can be sent out through the network card. Similarly, the read operation of the user process reverses the entire process, and the data also needs to be copied twice before the application program can read the data.

Every complete read or write by the application process must copy data back and forth between user space and kernel space, and each such system call also forces the CPU to switch context between the user process and the kernel (and back again). Isn’t this a waste of CPU time and performance? Is there any way to reduce the copying between these two spaces and improve data transmission efficiency?

This is where zero-copy technology comes in.

The so-called zero-copy means eliminating the data copies between user space and kernel space: each read or write by the application process behaves as if it were reading from or writing to kernel space directly, and DMA then moves the data between the kernel and the network card.

So how can we achieve zero-copy? If user space and kernel space could both read and write the same physical location, no copy would be needed at all. Does that remind you of virtual memory?

[Figure: user space and kernel space mapped to the same physical memory through virtual memory]

There are two common ways to achieve zero-copy: mmap+write and sendfile. At its core, the mmap+write approach relies on virtual memory. Neither implementation is difficult, and there are many references available, so I won’t go into detail here; if you have questions, we can resolve them in the comments section.
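For readers who want to see these two approaches from Java, the JDK exposes memory mapping through `FileChannel.map()` and a sendfile-style transfer through `FileChannel.transferTo()` (which the JVM may implement with sendfile on Linux). This is a small sketch using temporary files rather than sockets, purely for illustration:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class ZeroCopyDemo {
    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("zc-src", ".txt");
        Path dst = Files.createTempFile("zc-dst", ".txt");
        Files.write(src, "hello zero copy".getBytes(StandardCharsets.UTF_8));

        // mmap: map the file into virtual memory, so reading it needs no
        // explicit kernel-to-user copy of the file contents.
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ)) {
            MappedByteBuffer mapped =
                    in.map(FileChannel.MapMode.READ_ONLY, 0, in.size());
            byte[] first5 = new byte[5];
            mapped.get(first5);
            System.out.println(new String(first5, StandardCharsets.UTF_8)); // "hello"
        }

        // sendfile-style transfer: transferTo() asks the kernel to move
        // the bytes directly, without surfacing them in user space.
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            long transferred = in.transferTo(0, in.size(), out);
            System.out.println(transferred); // 15 bytes moved
        }
        System.out.println(new String(Files.readAllBytes(dst),
                StandardCharsets.UTF_8)); // "hello zero copy"
    }
}
```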

Zero Copy in Netty #

After understanding zero copy, let’s take a look at zero copy in Netty.

I mentioned earlier that when selecting a network communication framework for an RPC framework, the best choice is a framework implemented based on the Reactor pattern, such as Netty in Java. So, does Netty have zero copy mechanism? What is the difference between zero copy in Netty and the zero copy I mentioned earlier?

The zero copy I mentioned earlier is at the operating system level, mainly aimed at avoiding data copy operations between user space and kernel space, which can improve CPU utilization.

The zero copy in Netty, on the other hand, is different. It is completely in user space, which is the JVM. Its zero copy primarily focuses on optimizing data operations.

So, what is the significance of Netty doing this?

Recall [Lesson 02]. In this lesson, I explained how to design a protocol for an RPC framework. I mentioned that during transmission, RPC does not send all the binary data of the request parameters to the remote machine at once. It may be split into several data packets or combined with data packets of other requests, so there needs to be message boundaries. After one end receives the message, it needs to process the data packets, split and merge them based on the boundaries, and eventually obtain a complete message.

So, is the splitting and merging of data packets done in user space or kernel space after receiving the message?

Of course it is in user space, because the processing of data packets is done by the application program. Can there be data copy operations here? Yes, but they are not copies between user space and kernel space: they are copies within user-space memory itself. Netty’s zero copy exists to solve exactly this problem and optimize data operations inside user space.

So how does Netty optimize data operations?

  • Netty provides the CompositeByteBuf class, which can combine multiple ByteBufs into one logical ByteBuf, avoiding copying between individual ByteBufs.
  • ByteBuf supports slice operations, so it can be divided into multiple ByteBufs that share the same storage area, avoiding memory copying.
  • With the wrap operation, we can wrap byte[] arrays, ByteBufs, ByteBuffers, and other objects into a Netty ByteBuf object, thus avoiding copying operations.

Many internal ChannelHandler implementation classes in Netty use CompositeByteBuf, slice, and wrap operations to handle the problem of packet splitting and merging in TCP transmission.
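Netty itself is a third-party dependency, so as an illustration here is the same idea expressed with the JDK's `java.nio.ByteBuffer`, whose `wrap()` and `slice()` likewise create views that share the backing storage instead of copying it (Netty's `ByteBuf` offers richer variants of these operations, plus `CompositeByteBuf`):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class SliceWrapDemo {
    public static void main(String[] args) {
        byte[] payload = "headerbody".getBytes(StandardCharsets.UTF_8);

        // wrap: view an existing array as a buffer -- no copy is made.
        ByteBuffer whole = ByteBuffer.wrap(payload);

        // slice: carve out a sub-view ("body") that shares the same storage.
        whole.position(6);
        ByteBuffer body = whole.slice();

        // Mutating the slice is visible through the original array,
        // proving the bytes were never copied.
        body.put(0, (byte) 'B');
        System.out.println(new String(payload, StandardCharsets.UTF_8)); // "headerBody"
    }
}
```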

Does Netty have a solution to the data copy problem between user space and kernel space?

Netty’s ByteBuf can use direct buffers, reading from and writing to sockets via off-heap memory. The final effect is similar to the effect achieved by the virtual-memory approach I mentioned earlier.
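As a small illustration of the off-heap buffers on which Netty's direct `ByteBuf`s are built, the JDK's `ByteBuffer.allocateDirect()` allocates memory outside the JVM heap, so a channel can fill it without first staging the bytes in a heap array (shown here with a file channel rather than a socket):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;

public class DirectBufferDemo {
    public static void main(String[] args) throws Exception {
        Path f = Files.createTempFile("direct", ".bin");
        Files.write(f, "off-heap".getBytes(StandardCharsets.UTF_8));

        // A direct buffer lives outside the JVM heap; channel reads can
        // fill it without an extra copy into a heap byte[] first.
        ByteBuffer direct = ByteBuffer.allocateDirect(64);
        try (FileChannel ch = FileChannel.open(f, StandardOpenOption.READ)) {
            int n = ch.read(direct);
            direct.flip();
            byte[] out = new byte[n];
            direct.get(out);
            System.out.println(new String(out, StandardCharsets.UTF_8)); // "off-heap"
        }
    }
}
```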

Netty also provides FileRegion to wrap NIO’s FileChannel.transferTo() method, which achieves zero copy. This is also similar to the sendfile method in Linux in terms of principle.

Summary #

Today we have provided a detailed introduction to blocking IO and IO multiplexing, expanded on the knowledge of zero-copy, and discussed zero-copy in the Netty framework.

Considering the support of the system kernel, the support of programming languages, and the characteristics of the IO model itself, when it comes to handling network communication in RPC frameworks, we are more inclined to choose the IO multiplexing approach.

The benefit of zero-copy is to avoid unnecessary CPU copying, allowing the CPU to be freed up to do other things. It also reduces the context switching between the user space and kernel space, thereby improving network communication efficiency and the overall performance of the application.

However, there are some differences between Netty’s zero-copy and the zero-copy of the operating system. Netty’s zero-copy focuses on optimizing data operations in the user space, which is of great significance for handling packet splitting and merging issues in TCP transmission. It is also important for handling request and response data in application programs.

In the development and use of RPC frameworks, we need to have a deep understanding of the principles of network communication and strive to achieve zero-copy, such as using the Netty framework. We should also use ByteBuf subclasses reasonably to achieve complete zero-copy and improve the overall performance of the RPC framework.

Reflection after class #

Think back to the open source middleware frameworks you have encountered. Which frameworks have achieved zero-copy on network communication? And what methods were used to achieve zero-copy?

Feel free to leave a comment and share your thoughts and questions with me. You are also welcome to share this article with your friends and invite them to join the learning experience. See you in the next class!