16 I/O Acceleration and Netty's Distinctive Zero-Copy Technology #

In today's lesson, we will continue with another advanced feature Netty uses to achieve high performance: zero copy. Zero copy is a well-known technique, commonly used in popular products such as Linux, Kafka, and RocketMQ to improve I/O performance, and it is also a frequent interview question. Do you know where zero copy is applied, and how Netty implements it? Let's analyze the details of Netty's zero-copy feature.

Traditional Zero Copy Technology in Linux #

Before introducing the zero-copy feature of Netty, it is necessary to understand how traditional zero copy works in Linux. The idea of zero copy is to operate on data without copying it from one memory area to another. This reduces the overhead of memory copying, saving CPU cycles and memory bandwidth.

Let’s simulate a scenario where data is read from a file and then transferred to the network. What are the stages of the traditional data copying process? The specific process is shown in the following diagram.

Drawing 0.png

From the diagram, we can see that a total of four data copies are involved in the process from data reading to sending:

  1. When the user process initiates the read() call, the context switches from user mode to kernel mode. The DMA engine reads the data from the file into the kernel buffer; this is the first data copy.
  2. The requested data is copied from the kernel buffer to the user buffer and then returned to the user process. This second data copy is accompanied by a context switch from kernel mode back to user mode.
  3. When the user process calls send() to send the data to the network, a third context switch occurs, and the requested data is copied from the user buffer to the socket buffer.
  4. Finally, when the send() system call returns to the user process, a fourth context switch occurs. The fourth copy, from the socket buffer to the protocol engine, is performed asynchronously.

Note: DMA (Direct Memory Access) is a feature supported by most modern hard drives. It takes over the work of data reading and writing, relieving the CPU from handling I/O interrupts.

Why isn’t the data directly transferred to the user buffer in the traditional data copying process? In fact, introducing the kernel buffer can serve as a cache, allowing pre-reading of file data to improve I/O performance. However, when the requested data size exceeds the kernel buffer size, multiple copies may be required to complete the process from data reading to sending, resulting in significant performance loss.

Next, let’s introduce the process of data transmission after using zero copy technology. Reviewing the traditional data copying process, we can see that the second and third copies can be eliminated. When the DMA engine reads data from the file and puts it into the kernel buffer, it can then be directly transmitted to the socket buffer without additional memory copies, reducing the number of memory copies.

In Linux, the system call sendfile() can transfer data from one file descriptor to another, achieving zero copy technology. In Java, zero copy technology is also used in the NIO FileChannel class’s transferTo() method. The transferTo() method relies on the operating system’s zero copy mechanism to transfer data directly from a FileChannel to another Channel. The definition of the transferTo() method is as follows:

public abstract long transferTo(long position, long count, WritableByteChannel target) throws IOException;

The usage of FileChannel#transferTo() is also very simple. Let’s take a look at the following code example. By using transferTo(), we can copy the contents of from.data into to.data:

public void testTransferTo() throws IOException {
    try (RandomAccessFile fromFile = new RandomAccessFile("from.data", "rw");
         RandomAccessFile toFile = new RandomAccessFile("to.data", "rw")) {
        FileChannel fromChannel = fromFile.getChannel();
        FileChannel toChannel = toFile.getChannel();
        long position = 0;
        long count = fromChannel.size();
        fromChannel.transferTo(position, count, toChannel);
    }
}

After using FileChannel#transferTo() to transfer data, let’s see what changes have occurred in the data copying process, as shown in the following figure:

Drawing 1.png

One major change is that after the DMA engine reads the file data into the kernel buffer, the operating system copies it directly to the socket buffer, without passing through the user-space buffer. The number of data copies is therefore reduced from four to three.

However, there is still room for improvement to achieve zero-copy. After the Linux 2.4 version, developers appended some descriptor information to the socket buffer to further reduce the copying of kernel data. As shown in the figure below, the DMA engine reads the file content and copies it to the kernel buffer, but does not copy it to the socket buffer. Instead, the length and position information of the data are appended to the socket buffer. The DMA engine then reads the data directly from the kernel buffer based on this descriptor information and transfers it to the protocol engine, thus eliminating the last CPU copy.

Drawing 2.png

After learning about the zero-copy technology in Linux, you may still have a question. After using zero-copy, isn’t there still a data copying operation? From the perspective of the Linux operating system, zero-copy is aimed at avoiding data copying between user space and kernel space. Whether it is traditional data copying or zero-copy technology, there are 2 DMA data copying operations that cannot be eliminated. However, these 2 DMA copies are completed by the hardware without the need for the CPU’s involvement. Therefore, the term zero-copy discussed here is a broad concept, and any approach that reduces unnecessary CPU copies can be called zero-copy.

Zero-Copy Technology in Netty #

After discussing traditional zero-copy technology in Linux, let’s now learn about how zero-copy is implemented in Netty. Zero-copy in Netty is somewhat different from traditional zero-copy in Linux. In Netty, zero-copy technology not only encapsulates the functionality at the operating system level, but also focuses more on optimizing data operations at the user level. The main aspects of zero-copy in Netty are as follows:

  • Off-heap memory to avoid data copying between JVM heap memory and off-heap memory.
  • The CompositeByteBuf class, which can combine multiple Buffer objects into one logical object, avoiding the need to copy several Buffer objects into a large Buffer using traditional memory copying.
  • The Unpooled.wrappedBuffer method allows you to wrap a byte array into a ByteBuf object without memory copying.
  • The ByteBuf.slice operation, which is the opposite of the Unpooled.wrappedBuffer. The slice operation can split a ByteBuf object into multiple ByteBuf objects, sharing the same underlying byte array storage without memory copying.
  • Netty uses FileRegion to implement file transfer. FileRegion encapsulates the FileChannel#transferTo() method, transferring file buffer data directly to the target Channel and avoiding the copy between the kernel buffer and the user-space buffer. This is operating system-level zero-copy.

Below, we will introduce these five aspects one by one.

Off-heap Memory #

When performing I/O operations inside the JVM, data must be copied to off-heap memory before the system call can be executed. This is a problem that all VM-based languages encounter. So why can’t the operating system use the JVM heap memory directly for I/O reads and writes? There are two main reasons. First, the operating system is not aware of the JVM heap: the JVM’s memory layout differs from the memory the operating system allocates, and the operating system will not read and write data according to the JVM’s conventions. Second, the memory address of an object may change at any time as JVM GC runs; for example, a compacting GC reduces memory fragmentation by moving objects.

Netty uses off-heap memory for I/O operations, which avoids the need to copy data from JVM heap memory to off-heap memory.
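To make the idea concrete, here is a minimal sketch using plain JDK NIO (not Netty-specific); the buffer size and contents are illustrative. A direct buffer is allocated outside the JVM heap, so the operating system can perform I/O on it without first copying bytes out of the heap:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DirectBufferDemo {
    public static void main(String[] args) {
        // A direct buffer lives outside the JVM heap; the OS can read and
        // write it during I/O without an extra heap-to-native copy.
        ByteBuffer direct = ByteBuffer.allocateDirect(16);
        direct.put("hello".getBytes(StandardCharsets.UTF_8));
        direct.flip(); // switch from writing to reading

        byte[] out = new byte[direct.remaining()];
        direct.get(out);

        System.out.println(direct.isDirect());                       // true
        System.out.println(new String(out, StandardCharsets.UTF_8)); // hello
    }
}
```

Netty's ByteBufAllocator follows the same principle when it allocates direct buffers for channel I/O.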

CompositeByteBuf #

CompositeByteBuf is a very important data structure in Netty for implementing zero-copy. CompositeByteBuf can be understood as a virtual Buffer object composed of multiple ByteBufs: internally it only maintains references to each ByteBuf, yet logically it forms a single whole. For example, HTTP protocol data is commonly divided into a header and a body, stored in two different ByteBufs. Usually, we need to merge the two ByteBufs into one complete piece of protocol data for sending, which can be done as follows:

ByteBuf httpBuf = Unpooled.buffer(header.readableBytes() + body.readableBytes());
httpBuf.writeBytes(header);
httpBuf.writeBytes(body);

As can be seen, to merge the header and body ByteBufs, we first initialize a new httpBuf and then copy the header and body into it separately. The merge involves two CPU copies, which wastes performance. How can we achieve the same result with CompositeByteBuf? It can be done as follows:

CompositeByteBuf httpBuf = Unpooled.compositeBuffer();
httpBuf.addComponents(true, header, body);

CompositeByteBuf adds multiple ByteBufs by calling the addComponents() method, but the underlying byte array is reused and no memory copy occurs. However, for users, it can be operated as a whole. So how does CompositeByteBuf store these ByteBufs and how does it merge them? Let’s first look at the internal structure of CompositeByteBuf through the following diagram:
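The merge can be exercised end to end with a small, self-contained sketch (Netty 4 API; the buffer contents are illustrative). The composite records references to the two components without copying any bytes, yet reads as one contiguous buffer:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;
import java.nio.charset.StandardCharsets;

public class CompositeDemo {
    public static void main(String[] args) {
        ByteBuf header = Unpooled.copiedBuffer("header", StandardCharsets.UTF_8);
        ByteBuf body = Unpooled.copiedBuffer("body", StandardCharsets.UTF_8);

        // No bytes are copied here: the composite only stores references
        // to the two component ByteBufs.
        CompositeByteBuf httpBuf = Unpooled.compositeBuffer();
        httpBuf.addComponents(true, header, body);

        // Logically, the composite behaves like one contiguous buffer.
        System.out.println(httpBuf.readableBytes());                  // 10
        System.out.println(httpBuf.toString(StandardCharsets.UTF_8)); // headerbody

        httpBuf.release(); // releases the components as well
    }
}
```

Passing true as the first argument of addComponents() advances the composite's writerIndex so the added bytes become readable immediately.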

Drawing 3.png

From the diagram, we can see that CompositeByteBuf internally maintains an array of Component objects. Each Component holds a different ByteBuf, and each ByteBuf maintains its own independent read and write indexes, while the CompositeByteBuf also maintains read and write indexes of its own. Component is clearly the key to implementing CompositeByteBuf, so let’s look at its structure definition:

private static final class Component {
    final ByteBuf srcBuf;  // the original ByteBuf
    final ByteBuf buf;     // the unwrapped ByteBuf
    int srcAdjustment;     // offset of CompositeByteBuf's start index relative to srcBuf's read index
    int adjustment;        // offset of CompositeByteBuf's start index relative to buf's read index
    int offset;            // Component's start index position within the CompositeByteBuf
    int endOffset;         // Component's end index position within the CompositeByteBuf
    // other code omitted
}

To make the meaning of the attributes in Component easier to understand, let’s again use the HTTP header and body as an example. The following diagram shows the layout of the Components after the CompositeByteBuf is assembled:

Drawing 4.png

From the diagram, we can see that the header and body correspond to two ByteBufs. Assuming their contents are “header” and “body”, the offset~endOffset of the header ByteBuf is 0~6, and the offset~endOffset of the body ByteBuf is 6~10. The offset and endOffset in each Component thus describe the range of the CompositeByteBuf that the corresponding ByteBuf covers, and together they connect the Components into one logical whole.

In addition, srcAdjustment and adjustment in Component indicate the offset of the CompositeByteBuf’s start index relative to the ByteBuf’s read index; initially, adjustment = readIndex - offset. With this, the CompositeByteBuf’s start index can be mapped directly to the read index position of the underlying ByteBuf through the Component. What happens when 1 byte is read from the header ByteBuf and 2 bytes are read from the body ByteBuf? The properties of each Component change as shown in the following diagram.

Drawing 5.png

So far, we have finished introducing the basic principles of CompositeByteBuf. We will not go into the details of specific operations on CompositeByteBuf data here. Interested students can study the source code of CompositeByteBuf on their own.

Unpooled.wrappedBuffer Operation #

After introducing CompositeByteBuf, the Unpooled.wrappedBuffer operation is much easier to understand. Unpooled.wrappedBuffer is another recommended way to create a CompositeByteBuf object.

Unpooled provides a series of wrappedBuffer methods for wrapping different data sources, as shown below:

Drawing 6.png

The Unpooled.wrappedBuffer method can wrap one or more data sources of different types, such as byte[], ByteBuf, ByteBuffer, into a large ByteBuf object. No data copying occurs during the wrapping process, and the generated ByteBuf object after wrapping shares the underlying byte array with the original ByteBuf object.
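A small sketch (Netty 4 API; the array contents are illustrative) makes the sharing visible: because no copy is made, mutating the original byte array is reflected in the wrapped ByteBuf.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;

public class WrappedBufferDemo {
    public static void main(String[] args) {
        byte[] bytes = {1, 2, 3, 4};

        // Wraps the array without copying it; the ByteBuf and the
        // array share the same underlying storage.
        ByteBuf buf = Unpooled.wrappedBuffer(bytes);
        System.out.println(buf.getByte(0)); // 1

        // Mutating the source array is visible through the ByteBuf.
        bytes[0] = 42;
        System.out.println(buf.getByte(0)); // 42

        buf.release();
    }
}
```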

ByteBuf.slice operation #

The logic of ByteBuf.slice is exactly the opposite of Unpooled.wrappedBuffer. ByteBuf.slice divides a ByteBuf object into multiple ByteBuf objects that share the same underlying storage.

ByteBuf provides two slice methods:

public ByteBuf slice();
public ByteBuf slice(int index, int length);

Suppose we already have a complete set of HTTP data. We can use the slice method to obtain two ByteBuf objects: header and body. The corresponding contents are “header” and “body”. The implementation is as follows:

ByteBuf httpBuf = ...
ByteBuf header = httpBuf.slice(0, 6);
ByteBuf body = httpBuf.slice(6, 4);

After slicing, a new ByteBuf object will be returned, and the new object will have its own independent readerIndex and writerIndex. Because the new ByteBuf object shares data with the original ByteBuf object, any data operations performed on the new ByteBuf object will also take effect on the original ByteBuf object.
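The shared storage can be demonstrated with a short sketch (Netty 4 API; the buffer content is illustrative): writing through a slice is immediately visible in the original buffer.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import java.nio.charset.StandardCharsets;

public class SliceDemo {
    public static void main(String[] args) {
        ByteBuf httpBuf = Unpooled.copiedBuffer("headerbody", StandardCharsets.UTF_8);

        // Slices share the underlying storage with httpBuf; no bytes are copied.
        ByteBuf header = httpBuf.slice(0, 6);
        ByteBuf body = httpBuf.slice(6, 4);

        System.out.println(header.toString(StandardCharsets.UTF_8)); // header
        System.out.println(body.toString(StandardCharsets.UTF_8));   // body

        // A write through the slice shows up in the original buffer.
        header.setByte(0, 'H');
        System.out.println(httpBuf.toString(StandardCharsets.UTF_8)); // Headerbody

        httpBuf.release();
    }
}
```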

图片8.png

File transfer with FileRegion #

In the example package of Netty’s source code, there is an example of using FileRegion. The following code snippet is taken from FileServerHandler.java.

@Override
public void channelRead0(ChannelHandlerContext ctx, String msg) throws Exception {
    RandomAccessFile raf = null;
    long length = -1;
    try {
        raf = new RandomAccessFile(msg, "r");
        length = raf.length();
    } catch (Exception e) {
        ctx.writeAndFlush("ERR: " + e.getClass().getSimpleName() + ": " + e.getMessage() + '\n');
        return;
    } finally {
        if (length < 0 && raf != null) {
            raf.close();
        }
    }
    ctx.write("OK: " + raf.length() + '\n');
    if (ctx.pipeline().get(SslHandler.class) == null) {
        // SSL not enabled - can use zero-copy file transfer.
        ctx.write(new DefaultFileRegion(raf.getChannel(), 0, length));
    } else {
        // SSL enabled - cannot use zero-copy file transfer.
        ctx.write(new ChunkedFile(raf));
    }
    ctx.writeAndFlush("\n");
}

From the usage example of FileRegion, we can see that Netty uses FileRegion to achieve zero-copy file transfer. The default implementation class of FileRegion is DefaultFileRegion, which writes the file content to a NioSocketChannel. So how does FileRegion achieve zero-copy? Let’s take a look at the source code of DefaultFileRegion.

public class DefaultFileRegion extends AbstractReferenceCounted implements FileRegion {
    private final File f;        // file to transfer
    private final long position; // start position within the file
    private final long count;    // number of bytes to transfer
    private long transferred;    // number of bytes already transferred
    private FileChannel file;    // FileChannel corresponding to the file

    @Override
    public long transferTo(WritableByteChannel target, long position) throws IOException {
        long count = this.count - position;
        if (count < 0 || position < 0) {
            throw new IllegalArgumentException(
                    "position out of range: " + position +
                    " (expected: 0 - " + (this.count - 1) + ')');
        }
        if (count == 0) {
            return 0L;
        }
        if (refCnt() == 0) {
            throw new IllegalReferenceCountException(0);
        }
        open();
        long written = file.transferTo(this.position + position, count, target);
        if (written > 0) {
            transferred += written;
        } else if (written == 0) {
            validate(this, position);
        }
        return written;
    }

    // other code omitted
}

From the source code, we can see that FileRegion is actually a wrapper around FileChannel and does not perform any special operations of its own: it relies on the transferTo() method of JDK NIO’s FileChannel to transfer the file. FileRegion is therefore operating system-level zero-copy and is very helpful for transferring large files.

At this point, all the zero-copy technologies related to Netty have been introduced. It can be seen that Netty has made more advanced designs and optimizations for ByteBuf.

Summary #

Zero-copy is a commonly used technique in network programming to optimize the performance of network data transmission. This article introduced zero-copy technology in the Linux operating system and in Netty. In addition to supporting operating system-level zero-copy, Netty also provides user-level zero-copy features in five aspects: off-heap memory, CompositeByteBuf, Unpooled.wrappedBuffer, ByteBuf.slice, and FileRegion. From the perspective of the operating system, zero-copy is a broad concept: any approach that reduces unnecessary CPU copies can be considered zero-copy.

Finally, here is a question to consider: does using the zero-copy transferTo() method necessarily make file copying more efficient than traditional I/O?