10 Netty Primer: Using It for Network Programming Is Good (Part 2) #

In the previous lesson, we gave a brief introduction to Netty's design from the perspectives of its I/O model and thread model. In this lesson, we will dive deeper into Netty, introducing the functionality of its core components and outlining their implementation principles, so that you gain a better understanding of Netty's internals.

We will continue using the same approach as before to introduce Netty's core components. First, we will look at the components that abstract Netty's I/O model, such as Channel and Selector. Next, we will introduce the components related to the thread model, mainly NioEventLoop and NioEventLoopGroup. Lastly, we will delve into the components related to data handling in Netty, such as ByteBuf, along with the relevant memory-management knowledge.

Channel #

A Channel is Netty's abstraction of a network connection, and its core function is to perform network I/O operations. Different protocols and different blocking modes correspond to different Channel types. The most commonly used are the NIO Channels; below are some common NIO Channel types.

  • NioSocketChannel: Corresponds to an asynchronous TCP socket connection.
  • NioServerSocketChannel: Corresponds to an asynchronous server-side TCP socket connection.
  • NioDatagramChannel: Corresponds to an asynchronous UDP connection.

The asynchronous Channels listed above primarily provide asynchronous network I/O operations, such as connection establishment and read/write operations. An asynchronous call returns immediately and does not guarantee that the requested I/O operation has completed by the time the call returns. What it returns is a ChannelFuture object, and regardless of whether the I/O operation eventually succeeds or fails, the Channel can notify the caller through listeners. We can register listeners on the ChannelFuture to listen for the results of I/O operations.

Netty also supports synchronous I/O operations, but they are rarely used in practice; in the vast majority of cases, we use asynchronous I/O. Although a ChannelFuture object is returned immediately, we cannot know right away whether the I/O operation succeeded. Instead, we register a listener on the ChannelFuture, and the listener is triggered automatically when the operation succeeds or fails.
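As a minimal sketch (the host, port, and handler below are hypothetical placeholders, not part of the original example), the following shows how a listener registered on the ChannelFuture returned by connect() is notified once the asynchronous operation completes. The classes come from io.netty.bootstrap and io.netty.channel.

EventLoopGroup group = new NioEventLoopGroup();
Bootstrap bootstrap = new Bootstrap()
        .group(group)
        .channel(NioSocketChannel.class)
        .handler(new ChannelInboundHandlerAdapter());   // placeholder handler

// connect() returns a ChannelFuture immediately; the I/O may not have finished yet.
ChannelFuture future = bootstrap.connect("127.0.0.1", 8080);   // hypothetical address
future.addListener((ChannelFutureListener) f -> {
    if (f.isSuccess()) {
        System.out.println("Connected: " + f.channel());
    } else {
        System.err.println("Connect failed: " + f.cause());
        group.shutdownGracefully();
    }
});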

In addition, a Channel also provides functionalities such as detecting the current network connection status, which can help us implement automatic reconnection after the network connection is abnormally disconnected.

Selector #

A Selector is the abstraction of a multiplexer and one of the core foundational components of Java NIO. Netty implements I/O multiplexing on top of Selector objects. A Selector continuously checks whether any of the Channels registered with it have ready I/O events, such as readable (OP_READ), writable (OP_WRITE), or connection-accept (OP_ACCEPT) events, so user threads do not have to poll for them. In this way, a single thread can listen for events occurring on many Channels.
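To make the idea concrete, here is a plain Java NIO sketch (not Netty-specific; the port is hypothetical) of a single thread multiplexing many channels through one Selector, using classes from java.nio.channels:

Selector selector = Selector.open();
ServerSocketChannel server = ServerSocketChannel.open();
server.bind(new InetSocketAddress(9000));            // hypothetical port
server.configureBlocking(false);
server.register(selector, SelectionKey.OP_ACCEPT);   // interested in accept events

while (true) {
    selector.select();                               // blocks until at least one channel is ready
    Iterator<SelectionKey> it = selector.selectedKeys().iterator();
    while (it.hasNext()) {
        SelectionKey key = it.next();
        it.remove();
        if (key.isAcceptable()) {                    // new connection: register it for reads
            SocketChannel client = server.accept();
            client.configureBlocking(false);
            client.register(selector, SelectionKey.OP_READ);
        } else if (key.isReadable()) {
            // ... read from (SocketChannel) key.channel() here
        }
    }
}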

ChannelPipeline & ChannelHandler #

When mentioning Pipeline, you might immediately think of the pipeline in Linux commands, which can implement passing the output of one command as the input of another command. The ChannelPipeline in Netty can also achieve similar functionality: ChannelPipeline passes the processed data from one ChannelHandler as the input to the next ChannelHandler.

The following figure, quoted from Netty Javadoc, describes how ChannelHandlers typically handle I/O events in the ChannelPipeline. Netty defines two types of events: inbound events and outbound events. Similar to the data in Linux pipelines, these two types of events are passed in the ChannelPipeline and may also contain attached data. Multiple ChannelHandlers (either ChannelInboundHandler or ChannelOutboundHandler) can be registered on top of the ChannelPipeline, and we determine the order in which I/O events are processed during ChannelHandler registration. This is the typical chain of responsibility pattern.

Drawing 0.png

From the figure, we can also see that I/O events do not automatically propagate within the ChannelPipeline. We need to use the appropriate methods defined in ChannelHandlerContext to propagate the events, such as the fireChannelRead() method and write() method.

Here, we provide a simple example as shown below. In this ChannelPipeline, we have added 5 ChannelHandler objects:

ChannelPipeline p = socketChannel.pipeline();
p.addLast("1", new InboundHandlerA());
p.addLast("2", new InboundHandlerB());
p.addLast("3", new OutboundHandlerA());
p.addLast("4", new OutboundHandlerB());
p.addLast("5", new InboundOutboundHandlerX());

  • For inbound events, the processing sequence is: 1 → 2 → 5.
  • For outbound events, the processing sequence is: 5 → 4 → 3.

As you can see, the order in which inbound and outbound events are handled is exactly the opposite.
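The handler classes in the example above are placeholders; a minimal sketch of two of them shows how events are propagated explicitly through ChannelHandlerContext (fireChannelRead() for inbound, write() for outbound):

class InboundHandlerA extends ChannelInboundHandlerAdapter {
    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        System.out.println("InboundHandlerA: " + msg);
        ctx.fireChannelRead(msg);   // propagate to the next inbound handler (2, then 5)
    }
}

class OutboundHandlerA extends ChannelOutboundHandlerAdapter {
    @Override
    public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) {
        System.out.println("OutboundHandlerA: " + msg);
        ctx.write(msg, promise);    // propagate towards the head of the pipeline (the socket)
    }
}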

Inbound events are generally triggered by I/O threads. For example, suppose we define a custom message protocol in which a complete message consists of a message header and a message body: the header carries metadata such as the message type, control bits, and data length, while the body carries the actual data to be transmitted. When transmitting a large piece of data, the client usually splits it across multiple messages, and the server decodes and buffers the data as it arrives. Once enough bytes have been accumulated to assemble a complete, meaningful message, it is passed to the next ChannelHandlerContext for further processing.

Netty provides many decoder implementations for decoding the data that has been read. A decoder may consume multiple channelRead() events and only triggers the channelRead() method of the next ChannelInboundHandler once meaningful data has been assembled.
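As a hedged sketch of such a decoder, assuming a hypothetical protocol whose header is a single 4-byte length field followed by the body, a ByteToMessageDecoder subclass could look like this:

import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.ByteToMessageDecoder;
import java.util.List;

public class LengthBasedDecoder extends ByteToMessageDecoder {
    @Override
    protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
        if (in.readableBytes() < 4) {
            return;                         // header incomplete, wait for more channelRead() events
        }
        in.markReaderIndex();
        int bodyLength = in.readInt();
        if (in.readableBytes() < bodyLength) {
            in.resetReaderIndex();          // body incomplete, roll back and wait for more data
            return;
        }
        out.add(in.readBytes(bodyLength));  // a full message: passed on to the next inbound handler
    }
}

In practice, Netty's built-in LengthFieldBasedFrameDecoder already covers this common length-field case.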

Outbound events are the opposite of inbound events and are generally triggered by users.

The ChannelHandler interface itself does not define event-handling methods; these are defined by its sub-interfaces. As shown in the figure below, ChannelInboundHandler intercepts and handles inbound events, while ChannelOutboundHandler intercepts and handles outbound events.

Drawing 1.png

The ChannelInboundHandlerAdapter and ChannelOutboundHandlerAdapter provided by Netty exist mainly to keep events flowing: their default implementations simply forward each event to the next handler. Therefore, when implementing a custom ChannelHandler, you can extend the corresponding adapter class, override only the event-handling methods you care about, and rely on the default implementations for the events you are not interested in, which improves development efficiency.

Many methods in ChannelHandler require a ChannelHandlerContext parameter. ChannelHandlerContext abstracts the relationships between ChannelHandlers and between ChannelHandlers and ChannelPipelines. The event propagation in ChannelPipeline mainly depends on ChannelHandlerContext. In ChannelHandlerContext, the relationships between ChannelHandlers are maintained, so we can get the successor node of the current ChannelHandler from ChannelHandlerContext and propagate the event to the subsequent ChannelHandler.

ChannelHandlerContext extends AttributeMap, so it provides the attr() method for setting and removing state attributes. In our business logic we can store the required state in the ChannelHandlerContext, and these attributes are then propagated along with it. Channel also maintains an AttributeMap; starting from Netty 4.1, the attributes accessed through Channel and through ChannelHandlerContext are the same and are scoped to the entire ChannelPipeline.
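A small sketch (the attribute key and handler below are hypothetical) of storing per-connection state with attr(); AttributeKey comes from io.netty.util:

public class AuthHandler extends ChannelInboundHandlerAdapter {
    // Hypothetical key used to remember per-connection login state.
    private static final AttributeKey<Boolean> LOGGED_IN = AttributeKey.valueOf("loggedIn");

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        // Since Netty 4.1, Channel.attr() and ChannelHandlerContext.attr() operate on the
        // same map, so this state is visible to every handler in the pipeline.
        ctx.channel().attr(LOGGED_IN).set(Boolean.TRUE);
        ctx.fireChannelRead(msg);
    }
}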

From the above analysis, we can understand that each Channel corresponds to a ChannelPipeline, and each ChannelHandlerContext corresponds to a ChannelHandler. As shown in the figure below:

1.png

Finally, it should be noted that if you need to execute time-consuming logic in a ChannelHandler, such as database operations or other network or disk I/O, it is generally recommended to specify a separate thread pool when registering the handler with the ChannelPipeline, so that the ChannelHandler's operations are executed asynchronously off the I/O thread, as shown in the sketch below.
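For example, a sketch of registering a handler with a dedicated executor group; DbHandler is a hypothetical time-consuming handler, and DefaultEventExecutorGroup comes from io.netty.util.concurrent:

// 16 threads dedicated to slow business logic, kept off the I/O threads.
DefaultEventExecutorGroup businessGroup = new DefaultEventExecutorGroup(16);

ChannelPipeline pipeline = socketChannel.pipeline();
pipeline.addLast("codec", new StringDecoder());
// DbHandler (hypothetical) will run on businessGroup instead of the Channel's NioEventLoop.
pipeline.addLast(businessGroup, "dbHandler", new DbHandler());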

NioEventLoop #

In the previous introduction to Netty’s thread model, we briefly mentioned the NioEventLoop component. At that time, for ease of understanding, we simply described it as a thread.

An EventLoop is driven by a single thread that never changes during its lifetime. A NioEventLoop holds a Selector object with which multiple Channels can be registered, so one NioEventLoop can serve many Channels at the same time, while each Channel is associated with exactly one NioEventLoop. This establishes the association between threads and Channels.

As we know, I/O operations in Channel are handled by ChannelHandlers registered in ChannelPipeline, and the logic of ChannelHandlers is executed by the thread associated with the corresponding NioEventLoop.

In addition to being bound to a thread, NioEventLoop also maintains two task queues:

  • Normal task queue. User-generated normal tasks can be submitted to this queue for temporary storage, and NioEventLoop will execute the tasks immediately when it detects them in the queue. This is a multi-producer, single-consumer queue. Netty uses this queue to collect tasks generated by external user threads together and execute them serially in the Reactor thread. For example, when an external non-I/O thread calls the write() method of a Channel, Netty encapsulates it as a task and puts it into the TaskQueue. In this way, all I/O operations are executed serially in the I/O thread.

2.png

  • Scheduled task queue. When the user generates a timed operation in a non-I/O thread, Netty encapsulates the user’s timed operation as a scheduled task and puts it into this scheduled task queue, waiting for the corresponding NioEventLoop to execute it serially.

From this analysis, we can see that NioEventLoop mainly does three things: listening to I/O events, executing normal tasks, and executing scheduled tasks. The amount of time NioEventLoop allocates to different types of tasks can be configured. Also, in order to prevent NioEventLoop from blocking on a task for a long time, time-consuming operations are generally submitted to other business thread pools for processing.
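A small sketch, assuming channel is an already-registered Netty Channel, of submitting a normal task and a scheduled task; both run serially on that Channel's NioEventLoop thread:

// Normal task: executed on the Channel's I/O thread once it is picked up from the task queue.
channel.eventLoop().execute(() -> channel.writeAndFlush("ping"));

// Scheduled task: placed in the scheduled-task queue and executed 30 seconds later on the same thread.
channel.eventLoop().schedule(() -> channel.writeAndFlush("heartbeat"), 30, TimeUnit.SECONDS);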

NioEventLoopGroup #

NioEventLoopGroup represents a group of NioEventLoops. In order to make full use of multi-core CPU resources, Netty usually has multiple NioEventLoops working at the same time, and the number of threads can be determined by the user. Netty calculates a default value based on the actual number of processor cores, with the specific calculation formula being: CPU core count * 2. Of course, we can also adjust it manually according to the actual situation.

When a Channel is created, Netty calls the next() method provided by NioEventLoopGroup to obtain one NioEventLoop instance according to certain rules, and registers the Channel with that NioEventLoop instance. After that, the events on the Channel will be handled by that NioEventLoop. The association among EventLoopGroup, EventLoop, and Channel is shown in the following figure:

Drawing 4.png

Previously, we mentioned that there are two NioEventLoopGroups on Netty's server side: BossEventLoopGroup and WorkerEventLoopGroup. Usually, one server port needs only one ServerSocketChannel, corresponding to one Selector and one NioEventLoop thread. BossEventLoop is responsible for receiving client connection events, i.e. OP_ACCEPT events, and then hands the created NioSocketChannel over to the WorkerEventLoopGroup. The WorkerEventLoopGroup selects one of its NioEventLoops via the next() method, registers the NioSocketChannel with the Selector maintained by that NioEventLoop, and that NioEventLoop handles the Channel's subsequent I/O events.

image

As shown in the figure above, BossEventLoopGroup is usually single-threaded: its BossEventLoop maintains a Selector object on which the ServerSocketChannel is registered. The BossEventLoop continuously polls the Selector for connection events; when one occurs, it establishes the connection with the client through the accept operation, creating a SocketChannel object, and hands that SocketChannel over to the WorkerEventLoopGroup. In the Reactor pattern, the WorkerEventLoopGroup maintains multiple EventLoops, and each EventLoop listens for I/O events on the SocketChannels assigned to it and dispatches the specific events to the business thread pool for processing.
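Putting the two groups together, a minimal echo-server bootstrap could look roughly like this; the port and handler are hypothetical placeholders, and the classes come from io.netty.bootstrap, io.netty.channel, and io.netty.channel.socket:

EventLoopGroup bossGroup = new NioEventLoopGroup(1);    // accepts connections (OP_ACCEPT)
EventLoopGroup workerGroup = new NioEventLoopGroup();   // handles I/O, defaults to CPU cores * 2 threads
try {
    ServerBootstrap bootstrap = new ServerBootstrap()
            .group(bossGroup, workerGroup)
            .channel(NioServerSocketChannel.class)
            .childHandler(new ChannelInitializer<SocketChannel>() {
                @Override
                protected void initChannel(SocketChannel ch) {
                    // Handlers added here run on the worker NioEventLoop assigned to this Channel.
                    ch.pipeline().addLast(new ChannelInboundHandlerAdapter() {
                        @Override
                        public void channelRead(ChannelHandlerContext ctx, Object msg) {
                            ctx.writeAndFlush(msg);   // echo the message back
                        }
                    });
                }
            });
    ChannelFuture future = bootstrap.bind(8080).sync();   // hypothetical port
    future.channel().closeFuture().sync();
} finally {
    bossGroup.shutdownGracefully();
    workerGroup.shutdownGracefully();
}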

ByteBuf #

From the previous introduction, we understood the flow of data in Netty. Here, let’s introduce the container for data - ByteBuf.

When performing cross-process remote communication, we need to send and receive data in the form of bytes. Both the sender and the receiver need an efficient data container to cache the byte data, and ByteBuf plays the role of such a data container.

ByteBuf is similar to a byte array, which maintains a reader index and a writer index to control read and write operations on the data in ByteBuf. The two indexes satisfy the following inequality:

0 <= readerIndex <= writerIndex <= capacity

Drawing 6.png

ByteBuf provides read and write operation APIs mainly to operate on the underlying byte container (byte[], ByteBuffer, etc.) and the read and write indexes. If you are interested, you can refer to the relevant API documentation. Here, we will not go into detail.
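For illustration, a tiny sketch of how the two indexes move during writes and reads, using the Unpooled helper from io.netty.buffer:

ByteBuf buf = Unpooled.buffer(16);       // capacity 16, readerIndex = writerIndex = 0
buf.writeInt(42);                        // writerIndex -> 4
buf.writeBytes(new byte[]{1, 2});        // writerIndex -> 6
int value = buf.readInt();               // readerIndex -> 4
// invariant holds: 0 <= readerIndex (4) <= writerIndex (6) <= capacity (16)
System.out.println(buf.readerIndex() + ", " + buf.writerIndex() + ", " + buf.capacity());
buf.release();                           // ByteBuf is reference-counted; release it when done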

There are three major types of ByteBuf in Netty:

  • Heap Buffer: This is the most commonly used type of ByteBuf. It stores data in the JVM heap space and is implemented by allocating an array in the JVM heap to store data. Heap buffers can be allocated quickly and easily released by the garbage collector when not in use. It also provides methods to directly access the underlying byte[] that stores the data.
  • Direct Buffer: A direct buffer uses off-heap memory to store data and does not occupy space in the JVM heap. When using direct buffers, you should consider the maximum memory capacity that the application needs to use and how to release it in a timely manner. Direct buffers perform well when transmitting data using sockets. However, direct buffers have drawbacks. Without JVM garbage collection management, allocating and releasing memory is more complicated compared to heap buffers. Netty mainly uses memory pools to solve such problems, which is one of the reasons why Netty uses memory pools.
  • Composite Buffer: We can create multiple different ByteBufs and provide a view that combines them, which is the CompositeByteBuf. It works like a list, allowing dynamic addition and removal of ByteBufs.
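For reference, a short sketch of allocating each of the three buffer types listed above (the sizes are arbitrary):

ByteBuf heap = Unpooled.buffer(256);                          // heap buffer backed by a byte[]
ByteBuf direct = ByteBufAllocator.DEFAULT.directBuffer(256);  // off-heap (direct) buffer

heap.writeBytes(new byte[]{1, 2, 3});
direct.writeBytes(new byte[]{4, 5, 6});

// A composite view over both buffers, without copying their contents;
// `true` means the composite's writerIndex is advanced to cover the added components.
CompositeByteBuf composite = Unpooled.compositeBuffer();
composite.addComponents(true, heap, direct);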

Memory Management #

Netty uses the ByteBuf object as the data container for I/O read and write operations. In fact, Netty’s memory management revolves around efficient allocation and release of ByteBuf objects. From the perspective of memory management, ByteBuf can be divided into two categories: Unpooled and Pooled.

  • Unpooled: the non-pooled memory management mode. Each allocation directly requests the memory backing the ByteBuf from the system, and after use it is released back through system calls. Unpooled delegates memory management entirely to the system without any special handling. It is convenient to use and is a good choice for ByteBufs that are not frequently allocated and released and where allocation performance is not critical.
  • Pooled: the pooled memory management mode. This mode pre-allocates a large block of memory to form a memory pool; when ByteBuf space is requested, a portion of the pool is wrapped as a ByteBuf for the service to use, and after use it is returned to the pool. As mentioned earlier, direct (off-heap) ByteBufs are more complicated to manage than heap buffers, and the pooled mode solves this problem well.
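A minimal sketch contrasting the two modes through their allocators (both classes are part of io.netty.buffer):

// Unpooled: each allocation requests fresh memory, and release() frees it directly.
ByteBuf unpooled = UnpooledByteBufAllocator.DEFAULT.directBuffer(256);
unpooled.release();

// Pooled: memory is sliced from a pre-allocated pool, and release() returns it to the pool for reuse.
ByteBuf pooled = PooledByteBufAllocator.DEFAULT.directBuffer(256);
pooled.release();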

Next, we will introduce the ByteBuf pooling technology provided by Netty from three aspects: efficient memory allocation and release, reducing memory fragmentation, and reducing lock contention in multi-threaded environments.

Netty first requests a contiguous block of memory from the system, called a Chunk (default size is 16MB), which is encapsulated by the PoolChunk object. Then, Netty further splits the chunk space into Pages. Each Chunk contains 2048 Pages by default, and each Page has a size of 8KB.

In the same Chunk, Netty manages Pages in different levels of granularity. As shown in the figure below, the groups in the 1st level from the bottom have a size of 1 * PageSize, and there are a total of 2048 groups. The groups in the 2nd level have a size of 2 * PageSize, and there are a total of 1024 groups. The groups in the 3rd level have a size of 4 * PageSize, and there are a total of 512 groups, and so on, until the top level.

Drawing 7.png

1. Memory Allocation & Release #

When the service requests memory from the memory pool, Netty rounds up the requested memory size to the nearest group size and then searches for an empty group in the corresponding level from left to right. For example, if the service requests 3 * PageSize of memory, rounding up results in a group size of 4 * PageSize. Netty will find an entirely empty group in this level and allocate memory from it, as shown in the figure below:

Drawing 8.png

After the 4 * PageSize group is allocated, to facilitate subsequent allocations the group is marked as fully used (red in the figure), and the coarser-grained groups that contain it are marked as partially used (yellow in the figure). Netty implements this algorithm with a complete balanced binary tree, whose underlying structure is a byte array, as shown in the following diagram:

Drawing 9.png

I won’t go into the specific implementation logic here. If you’re interested, you can refer to the Netty source code.

2. Handling Large and Small Objects #

When a large object that exceeds the capacity of a Chunk is requested for allocation, Netty no longer uses a pooled management approach. Instead, it creates a special non-pooled PoolChunk object to manage memory allocation on a per-request basis. When the object memory is released, the entire PoolChunk is released as well.

If the requested ByteBuf is much smaller than PageSize, for example a 256-byte ByteBuf, then according to the algorithm above a whole Page would have to be allocated for each small ByteBuf, wasting a lot of memory. Netty solves this problem by further subdividing Pages. The requested size is rounded up to the nearest multiple of 16 (or power of 2), and small Buffers smaller than PageSize are divided into two categories:

  • Tiny objects: The rounded size is a multiple of 16, such as 16, 32, 48, …, 496, for a total of 31 sizes.
  • Small objects: The rounded size is a power of 2, such as 512, 1024, 2048, 4096, for a total of 4 sizes.

In Netty's implementation, an idle Page is first requested from the PoolChunk, and each such Page is divided into small Buffers of one fixed size. These Pages are encapsulated by PoolSubpage objects, each of which records the size of the small Buffers it hands out and the remaining available space, and uses a bitmap to track the usage of each small block of memory (as shown in the diagram below). Although this scheme cannot completely eliminate memory fragmentation, it greatly reduces memory waste.

Drawing 10.png

To address the issue of limited capacity of a single PoolChunk, Netty combines multiple PoolChunks into a linked list and uses a PoolChunkList object to hold the head of the list.

Netty manages PoolChunkList and PoolSubpage through PoolArena.

PoolArena internally holds 6 PoolChunkList objects, each with a different usage range for the PoolChunks it holds, as shown in the following diagram:

Drawing 11.png

These 6 PoolChunkList objects form a doubly-linked list. When a memory allocation or release in a PoolChunk changes its usage, Netty checks whether the PoolChunk has moved outside the usage range of the PoolChunkList it currently belongs to. If so, the PoolChunk is moved along the doubly-linked list of the 6 PoolChunkLists into a new PoolChunkList whose usage range fits. Similarly, when a newly created PoolChunk allocates or releases memory, it is placed into the appropriate PoolChunkList according to the same logic.

Drawing 12.png

From the diagram above, we can see that the usage ranges of the 6 PoolChunkLists overlap. The reason for this design is that if a single critical value is used, when a PoolChunk is repeatedly requested and released, the memory usage will fluctuate around the critical value, causing it to move back and forth between the two linked lists of PoolChunkLists.

PoolArena holds 2 arrays of PoolSubpages, storing PoolSubpages for tiny Buffers and small Buffers respectively. PoolSubpages of the same size form linked lists, and the head nodes of the linked lists of different size PoolSubpages are stored in the tinySubpagePools or smallSubpagePools arrays, as shown in the following diagram:

Drawing 13.png

3. Concurrent Processing #

Memory allocation and release inevitably happen in multi-threaded scenarios. The tree marking in PoolChunk and the bitmap marking in PoolSubpage are not thread-safe and require lock synchronization. To reduce lock contention, Netty creates multiple PoolArena objects in advance (the default number is 2 * CPU cores). When a thread requests pooled memory for the first time, it finds the PoolArena currently used by the fewest threads and caches it in the thread-local variable PoolThreadCache, thereby binding the thread to that PoolArena.

Netty also provides lazy release functionality to improve concurrent performance. When memory is released, PoolArena does not release it immediately. Instead, it tries to store the memory-related PoolChunk and Chunk offset position information in a fixed-size cache queue stored in a ThreadLocal. If the cache queue is full, it immediately releases the memory. When a new allocation request arrives, PoolArena first checks the thread-local cache queue to see if there is available cache. If there is, it allocates directly, improving allocation efficiency.

Summary #

In this lesson, we mainly introduced the functions and principles of Netty’s core components:

  • First, we introduced components such as Channel, ChannelFuture, and Selector, which are the core of I/O multiplexing.
  • Then we introduced components such as EventLoop and EventLoopGroup, which are closely related to the main and sub Reactor thread model used by Netty.
  • Finally, we delved into Netty’s memory management, mainly from the perspectives of memory allocation and management, memory fragmentation optimization, and concurrent memory allocation.

Do you know any other excellent network libraries or network layer designs? Feel free to discuss by leaving a comment.