30 Practical conclusion, some of the best practices of Netty in project development #

This is the last lesson of the column. First of all, congratulations on your perseverance in learning so far. You are not far from becoming a Netty expert! In this lesson, I will combine my own practical experience to summarize some best practices of Netty, helping you review the knowledge points from previous lessons and further improve your advanced skills in Netty.

In this lesson, we will present the content in the form of a list of knowledge points, focusing only on the core points of Netty. More detailed implementation principles need to be studied by examining the source code after class.

Performance #

Network Parameter Optimization #

Netty provides ChannelOption to optimize TCP parameter configuration. In order to improve the throughput of network communication, it is necessary for us to master some optional network parameters. In the previous lessons, we have introduced some commonly used parameters. This time, we will further expand on them.

  • SO_SNDBUF/SO_RCVBUF

The size of the TCP send buffer and receive buffer. To achieve maximum network throughput, SO_SNDBUF should not be smaller than the product of bandwidth and delay. SO_RCVBUF holds received data until it is read by the application process; if SO_RCVBUF fills up, the receiver's TCP advertises a zero window to the peer, ensuring that SO_RCVBUF does not overflow.

The size setting of SO_SNDBUF/SO_RCVBUF is recommended to be based on the average size of messages, not the maximum message size. Setting it based on the maximum message size will result in additional memory waste. A more flexible way is to dynamically adjust the size of the buffer. This is where the advantage of ByteBuf comes in. Netty’s ByteBuf supports dynamic capacity adjustment, and provides ready-to-use tools, such as the AdaptiveRecvByteBufAllocator, which dynamically adjusts the capacity of the receive buffer.

  • TCP_NODELAY

Whether to enable the Nagle algorithm. The Nagle algorithm accumulates network packets in a cache and sends them only when a certain amount has accumulated, in order to avoid frequent transmission of small packets. The Nagle algorithm is very effective in scenarios with a large amount of traffic, but it can cause some data delay. If you are sensitive to data transmission latency, you should disable this parameter.

  • SO_BACKLOG

The maximum length of the request queue for completed three-way handshakes. The server may handle multiple connections at the same time. In high-concurrency scenarios with a large number of connections, this parameter should be appropriately increased. However, SO_BACKLOG should not be too large, otherwise it may not prevent SYN-Flood attacks.

  • SO_KEEPALIVE

Connection keep-alive. When the TCP SO_KEEPALIVE attribute is enabled, TCP will actively detect the connection status. Linux has set a default heartbeat frequency of 2 hours. The TCP keep-alive mechanism is mainly used to recycle connections that have been idle for a long time, and is not suitable for high real-time scenarios.
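Putting these parameters together, a server-side configuration might be sketched as follows. The concrete values here are illustrative assumptions, not recommendations; tune them against your own traffic:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.AdaptiveRecvByteBufAllocator;
import io.netty.channel.ChannelOption;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class TunedServerConfig {
    public static ServerBootstrap configure() {
        ServerBootstrap b = new ServerBootstrap();
        b.group(new NioEventLoopGroup(1), new NioEventLoopGroup())
            .channel(NioServerSocketChannel.class)
            // Queue length for connections that completed the three-way handshake
            .option(ChannelOption.SO_BACKLOG, 1024)
            // Disable the Nagle algorithm for latency-sensitive traffic
            .childOption(ChannelOption.TCP_NODELAY, true)
            // TCP-level keep-alive; an application-layer heartbeat is still advisable
            .childOption(ChannelOption.SO_KEEPALIVE, true)
            // Size buffers around the average message size, not the maximum
            .childOption(ChannelOption.SO_SNDBUF, 64 * 1024)
            .childOption(ChannelOption.SO_RCVBUF, 64 * 1024)
            // Let Netty grow and shrink the receive buffer dynamically
            .childOption(ChannelOption.RCVBUF_ALLOCATOR, new AdaptiveRecvByteBufAllocator());
        return b;
    }
}
```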

In high-connection scenarios, you may encounter error messages such as “too many open files.” Therefore, it is necessary to optimize the maximum file handle number on the Linux operating system. You can achieve this by adding the following configuration in the file /etc/security/limits.conf:

* soft nofile 1000000
* hard nofile 1000000

After making the modifications, start a new login session for the configuration to take effect, then use the ulimit -a command to check whether the parameters are in effect.

The Necessity of a Business Thread Pool #

Netty is implemented based on the Reactor thread model. The number of I/O threads is fixed and resources are precious. The ChannelPipeline is responsible for the propagation of all events. If any ChannelHandler needs to perform time-consuming operations, the I/O threads will be blocked, and the entire system may be overwhelmed. Therefore, it is recommended to create a custom business thread pool in the ChannelHandler to handle time-consuming operations. Taking an RPC framework as an example, when the service provider processes an RPC request call, the RPC request is submitted to the custom business thread pool for execution, as shown below:

public class RpcRequestHandler extends SimpleChannelInboundHandler<MiniRpcProtocol<MiniRpcRequest>> {

    @Override
    protected void channelRead0(ChannelHandlerContext ctx, MiniRpcProtocol<MiniRpcRequest> protocol) {
        RpcRequestProcessor.submitRequest(() -> {
            // Handle the RPC request
        });
    }
}
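RpcRequestProcessor is not shown above; a minimal sketch of such a business thread pool (the class name matches the snippet, but the pool size and queue capacity are assumptions) could look like this:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper backing RpcRequestHandler: a dedicated business pool
// that keeps time-consuming work off the Netty I/O threads.
public final class RpcRequestProcessor {
    private static final ThreadPoolExecutor POOL = new ThreadPoolExecutor(
            10, 10, 60L, TimeUnit.SECONDS,
            // Bounded queue so a traffic spike cannot exhaust memory
            new LinkedBlockingQueue<>(10_000));

    private RpcRequestProcessor() {
    }

    public static void submitRequest(Runnable task) {
        POOL.execute(task);
    }
}
```

Keeping the queue bounded is deliberate: when the pool is saturated it is better to fail fast (the default AbortPolicy) than to buffer requests without limit.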

Sharing ChannelHandlers #

We often use the new HandlerXXX() method to initialize channels. For each new connection established, new instances of HandlerA and HandlerB will be initialized. If the system hosts 10,000 connections, then 20,000 handlers will be initialized, causing a significant waste of memory.

ServerBootstrap b = new ServerBootstrap();
b.group(bossGroup, workerGroup)
    .channel(NioServerSocketChannel.class)
    .localAddress(new InetSocketAddress(port))
    .childHandler(new ChannelInitializer<SocketChannel>() {
        @Override
        public void initChannel(SocketChannel ch) {
            ch.pipeline()
                .addLast(new HandlerA())
                .addLast(new HandlerB());
        }
    });

To solve the above problem, Netty provides the @Sharable annotation to decorate ChannelHandler, indicating that there is only one instance of this ChannelHandler globally, and it will be shared by multiple ChannelPipelines. Therefore, it is important to note that the ChannelHandler decorated with @Sharable must be stateless in order to ensure thread safety.
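As an illustration (MetricsHandler is a made-up name), a sharable handler keeps no per-connection mutable state, so one instance can safely serve every pipeline:

```java
import java.util.concurrent.atomic.AtomicLong;

import io.netty.channel.ChannelHandler.Sharable;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

// One stateless instance shared by every ChannelPipeline.
@Sharable
public class MetricsHandler extends ChannelInboundHandlerAdapter {
    // A thread-safe counter is fine; mutable per-connection fields are not.
    private final AtomicLong messages = new AtomicLong();

    @Override
    public void channelRead(ChannelHandlerContext ctx, Object msg) {
        messages.incrementAndGet();
        ctx.fireChannelRead(msg);
    }

    public long messageCount() {
        return messages.get();
    }
}
```

Usage: create a single instance, e.g. `static final MetricsHandler METRICS = new MetricsHandler();`, and add that same instance inside `initChannel` with `ch.pipeline().addLast(METRICS)` instead of calling `new` per connection.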

Setting High and Low Watermarks #

The high and low watermarks, WRITE_BUFFER_HIGH_WATER_MARK and WRITE_BUFFER_LOW_WATER_MARK, are two very important flow-control parameters. When Netty writes data, it accumulates the number of pending bytes and checks whether the write buffer exceeds the configured high watermark. If the high watermark is exceeded, the Channel is set to an unwritable state; it only returns to a writable state once the amount of buffered data drops below the low watermark. Netty's default low and high watermarks are 32 KB and 64 KB respectively. You can set them based on the actual situation of the sender and receiver; if you do not have enough test data as a reference, it is recommended not to change them arbitrarily. The high and low watermarks are set as follows:

// Server
ServerBootstrap bootstrap = new ServerBootstrap();
bootstrap.childOption(ChannelOption.WRITE_BUFFER_HIGH_WATER_MARK, 32 * 1024);
bootstrap.childOption(ChannelOption.WRITE_BUFFER_LOW_WATER_MARK, 8 * 1024);

// Client
Bootstrap bootstrap = new Bootstrap();
bootstrap.option(ChannelOption.WRITE_BUFFER_HIGH_WATER_MARK, 32 * 1024);
bootstrap.option(ChannelOption.WRITE_BUFFER_LOW_WATER_MARK, 8 * 1024);


When the cache exceeds the high watermark, the Channel is set to the unwritable state, and calling the `isWritable()` method will return false. It is recommended to use the `isWritable()` method to check the cache level before writing data to prevent OOM caused by slow processing on the receiving end. The recommended usage is as follows:

```java
if (ctx.channel().isActive() && ctx.channel().isWritable()) {
    ctx.writeAndFlush(message);
} else {
    // The channel is not writable: queue the message, drop it,
    // or apply backpressure to the producer here
}
```

GC Optimization #

Optimizing JVM parameters for network applications in different scenarios can greatly improve performance and avoid OOM risks. However, the characteristics of different business systems differ. Here are some important points to consider.

  • Heap Memory: -Xms and -Xmx parameters control the maximum size of JVM Heap. Setting a reasonable value for -Xmx helps reduce GC overhead and improve system throughput. -Xms represents the initial value of the JVM Heap. For production servers, it is recommended to set both -Xms and -Xmx to the same value.
  • Off-Heap Memory: DirectByteBuffer is most likely to cause OOM. The recycling of DirectByteBuffer objects relies on Old GC or Full GC to trigger cleaning. If no Old GC or Full GC is performed for a long time, even if off-heap memory is no longer used, it will continue to occupy memory without releasing it. It is best to use the JVM parameter -XX:MaxDirectMemorySize to set the upper limit of off-heap memory. When the off-heap memory size exceeds this threshold, a Full GC will be triggered for cleaning and recycling. If the allocation of off-heap memory cannot be satisfied even after a Full GC, the program will throw an OOM exception.
  • Young Generation: -Xmn adjusts the size of the young generation, and -XX:SurvivorRatio sets the SurvivorRatio and Eden ratio. Frequent Young GCs are often encountered. It is important to have a clear understanding of the basic distribution of objects in the program. If there are a large number of short-lived objects, it is recommended to increase the size of the young generation; otherwise, increase the size of the old generation. For example, in scenarios such as millions of long-lived connections and push services with low latency sensitivity, optimizing the size of the young generation and the proportions of each area can bring greater benefits.
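These guidelines can be turned into a concrete set of startup flags. The following is only a sketch; the sizes are assumptions for a hypothetical server with 4 GB of heap (and `my-netty-app.jar` is a placeholder) and must be tuned against your own GC logs:

```shell
# Identical initial and max heap to avoid runtime resizing; a large
# young generation for short-lived objects; a hard cap on off-heap
# memory so leaks fail fast with an OOM instead of exhausting the host.
java -Xms4g -Xmx4g \
     -Xmn1g \
     -XX:SurvivorRatio=8 \
     -XX:MaxDirectMemorySize=1g \
     -jar my-netty-app.jar
```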

Memory Pooling & Object Pooling #

From the perspective of memory allocation, ByteBuf can be divided into heap memory (HeapByteBuf) and off-heap memory (DirectByteBuf). Although allocating and deallocating DirectByteBuf is slower than HeapByteBuf, it avoids an extra memory copy during socket reads and writes, resulting in better overall performance.

To reduce the frequent creation and destruction of off-heap memory, Netty provides a pooled type called PooledDirectByteBuf. Netty reserves a continuous block of memory as the ByteBuf memory pool. If there is a need for off-heap memory allocation, it can be directly obtained from the memory pool. After use, it must be returned to the memory pool, otherwise it will cause severe memory leaks. Enabling memory pooling in Netty can be specified when creating the client or server, as shown in the example code:

bootstrap.option(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT);
bootstrap.childOption(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT);
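A sketch of the pooled-buffer life cycle — the essential point is the `release()` in `finally`, since a pooled buffer that is never returned leaks its backing chunk:

```java
import java.nio.charset.StandardCharsets;

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

public class PooledBufferExample {
    public static String roundTrip() {
        // Borrow a direct buffer from the shared pool
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(256);
        try {
            buf.writeBytes("hello".getBytes(StandardCharsets.UTF_8));
            return buf.toString(StandardCharsets.UTF_8);
        } finally {
            // Pooled buffers MUST be returned to the pool; forgetting this
            // is the classic Netty off-heap memory leak
            buf.release();
        }
    }
}
```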

Both object pooling and memory pooling aim to improve the concurrent processing capacity of Netty. In project development, we often cache some common objects. When we need these objects, we prefer to get them from the object pool. By reusing objects, we avoid the performance loss caused by frequent creation and destruction, and it is also friendly to JVM GC. If you are working on a high-performance network application system, you may consider trying the Recycler object pool provided by Netty. The usage of the Recycler object pool has been introduced in previous lessons, and let’s review it here together. Assuming we have a User class that needs to reuse User objects, the implementation code is as follows:

public class UserCache {

    private static final Recycler<User> userRecycler = new Recycler<User>() {
        @Override
        protected User newObject(Handle<User> handle) {
            return new User(handle);
        }
    };

    static final class User {
        private String name;
        private Recycler.Handle<User> handle;

        public User(Recycler.Handle<User> handle) {
            this.handle = handle;
        }

        public void setName(String name) {
            this.name = name;
        }

        public String getName() {
            return name;
        }

        public void recycle() {
            handle.recycle(this);
        }
    }

    public static void main(String[] args) {
        User user1 = userRecycler.get(); // 1. Get a User object from the object pool
        user1.setName("hello");          // 2. Set properties on the User object
        user1.recycle();                 // 3. Return the object to the pool
        User user2 = userRecycler.get(); // 4. Get an object from the pool again
        System.out.println(user2.getName());
        System.out.println(user1 == user2);
    }
}

From this code, it can be seen that the core goal of Netty’s memory pool and Recycler object pool optimization is to reduce the overhead of resource allocation, avoiding excessive memory consumption and GC pressure caused by a large number of transient objects. You can review the principles of memory pool and object pool in the previous courses “Netty High-Performance Memory Management Design” and “Lightweight Object Recycling: Recycler Object Pool Technology” to further understand and digest them.

Native Support #

Starting from version 4.0.16, Netty provides a native socket transport, written in C and invoked through JNI. Compared with JDK NIO, it has higher performance and lower GC cost, and supports more TCP parameters.

<dependency>
    <groupId>io.netty</groupId>
    <artifactId>netty-transport-native-epoll</artifactId>
    <version>4.1.42.Final</version>
    <!-- The native artifact is platform-specific -->
    <classifier>linux-x86_64</classifier>
</dependency>

Using Netty Native is very simple, you just need to replace the corresponding classes:

  • NioEventLoopGroup → EpollEventLoopGroup

  • NioServerSocketChannel → EpollServerSocketChannel

  • NioSocketChannel → EpollSocketChannel

Thread Binding #

If you often focus on system performance tuning, you must have explored the dark art of CPU affinity on Linux operating systems. CPU affinity means that threads on a multi-core CPU machine can be forced to run on a specific CPU and will not be scheduled to other CPUs. It is also known as core binding. When a thread is bound to a specific CPU, not only can the cost of CPU switching be avoided, but the CPU cache hit rate can also be improved, resulting in improved system performance.

Implementing core binding in C/C++ or Golang is very easy, but unfortunately it is more cumbersome in Java. Currently, there is an open-source affinity library in Java, with its GitHub address being https://github.com/OpenHFT/Java-Thread-Affinity. If you want to introduce and use it in your project, you need to first add the Maven dependency:

<dependency>    
    <groupId>net.openhft</groupId>    
    <artifactId>affinity</artifactId>    
    <version>3.0.6</version>    
</dependency>

The affinity library can be easily integrated with Netty. One common way is to create an AffinityThreadFactory and pass it to the EventLoopGroup. The AffinityThreadFactory is responsible for creating worker threads and binding them to specific cores. The implementation code is as follows:

EventLoopGroup bossGroup = new NioEventLoopGroup(1);
ThreadFactory threadFactory = new AffinityThreadFactory("worker", AffinityStrategies.DIFFERENT_CORE);
EventLoopGroup workerGroup = new NioEventLoopGroup(4, threadFactory);
ServerBootstrap serverBootstrap = new ServerBootstrap().group(bossGroup, workerGroup);

High Availability #

Connection Idle Detection + Heartbeat Detection #

Connection idle detection refers to periodically checking whether the connection has any data read or write activity. If the server keeps receiving data sent by the client, it means the connection is active. For dead connections, no data sent by the remote end can be received. If no data sent by the client is received within a certain period of time, it cannot be concluded that the connection is in a dead state. It is possible that the client simply doesn’t have any data to send for a long time, but the established connection is still healthy. Therefore, the server also needs to use the mechanism of heartbeat detection to determine whether the client is alive.

The client can send a heartbeat packet to the server periodically. If no heartbeat is received for N consecutive intervals, the client can be judged offline or unhealthy. Connection idle detection and heartbeat detection together are therefore an effective means of handling dead connections. Usually, the idle-detection interval should be greater than two heartbeat intervals, mainly to tolerate network jitter that delays an individual heartbeat packet.
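In Netty, idle detection is usually built on the library's IdleStateHandler. A server-side sketch that closes a connection after 60 seconds without reads (the timeout value is an assumption):

```java
import java.util.concurrent.TimeUnit;

import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.timeout.IdleState;
import io.netty.handler.timeout.IdleStateEvent;
import io.netty.handler.timeout.IdleStateHandler;

public class IdleCheckInitializer extends ChannelInitializer<SocketChannel> {
    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline()
            // Fire an IdleStateEvent if nothing is read for 60 s
            .addLast(new IdleStateHandler(60, 0, 0, TimeUnit.SECONDS))
            .addLast(new ChannelDuplexHandler() {
                @Override
                public void userEventTriggered(ChannelHandlerContext ctx, Object evt) throws Exception {
                    if (evt instanceof IdleStateEvent
                            && ((IdleStateEvent) evt).state() == IdleState.READER_IDLE) {
                        // No data (and thus no heartbeat) within the window:
                        // treat the connection as dead and release it
                        ctx.close();
                    } else {
                        super.userEventTriggered(evt == null ? null : ctx, evt);
                    }
                }
            });
    }
}
```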

TCP already has the SO_KEEPALIVE parameter, so why add a heartbeat mechanism at the application layer? An application heartbeat not only shows that the connection is alive, but more importantly indicates whether the application itself is still working normally. TCP KEEPALIVE has serious limitations here: it was designed to clean up and recycle dead connections, it is not real-time, and it can only check whether the connection is active, not whether the application behind it is available. For example, a server may be overloaded or deadlocked and unable to do any useful work, yet its TCP connection still appears perfectly alive.

Decoder Protection #

When implementing data decoding in Netty, it needs to wait until there are enough bytes in the buffer to start decoding. In order to avoid the buffer caching too much data and causing memory exhaustion, we can set a maximum byte threshold in the decoder and notify other ChannelHandlers in the ChannelPipeline through the provided TooLongFrameException exception. Here is an example:

public class MyDecoder extends ByteToMessageDecoder {

    private static final int MAX_FRAME_LIMIT = 1024;

    @Override
    public void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
        int readable = in.readableBytes();
        if (readable > MAX_FRAME_LIMIT) {
            // Discard the oversized frame before signalling the failure
            in.skipBytes(readable);
            throw new TooLongFrameException("too long frame");
        }
        // decode
    }
}

Check whether the readable bytes in the buffer are greater than MAX_FRAME_LIMIT. If they exceed, ignore these readable bytes. This is an effective protection measure for application in specific scenarios.

Thread Pool Isolation #

As we know, if there is complex and time-consuming business logic, it is recommended to customize a new business thread pool in the ChannelHandler processor to submit the time-consuming operations for execution. It is recommended to split multiple business thread pools based on the core level of the business logic. If a certain type of business logic causes the exhaustion of the thread pool resources, it will not affect other business logics, thereby improving the overall availability of the application. For Netty I/O threads, each EventLoop can be bound to a certain type of business thread pool to avoid multi-thread lock contention. The following diagram shows the configuration:

(Figure: each EventLoop bound to its own dedicated business thread pool)
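One way to achieve this isolation with Netty's own primitives is to give each class of business handler its own EventExecutorGroup when adding it to the pipeline. The pool sizes and handler names below are illustrative placeholders:

```java
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.util.concurrent.DefaultEventExecutorGroup;
import io.netty.util.concurrent.EventExecutorGroup;

public class IsolatedPipelineInitializer extends ChannelInitializer<SocketChannel> {
    // Separate pools per business category: exhausting one pool
    // does not starve the other, and I/O threads stay unblocked.
    private static final EventExecutorGroup ORDER_POOL = new DefaultEventExecutorGroup(16);
    private static final EventExecutorGroup REPORT_POOL = new DefaultEventExecutorGroup(4);

    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline()
            // Netty runs these handlers on the given executor group
            // instead of the channel's I/O EventLoop
            .addLast(ORDER_POOL, new OrderHandler())
            .addLast(REPORT_POOL, new ReportHandler());
    }

    // Placeholder business handlers for illustration
    static class OrderHandler extends ChannelInboundHandlerAdapter { }
    static class ReportHandler extends ChannelInboundHandlerAdapter { }
}
```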

Traffic Shaping #

Traffic shaping is a measure to actively control the output rate of service traffic, ensuring the downstream services can handle it smoothly. The difference between traffic shaping and flow control is that traffic shaping will not discard or reject messages. Regardless of the size of the traffic peak, it will control the traffic using the token bucket algorithm to output at a constant rate, as shown in the following diagram.

(Figure: the token bucket algorithm smooths bursty input into a constant output rate)

Netty provides three types of traffic shaping strategies by implementing the abstract class AbstractTrafficShapingHandler: GlobalTrafficShapingHandler, ChannelTrafficShapingHandler, and GlobalChannelTrafficShapingHandler. Their relationships are as follows:

GlobalChannelTrafficShapingHandler = GlobalTrafficShapingHandler + ChannelTrafficShapingHandler

The global traffic shaping handler GlobalTrafficShapingHandler operates across all Channels together, letting users set a global receive rate, send rate, and shaping check interval. The channel-level traffic shaping handler ChannelTrafficShapingHandler operates on a single Channel, so different Channels can have different shaping policies. As an analogy, a popular tourist attraction limits the total number of visitors at the main entrance (GlobalTrafficShapingHandler) and also limits the number of visitors at each individual sight inside (ChannelTrafficShapingHandler). The GlobalChannelTrafficShapingHandler combines both strategies.
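As a sketch, a GlobalTrafficShapingHandler could be installed like this. The rates are illustrative assumptions; note that a global shaper must be a single instance shared by every pipeline, created once with its own executor:

```java
import io.netty.channel.ChannelInitializer;
import io.netty.channel.socket.SocketChannel;
import io.netty.handler.traffic.GlobalTrafficShapingHandler;
import io.netty.util.concurrent.DefaultEventExecutorGroup;

public class ShapingInitializer extends ChannelInitializer<SocketChannel> {
    // One shared instance: write limit 1 MB/s, read limit 1 MB/s,
    // check interval 1000 ms.
    private static final GlobalTrafficShapingHandler SHAPER =
            new GlobalTrafficShapingHandler(
                    new DefaultEventExecutorGroup(1), 1024 * 1024, 1024 * 1024, 1000);

    @Override
    protected void initChannel(SocketChannel ch) {
        ch.pipeline().addLast(SHAPER);
        // ... business handlers follow ...
    }
}
```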

Traffic shaping alone cannot guarantee the system to be in a safe state. When the traffic peak is too high, the data will continue to accumulate in memory. Therefore, traffic shaping and flow control should be used together to ensure high availability of the system.

Troubleshooting Off-Heap Memory Leaks #

Off-heap memory leaks are a frequent pain point for Netty applications, which often show a Java process consuming a large amount of memory while heap usage stays low. Here are some basic approaches for troubleshooting off-heap memory leaks:

Off-Heap Memory Recovery #

Manually trigger a FullGC with jmap -histo:live <pid> and observe whether the off-heap memory has been recovered. If it is normally recovered, it is likely because the off-heap memory setting is too small and can be adjusted by using -XX:MaxDirectMemorySize. However, this cannot rule out the case of slow off-heap memory leak, which requires analysis with other tools.

Monitoring Off-Heap Memory in Code #

In previous lectures, we introduced the principle of off-heap memory recovery; it is worth reviewing. The JDK uses a Cleaner to release and recover DirectByteBuffer memory. Cleaner extends PhantomReference and relies on GC for processing, so the recovery time is unpredictable. For "hasCleaner" DirectByteBuffers, the JDK provides a series of MXBeans for accessing JVM thread, memory, and other monitoring metrics. The code implementation is as follows:

BufferPoolMXBean directBufferPoolMXBean = ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class).get(0);
LOGGER.info("DirectBuffer count: {}, MemoryUsed: {} K", directBufferPoolMXBean.getCount(), directBufferPoolMXBean.getMemoryUsed() / 1024);

For noCleaner DirectByteBuffers in Netty, you can directly read them using PlatformDependent.usedDirectMemory().

Netty’s Built-in Detection Tool #

Netty provides a built-in memory leak detection tool. We can enable it by using the following command:

-Dio.netty.leakDetection.level=paranoid
Netty provides four detection levels:

    1. disabled - disables off-heap memory leak detection
    2. simple - samples about 1% of buffers for leak detection; low overhead, and the default level
    3. advanced - samples about 1% of buffers and additionally provides detailed memory leak reports
    4. paranoid - tracks all off-heap buffer usage and provides detailed leak reports; this is the highest level, comes with a significant performance cost, and is typically used for local debugging and troubleshooting

Netty checks whether a ByteBuf is unreachable and has a reference count greater than 0 to determine the location of a memory leak, and outputs it to the log. You should pay attention to the keyword “LEAK” in the log.

MemoryAnalyzer Memory Analysis #

We can troubleshoot off-heap memory leak issues by using traditional memory dumps. Run the following command:

jmap -dump:format=b,file=heap.dump pid

After dumping the memory stack, import it into the MemoryAnalyzer tool to analyze the suspicious points of memory leaks and ultimately locate the source code. I won’t go into the details of how to use the MemoryAnalyzer tool, as it is something you need to learn and explore on your own. It is an essential skill for every Java programmer.

BTrace #

BTrace is a troubleshooting tool that instruments Java programs through dynamic bytecode injection. It can capture almost any runtime information about a program, and its usage is similar to AOP. To trace the source of off-heap memory allocations for DirectByteBuffer, you can use the following method:

@BTrace
public class TraceDirectAlloc {
    @OnMethod(clazz = "java.nio.Bits", method = "reserveMemory")
    public static void printThreadStack() {
        jstack();
    }
}

Binary Search Method: A Simple Solution to Complex Problems #

Off-heap memory leaks can sometimes be very subtle and not easy to locate and detect. To improve the efficiency of problem troubleshooting, it’s best to be able to reproduce off-heap memory leak issues locally. If you can successfully reproduce the issue locally, you are already halfway to solving it.

You can use a binary search approach by rolling back the code based on recent code changes and then retesting to see if the off-heap memory leak issue can be reproduced. Eventually, you can identify the problematic code commit. Although this approach may seem simple, it can effectively solve problems in many scenarios.

Conclusion #

The techniques mentioned above are important in practical projects and are sufficient for getting started with Netty application development. There are still many more tricks and insights about using Netty that we need to explore in our own practice. Knowledge gained purely from studying is always superficial. To truly understand something, we need to engage and accumulate practical experience. As you accumulate rich experience, whether it is project development or problem troubleshooting, you will become more proficient and skillful.