
# 11 Q&A Session: Deep Dive into the Optimization Principles of NIO

Hello, I’m Liu Chao. The column has been online for more than 20 days now. First of all, I would like to thank everyone for their active comments. The process of interaction has also been rewarding for me.

After reviewing the recent comments, I decided to create the first lesson of the Q&A class. I will continue to explain I/O optimization, supplement the content mentioned in Lesson 08, and extend some knowledge points related to I/O. I will also share more practical scenarios. Without further ado, let’s get right into it.

A Tomcat tuning point that comes up frequently is changing the I/O model used by its threads. Prior to version 8.5, Tomcat used BIO (blocking I/O) as its default I/O model. In high-load, high-concurrency scenarios, switching to the NIO (non-blocking I/O) model can improve the system's network communication performance.

A performance test shows how BIO and NIO communication compare under high-load, high-concurrency scenarios (here, page requests are used to simulate a large number of I/O read/write requests):

img

img

Test results: In situations where Tomcat has a large number of I/O read/write operations, using the NIO thread model brings obvious advantages.

This seemingly simple Tomcat configuration actually touches on many optimization topics. Next, we will analyze in depth how communication frameworks such as Tomcat and Netty improve system performance through I/O optimization, starting from the underlying network I/O model and moving on to memory-copy optimization and thread-model optimization.

## Optimization of Network I/O Models

In network communication, the lowest level is the network I/O model in the kernel. With the development of technology, the network model in the operating system kernel has evolved into five I/O models. “Unix Network Programming” categorizes these five I/O models as blocking I/O, non-blocking I/O, I/O multiplexing, signal-driven I/O, and asynchronous I/O. The emergence of each I/O model is an optimization and upgrade based on the previous I/O model.

The original blocking I/O requires a dedicated user thread for each connection; whenever an I/O operation is not ready or not complete, that thread is suspended and waits in a blocked state. This blocking is the root cause of the performance bottleneck.

So where does the blocking occur in socket communication?

In “Unix Network Programming,” socket communication can be divided into stream sockets (TCP) and datagram sockets (UDP). TCP connections are the most commonly used. Let’s take a look at the working process of a TCP server (assuming the simplest TCP data transfer here):

img

  • First, the application program creates a socket through the system call socket; the socket is a file descriptor allocated to the application by the system.
  • Then, the application program binds an address and port number to the socket through the system call bind, giving the socket a name.
  • Next, the application asks the system, through the system call listen, to create a queue that stores incoming client connections.
  • Finally, the server accepts incoming client connection requests through the system call accept.

When a client connects to the server, the server creates a child process through the system call fork. The child process reads the data sent by the client through the system call read and returns a response to the client through write.
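For reference, here is a minimal sketch of the same bind/listen/accept/read/write workflow in blocking Java socket code. The port number and buffer size are arbitrary illustration values, and Java uses one thread per connection where the workflow above forks a child process:

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class BlockingEchoServer {
    public static void main(String[] args) throws Exception {
        // socket + bind + listen: ServerSocket wraps all three system calls
        ServerSocket serverSocket = new ServerSocket(9000);
        while (true) {
            // accept: blocks until a client connects
            Socket socket = serverSocket.accept();
            // one thread per connection (the Java counterpart of fork in the workflow above)
            new Thread(() -> handle(socket)).start();
        }
    }

    private static void handle(Socket socket) {
        try (Socket s = socket;
             InputStream in = s.getInputStream();
             OutputStream out = s.getOutputStream()) {
            byte[] buffer = new byte[1024];
            int len;
            // read: blocks until the client writes data (or closes the connection)
            while ((len = in.read(buffer)) != -1) {
                out.write(buffer, 0, len); // write the data back to the client
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```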

### 1. Blocking I/O

In the entire socket communication workflow, the socket is in a blocking state by default. That is, when a socket call that cannot be completed immediately is made, the process is blocked, suspended by the system, and enters a sleep state, waiting for a corresponding operation response. From the above diagram, we can see that there may be three types of blocking.

Blocking connect: When a client initiates a TCP connection request through the system call connect, establishing the connection requires a three-way handshake: the client must wait for the SYN+ACK sent back by the server, and the server must in turn wait for the client's ACK confirming the connection. This means each TCP connect blocks until the connection is confirmed.

img

Blocking accept: When a blocking socket communication server receives an incoming connection, it calls the accept function. If no new connections arrive, the calling process will be suspended and enter a blocking state.

img

Blocking read and write: After a socket connection is successfully created, the server creates a child process using the fork function and calls the read function to wait for the client’s data to be written. If there is no data to be written, the child process is suspended and enters a blocking state.

img

### 2. Non-blocking I/O

Using the fcntl function, these three operations can be made non-blocking. If an operation cannot complete immediately, it returns an EWOULDBLOCK or EAGAIN error instead of leaving the process blocked indefinitely.

When we set these operations to non-blocking, we need a thread that polls and checks their status. This is the most traditional non-blocking I/O model.
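In Java NIO, the analogous switch is configureBlocking(false): an accept or read that cannot complete immediately returns null or 0 instead of blocking, so a thread has to keep polling the channels itself. A minimal sketch of this traditional polling model (the port number is an arbitrary example):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PollingNonBlockingServer {
    public static void main(String[] args) throws Exception {
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);          // accept() now returns null instead of blocking

        List<SocketChannel> clients = new ArrayList<>();
        ByteBuffer buffer = ByteBuffer.allocate(1024);

        while (true) {                            // the polling thread checks every channel itself
            SocketChannel client = server.accept();
            if (client != null) {
                client.configureBlocking(false);  // read() now returns 0 instead of blocking
                clients.add(client);
            }
            Iterator<SocketChannel> it = clients.iterator();
            while (it.hasNext()) {
                SocketChannel c = it.next();
                buffer.clear();
                int n = c.read(buffer);           // 0 means "no data yet", keep polling
                if (n > 0) {
                    buffer.flip();
                    c.write(buffer);              // echo back what was read
                } else if (n == -1) {
                    c.close();                    // client closed the connection
                    it.remove();
                }
            }
        }
    }
}
```

The busy loop is exactly what makes this model expensive: the thread burns CPU checking channels that have nothing to read, which is the problem I/O multiplexing addresses next.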

img

### 3. I/O Multiplexing

If we use a user thread to poll and check the status of an I/O operation, in the case of a large number of requests, this will undoubtedly be a disaster for CPU utilization. Is there another way to implement non-blocking I/O sockets?

Linux provides the I/O multiplexing functions select, poll, and epoll. A process passes one or more file descriptors to one of these system calls and blocks on that call instead; the system kernel then detects, on the process's behalf, whether any of those read operations are ready.

select function: Its purpose is to monitor, within a timeout period, the occurrence of readable, writable, and exceptional events on the file descriptors the user cares about. Linux treats all external devices as files: operations on a file go through system calls provided by the kernel and are identified by a file descriptor (fd).

int select(int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset, const struct timeval *timeout);

In the code above, the file descriptors monitored by the select function fall into three categories: readset (descriptors watched for readability), writeset (descriptors watched for writability), and exceptset (descriptors watched for exceptional conditions).

After calling the select function, it will block until a descriptor is ready or a timeout occurs, and then the function returns. After the select function returns, you can use the FD_ISSET function to traverse the fd_set to find the ready descriptors. fd_set can be understood as a set that contains file descriptors, and the following four macros can be used to set it:

void FD_ZERO(fd_set *fdset);          // Clears the set
void FD_SET(int fd, fd_set *fdset);   // Adds a given file descriptor to the set
void FD_CLR(int fd, fd_set *fdset);   // Removes a given file descriptor from the set
int  FD_ISSET(int fd, fd_set *fdset); // Checks whether a given file descriptor is in the set (i.e., ready)

img

poll() function: Before each call to select(), the system has to copy the fd collection from user space into kernel space, which adds a certain amount of overhead. In addition, a single process can monitor at most 1024 fds by default; this limit can be raised by modifying macros or even recompiling the kernel, but because fd_set is implemented as an array, efficiency drops when large numbers of fds have to be added or removed.

The mechanism of poll() is similar to that of select(). Both manage multiple descriptors through polling and process them based on the status of the descriptors. However, with poll(), there is no limit on the maximum number of file descriptors.

poll() and select() share a common drawback: the whole array of file descriptors is copied between user space and kernel space on every call, and this overhead grows linearly with the number of file descriptors, regardless of whether those descriptors are ready.

img

epoll() function: select/poll scan fds sequentially and support only a limited number of fds, which restricts their use.

Linux has provided the epoll call since kernel version 2.6. epoll uses an event-driven approach instead of polling to scan fds: a file descriptor is registered in advance through epoll_ctl() and stored in an event table inside the kernel, which is implemented as a red-black tree. In scenarios with a large number of I/O requests, inserting and deleting descriptors therefore performs better than the array-based fd_set used by select/poll, so epoll achieves better performance and is not limited by the number of fds.

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

In the code above, epfd is the epoll-specific file descriptor created by epoll_create(), op is the type of operation, fd is the file descriptor to associate, and event specifies the event types to listen for.

Once a file descriptor becomes ready, the kernel activates it through a callback mechanism; when the process calls epoll_wait(), it is notified of the ready descriptors and then completes the related I/O operations.

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

img

### 4. Signal-Driven I/O

Signal-Driven I/O is similar to the Observer pattern, where the kernel acts as an observer and the signal callback serves as the notification. When a user process initiates an I/O request operation, it registers a signal callback for the corresponding socket using the sigaction system call. At this point, the user process is not blocked and continues to work. When the kernel data is ready, the kernel generates a SIGIO signal for that process and notifies it through the signal callback to perform the relevant I/O operation.

![img](../images/4c897ce649d3deb609a4f0a9ba7aee1f.png)

Compared to the first three I/O models, Signal-Driven I/O achieves better performance by allowing the process to continue working while waiting for data to be ready and not blocking the main loop.

However, Signal-Driven I/O is rarely used for TCP communications because the SIGIO signal is a Unix signal without additional information. If a signal source has multiple causes for generating signals, the signal receiver cannot determine what has actually happened. TCP sockets can produce up to seven different signal events, so when an application receives a SIGIO signal, it cannot differentiate how to handle it.

But Signal-Driven I/O is used in UDP communication. As we can see from the diagram of UDP communication in Lecture 10, UDP only has one data request event. This means, under normal circumstances, a UDP process only needs to capture the SIGIO signal and call recvfrom to read the arriving datagram. If an exception occurs, it returns an error. For example, the NTP server applies this model.

### 5. Asynchronous I/O

Although Signal-Driven I/O does not block the process while waiting for data to be ready, the I/O operation performed after being notified is still blocking. The process has to wait for the data to be copied from the kernel space to the user space. Asynchronous I/O, on the other hand, achieves true non-blocking I/O.

When a user process initiates an I/O request, it tells the kernel to start the operation and to notify it only after the entire operation has completed, including waiting for the data to be ready and copying it from kernel space into user space. Because the programming model is complex, debugging is difficult, and few operating systems support it (Linux does not yet support true asynchronous I/O, while Windows does), asynchronous I/O models are rarely used in practical production environments.

![img](../images/dd4b03afb56a3b7660794ce11fc421c0.png)
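At the Java level, the closest counterpart to this model is the NIO.2 asynchronous channel API introduced in JDK 7: the program hands over a buffer and a CompletionHandler, and the handler is invoked only after the data has already been copied into the buffer. A minimal sketch (the port number is an arbitrary example, and how the JDK implements this underneath differs by operating system):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousServerSocketChannel;
import java.nio.channels.AsynchronousSocketChannel;
import java.nio.channels.CompletionHandler;
import java.util.concurrent.CountDownLatch;

public class AsyncEchoServer {
    public static void main(String[] args) throws Exception {
        AsynchronousServerSocketChannel server =
                AsynchronousServerSocketChannel.open().bind(new InetSocketAddress(9000));

        server.accept(null, new CompletionHandler<AsynchronousSocketChannel, Void>() {
            @Override
            public void completed(AsynchronousSocketChannel client, Void att) {
                server.accept(null, this);                 // keep accepting further connections
                ByteBuffer buffer = ByteBuffer.allocate(1024);
                client.read(buffer, buffer, new CompletionHandler<Integer, ByteBuffer>() {
                    @Override
                    public void completed(Integer bytesRead, ByteBuffer buf) {
                        if (bytesRead > 0) {
                            buf.flip();
                            client.write(buf);             // the data is already in the buffer here
                        }
                    }
                    @Override
                    public void failed(Throwable exc, ByteBuffer buf) {
                        exc.printStackTrace();
                    }
                });
            }
            @Override
            public void failed(Throwable exc, Void att) {
                exc.printStackTrace();
            }
        });

        new CountDownLatch(1).await();                     // keep the main thread alive
    }
}
```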

In Lecture 08, I talked about using NIO and its Selector multiplexer to implement non-blocking I/O. Of these five models, Selector is based on the I/O multiplexing model: the Java Selector class is a wrapper around select/poll/epoll.

As mentioned in the TCP communication process earlier, the blocking operations in socket communication, such as connect, accept, read, and write, correspond to the four listening events of SelectionKey in Selector: OP_ACCEPT, OP_CONNECT, OP_READ, and OP_WRITE.

![img](../images/85bcacec92e74c5cb6d6a39669e0d896.png)

In NIO server communication programming, a Channel is created to listen for client connections, a multiplexer (Selector) is created, and the Channel is registered with the Selector. The program then polls the Channels registered with the Selector through the Selector; when one or more Channels are ready, the Selector returns the ready events, and the program matches on each event and performs the corresponding I/O operation.

![img](../images/e27534c5a157d0908d51b806919b1515.jpg)
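A minimal sketch of the workflow just described, using the JDK Selector API (the port number and buffer size are arbitrary examples):

```java
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class SelectorEchoServer {
    public static void main(String[] args) throws Exception {
        // Create the Channel that listens for client connections
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(9000));
        server.configureBlocking(false);

        // Create the multiplexer and register the Channel for OP_ACCEPT events
        Selector selector = Selector.open();
        server.register(selector, SelectionKey.OP_ACCEPT);

        while (true) {
            selector.select();                         // blocks until at least one Channel is ready
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isAcceptable()) {              // OP_ACCEPT: a new client connection
                    SocketChannel client = ((ServerSocketChannel) key.channel()).accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {         // OP_READ: data is ready to be read
                    SocketChannel client = (SocketChannel) key.channel();
                    ByteBuffer buffer = ByteBuffer.allocate(1024);
                    int n = client.read(buffer);
                    if (n > 0) {
                        buffer.flip();
                        client.write(buffer);          // echo the data back
                    } else if (n == -1) {
                        client.close();
                    }
                }
            }
        }
    }
}
```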

When a Selector is created, the program chooses which I/O multiplexing function to use based on the operating system and kernel version. Starting with JDK 1.5, if the program runs on Linux and the kernel version is 2.6 or above, NIO chooses epoll instead of the traditional select/poll, which greatly improves NIO communication performance.

Due to the lack of support for TCP communication in Signal-Driven I/O and the immature application of asynchronous I/O in the Linux operating system kernel, most frameworks still implement network communication based on I/O multiplexing models.
## Zero Copy

In the I/O multiplexing model, performing read and write I/O operations is still blocking. During read and write I/O operations, there are multiple memory copies and context switches, which adds performance overhead to the system.

Zero copy is a technique used to avoid multiple memory copies and optimize read and write I/O operations.

In network programming, read and write are usually used to perform I/O operations. Each I/O operation requires four memory copies: I/O device -> kernel space -> user space -> kernel space -> other I/O device.

The mmap function in the Linux kernel can replace the read and write I/O operations, allowing the user space and kernel space to share a cache of data. mmap maps a block of memory address in the user space and a block of memory address in the kernel space to the same physical memory address. Whether it is in the user space or the kernel space, it is a virtual address that ultimately needs to be mapped to a physical memory address. This approach avoids data exchange between the kernel space and the user space. The epoll function in I/O multiplexing uses mmap to reduce memory copies.

In Java NIO programming, Direct Buffer is used to achieve zero copy of memory. Java directly allocates a physical memory space outside the JVM memory space, so that both the kernel and the user process can share a cache of data. This has been explained in detail in Lecture 08 and you can review it again.
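For illustration, here is a short Java NIO sketch of both ideas: FileChannel.map provides an mmap-style mapping, ByteBuffer.allocateDirect allocates an off-heap Direct Buffer, and FileChannel.transferTo asks the kernel to move file data to a socket without passing through user space (the file name, host, and port are arbitrary examples):

```java
import java.io.RandomAccessFile;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopyExamples {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("data.txt", "r");
             FileChannel fileChannel = file.getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {

            // Direct Buffer: off-heap memory, so the kernel fills it without an extra on-heap copy
            ByteBuffer direct = ByteBuffer.allocateDirect(4096);
            fileChannel.read(direct, 0);

            // mmap-style mapping: user space and kernel space share the same physical pages
            MappedByteBuffer mapped = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());
            mapped.load();

            // transferTo: the kernel moves file data to the socket directly (sendfile on Linux);
            // for large files this may need to be called in a loop until all bytes are transferred
            fileChannel.transferTo(0, fileChannel.size(), socket);
        }
    }
}
```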
## Thread Model Optimization

In addition to the kernel's optimization of the network I/O model, NIO has also been optimized and upgraded at the user level. NIO is implemented based on an event-driven model for I/O operations. The Reactor model is a common model for synchronous I/O event handling. Its core idea is to register I/O events with a multiplexer. Once an I/O event is triggered, the multiplexer will distribute the event to an event handler and execute the ready I/O event operation. **This model has the following three main components:**

  * Event Acceptor: primarily responsible for accepting connection requests;
  * Event Reactor: once the Acceptor has accepted a request, the established connection is registered with the Reactor, which continuously monitors the multiplexer (Selector); when an event is detected, the Reactor dispatches it to an event handler;
  * Event Handlers: event handlers perform the actual event processing, such as read and write I/O operations.



### 1. Single-threaded Reactor Thread Model

At the beginning, NIO was implemented based on a single thread, and all I/O operations were performed on a single NIO thread. Since NIO is non-blocking I/O, theoretically, a single thread can complete all I/O operations.

However, NIO has not really achieved non-blocking I/O operations, because the user process is still in a blocked state during read and write I/O operations. This approach will face performance bottlenecks in high-load and high-concurrency scenarios. If a single NIO thread simultaneously handles I/O operations for tens of thousands of connections, the system will not be able to support such a level of requests.

![img](../images/29b117cb60bcc8edd6d1fa21981fb9f8.png)

### 2. Multi-threaded Reactor Thread Model

To solve the performance bottleneck of the single-threaded NIO in high-load and high-concurrency scenarios, a thread pool was later used.

In Tomcat and Netty, an Acceptor thread listens for connection requests. Once a connection is established, it is registered with the multiplexer, and any detected events are handed over to the Worker thread pool for processing. In most cases this thread model meets performance requirements, but when the number of client connections grows by another order of magnitude, a single Acceptor thread can become a performance bottleneck.

![img](../images/0de0c467036f2973143a620448068a82.png)
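A hedged sketch of the dispatch step only: a single Reactor thread runs the Selector loop and hands ready read events to a Worker thread pool. The class and method names and the pool size are illustrative, not Tomcat's or Netty's actual implementation:

```java
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReactorWithWorkerPool {
    private final ExecutorService workers = Executors.newFixedThreadPool(8); // Worker thread pool

    // Runs on the single Reactor thread: detect ready events, then hand the work off
    void dispatchLoop(Selector selector) throws Exception {
        while (true) {
            selector.select();
            Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
            while (keys.hasNext()) {
                SelectionKey key = keys.next();
                keys.remove();
                if (key.isReadable()) {
                    // temporarily drop read interest so the same event is not selected again
                    key.interestOps(key.interestOps() & ~SelectionKey.OP_READ);
                    workers.submit(() -> handleRead(key));  // I/O processing moves to a Worker thread
                }
            }
        }
    }

    // Runs on a Worker thread: perform the actual read and (placeholder) business processing
    private void handleRead(SelectionKey key) {
        SocketChannel client = (SocketChannel) key.channel();
        ByteBuffer buffer = ByteBuffer.allocate(1024);
        try {
            int n = client.read(buffer);
            if (n > 0) {
                buffer.flip();
                client.write(buffer);                       // echo back as placeholder logic
            } else if (n == -1) {
                client.close();
                return;
            }
            key.interestOps(key.interestOps() | SelectionKey.OP_READ); // re-enable read interest
            key.selector().wakeup();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```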

### 3. Master-Slave Reactor Thread Model

Currently, mainstream NIO communication frameworks are implemented based on the Master-Slave Reactor thread model. In this model, the Acceptor is no longer a single NIO thread but a thread pool. After the Acceptor accepts a client's TCP connection request, subsequent I/O operations are handed over to the Worker I/O threads.

![img](../images/f9d03620ae5c7c82c83f522710a62a0a.png)
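Netty makes this split explicit: a small boss EventLoopGroup plays the role of the main Reactor that accepts connections, while a worker EventLoopGroup plays the sub-Reactors that handle I/O on the established connections. A minimal sketch (the port number and the inline echo handler are illustrative):

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;
import io.netty.channel.ChannelInitializer;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;

public class MasterSlaveReactorServer {
    public static void main(String[] args) throws Exception {
        EventLoopGroup bossGroup = new NioEventLoopGroup(1);   // main Reactor: accepts connections
        EventLoopGroup workerGroup = new NioEventLoopGroup();  // sub-Reactors: handle connection I/O
        try {
            ServerBootstrap bootstrap = new ServerBootstrap();
            bootstrap.group(bossGroup, workerGroup)
                     .channel(NioServerSocketChannel.class)
                     .childHandler(new ChannelInitializer<SocketChannel>() {
                         @Override
                         protected void initChannel(SocketChannel ch) {
                             ch.pipeline().addLast(new ChannelInboundHandlerAdapter() {
                                 @Override
                                 public void channelRead(ChannelHandlerContext ctx, Object msg) {
                                     ctx.writeAndFlush(msg);   // echo the received bytes back
                                 }
                             });
                         }
                     });
            ChannelFuture future = bootstrap.bind(9000).sync();
            future.channel().closeFuture().sync();
        } finally {
            bossGroup.shutdownGracefully();
            workerGroup.shutdownGracefully();
        }
    }
}
```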

### Tuning Tomcat Based on Thread Model

In Tomcat, the BIO and NIO models are implemented based on the Master-Slave Reactor thread model.

**In BIO,** the Acceptor in Tomcat is only responsible for listening for new connections. Once a connection is established and an I/O operation is detected, it will be handed over to the Worker thread for I/O read and write operations.

**In NIO,** Tomcat introduces a Poller thread pool. After the Acceptor accepts a connection, instead of handing the request directly to a Worker thread, it first places it in the Poller's buffer queue. Each Poller holds a Selector object; the Poller takes connections from the queue and registers them with its Selector, then iterates over the Selector to find ready I/O operations and hands the corresponding requests to threads in the Worker thread pool.

![img](../images/136315be52782bd88056fb28f3ec60ba.png)

You can set the following parameters to configure the Acceptor thread pool and Worker thread pool.

**acceptorThreadCount:** This parameter sets the number of Acceptor threads. When the volume of incoming client connection requests is extremely large, you can increase this thread count appropriately to improve the ability to accept connections. The default value is 1.

**maxThreads:** This parameter represents the number of Worker threads dedicated to I/O operations. The default value is 200. You can adjust this parameter based on the actual environment, but a larger value is not necessarily better.

**acceptCount:** In Tomcat, the Acceptor thread is responsible for fetching the connection from the accept queue and then handing it over to the working thread to execute relevant operations. The acceptCount refers to the size of the accept queue.

When HTTP keep-alive is disabled and the concurrency is relatively high, you can increase this value appropriately. However, when HTTP keep-alive is enabled, because the number of Worker threads is limited, the Worker threads may be occupied for a long time, and connections in the accept queue may time out waiting. If the accept queue is too large, it may waste connections.

**maxConnections:** This parameter represents the number of socket connections to Tomcat. In the BIO mode, one thread can only handle one connection, so generally, maxConnections is set to the same value as maxThreads. In the NIO mode, one thread can handle multiple connections simultaneously, so maxConnections should be set much larger than maxThreads. The default value is 10000.
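These attributes are normally set on the <Connector> element in conf/server.xml. As a hedged illustration, the same attributes can also be set through embedded Tomcat's Java API; the port and the values below are arbitrary examples, not recommendations:

```java
import org.apache.catalina.connector.Connector;
import org.apache.catalina.startup.Tomcat;

public class TunedTomcat {
    public static void main(String[] args) throws Exception {
        Tomcat tomcat = new Tomcat();

        // Use the NIO connector explicitly (NIO is also the default protocol in Tomcat 8.5+)
        Connector connector = new Connector("org.apache.coyote.http11.Http11NioProtocol");
        connector.setPort(8080);
        connector.setProperty("maxThreads", "300");          // Worker threads for request processing
        connector.setProperty("acceptCount", "200");         // size of the accept queue
        connector.setProperty("maxConnections", "10000");    // max socket connections the connector holds
        connector.setProperty("acceptorThreadCount", "2");   // Acceptor threads listening for connections

        tomcat.getService().addConnector(connector);
        tomcat.start();
        tomcat.getServer().await();
    }
}
```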