09 When to Use select, poll, epoll in Redis Event-Driven Framework #

As a client-server database, Redis naturally includes code that implements network communication. As you may know, the typical way to implement network communication is socket programming: creating a socket, listening on a port, handling connection requests, and reading and writing request data. However, the basic socket programming model can only handle requests from one client connection at a time, so when dealing with high-concurrency requests, one common solution is to use multiple threads, with each thread responsible for handling the requests from one client.

However, Redis has only one thread responsible for parsing and handling client requests, so using the basic socket model directly would limit Redis's ability to support high-concurrency client access.

Therefore, to support high-concurrency network communication, Linux, the operating system Redis commonly runs on, provides three IO multiplexing programming models: select, poll, and epoll. When running on Linux, Redis typically uses the epoll model for its network communication.

Now you might be wondering: Why does Redis typically choose the epoll model? What are the differences between these three programming models? And if we were to develop a high-concurrency server processing program on our own, how should we choose which model to use?

In today’s lesson, I will discuss Redis’s choice and design principles regarding high-concurrency network communication programming models. Through this lesson, you will learn the working mechanisms and usage of the select, poll, and epoll models. Understanding these details will not only help you grasp the foundational working principles of Redis’s overall network communication framework, but also teach you how to develop high-concurrency network communication programs.

To understand the advantages of select, poll, and epoll, we need a basis for comparison, which is the basic socket programming model. So next, let’s first understand the basics of the socket programming model, as well as its limitations.

Why doesn’t Redis use the basic Socket programming model? #

Just now we mentioned that when implementing network communication using the Socket model, multiple steps such as creating a Socket, listening on a port, handling connections, and handling read/write requests are required. Now let’s specifically understand the key operations in these steps to help us analyze the shortcomings of the Socket model.

First, when we need to communicate between the server and the client, these three steps can be taken at the server to create a listening socket for client connections:

  1. Call the socket function to create a socket. We usually refer to this socket as an active socket.
  2. Call the bind function to bind the active socket to the IP and listening port of the current server.
  3. Call the listen function to convert the active socket into a listening socket and start listening for client connections.

After completing the above three steps, the server can start accepting client connection requests. To receive client connection requests in a timely manner, we can run a loop that calls the accept function, which is used to accept client connections.

It is important to note that the accept function is a blocking function, which means that if there are no client connection requests, the server’s execution flow will be blocked at the accept function. Once a client connection request arrives, accept will no longer block. Instead, it will handle the connection request, establish a connection with the client, and return the connected socket.

Finally, the server can use the recv or send function to receive and handle read/write requests on the connected socket that was returned earlier, or send data back to the client.

The code below shows this process; take a look:

listenSocket = socket(); // Call the socket system call to create an active socket
bind(listenSocket);  // Bind the address and port
listen(listenSocket); // Convert the default active socket to a passive socket used by the server, i.e., a listening socket
while (1) { // Loop to listen for incoming client connection requests
   connSocket = accept(listenSocket); // Accept client connection
   recv(connSocket); // Read data from the client; can only handle one client at a time
   send(connSocket); // Send data back to the client; can only handle one client at a time
}

However, from the above code, you may notice that although it can achieve communication between the server and the client, each time the accept function is called, it can only handle one client connection. Therefore, if we want to handle requests from multiple concurrent clients, we need to use the multi-threading approach to handle the requests on multiple client connections established through the accept function.

Using this approach, after the accept function returns the connected socket, we need to create a thread and pass the connected socket to the created thread, which will be responsible for the subsequent data read/write on this connection socket. At the same time, the server’s execution flow will call the accept function again, waiting for the next client connection.

The example code below demonstrates using multi-threading to improve the server’s concurrent client handling capabilities:

listenSocket = socket(); // Call the socket system call to create an active socket
bind(listenSocket);  // Bind the address and port
listen(listenSocket); // Convert the default active socket to a passive socket used by the server, i.e., a listening socket
while (1) { // Loop to listen for incoming client connections
   connSocket = accept(listenSocket); // Accept client connection and return connected socket
   pthread_create(processData, connSocket); // Create a new thread to handle the connected socket
}

// Handle read/write requests on the connected socket
processData(connSocket) {
   recv(connSocket); // Read data from the client, can only handle one client at a time
   send(connSocket); // Send data back to the client, can only handle one client at a time
}

Although this method can improve the server's concurrent processing capability, the main execution flow of Redis is handled by a single thread, so it cannot rely on multi-threading to increase concurrency. Therefore, this method does not work for Redis.

So, are there any other methods that can help Redis improve its concurrent client processing capabilities?

This is where the IO multiplexing feature provided by the operating system comes in. In the basic Socket programming model, the accept function can only listen for client connections on a single listening socket, and the recv function can only wait for requests on a single connected socket.

IO multiplexing allows the program to use multiplexing functions to simultaneously monitor requests on multiple sockets. This can include both connection requests on the listening socket and read/write requests on connected sockets. When there is a request on one or more sockets, the multiplexing function will return. At this point, the program can process the requests on these ready sockets, such as reading the request content on the ready connected socket.

Because Linux is the operating system most widely used in practice, in this lesson we will focus on the IO multiplexing mechanisms that Linux provides. Linux offers three main IO multiplexing mechanisms: select, poll, and epoll. Let's look at the design ideas and usage of each of these three mechanisms in turn, and then see why Redis usually chooses epoll to implement network communication.

Using select and poll mechanisms to achieve IO multiplexing #

First, let’s understand the programming model of the select mechanism.

However, before we dive into the details, we need to know what key points we need to grasp when learning about IO multiplexing mechanisms. This can help us quickly understand the connections and differences between different mechanisms. In fact, when we learn about IO multiplexing mechanisms, we need to be able to answer the following questions:

  • First, which events does the multiplexing mechanism listen to on the socket?
  • Second, how many sockets can the multiplexing mechanism listen to?
  • Third, how does the multiplexing mechanism find the ready socket when there is one?

Select mechanism and usage #

The central function in the select mechanism is the select function. Its parameters include the number of file descriptors to monitor (in practice, the highest descriptor value plus one), the three descriptor sets to be monitored (readfds, writefds, and exceptfds), and the timeout for blocking while waiting.

Here is the prototype of the select function:

int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);

Here, it is important to note that Linux assigns each socket a file descriptor, which is a non-negative integer that uniquely identifies the socket. Therefore, the multiplexing functions on Linux usually take file descriptors as parameters. With a file descriptor, a function can find the corresponding socket and perform operations such as monitoring it or reading from and writing to it.

So the parameters readfds, writefds, and exceptfds of the select function represent the sets of descriptors being listened to, which are actually sets of sockets being listened to. So why are there three sets?

This is related to the first question I just mentioned, which is “which events does the multiplexing mechanism listen to?” The select function uses the three sets to represent the three types of events being listened to: read data events (corresponding to the readfds set), write data events (corresponding to the writefds set), and exception events (corresponding to the exceptfds set).

We can further see that the parameters readfds, writefds, and exceptfds of the select function are of the fd_set structure type, whose definition is shown below. In it, __fd_mask is an alias for the long int type, and the macros __FD_SETSIZE and __NFDBITS default to 1024 and 32 respectively (assuming a 32-bit long int).

typedef struct {
   ...
   __fd_mask  __fds_bits[__FD_SETSIZE / __NFDBITS];
   ...
} fd_set;

Therefore, the fd_set structure is essentially an array of long int elements, 32 elements in total (1024/32 = 32). Each element is 32 bits wide (the size of a long int, under the assumptions above), and each bit can be used to represent the state of one file descriptor.

Ok, after understanding the definition of the fd_set structure, we can answer the second question I mentioned earlier: the select function can monitor at most 1024 descriptors in each descriptor set.
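
To make the bit-per-descriptor idea concrete, here is a minimal, self-contained sketch of what the FD_ZERO, FD_SET, and FD_ISSET macros conceptually do. The names my_fd_set, my_fd_zero, my_fd_set_bit, and my_fd_isset are made up for illustration; the real glibc macros differ in their details, so treat this only as a picture of the underlying bit manipulation.

#include <string.h>

#define MY_SETSIZE 1024                  /* plays the role of __FD_SETSIZE */
#define MY_NFDBITS (8 * sizeof(long))    /* bits per array element, like __NFDBITS */

typedef struct {
    long bits[MY_SETSIZE / (8 * sizeof(long))]; /* one bit per file descriptor */
} my_fd_set;

/* clear all bits: no descriptor is being monitored */
void my_fd_zero(my_fd_set *s) { memset(s, 0, sizeof(*s)); }

/* set the bit for descriptor fd: mark it as monitored */
void my_fd_set_bit(int fd, my_fd_set *s) {
    s->bits[fd / MY_NFDBITS] |= 1UL << (fd % MY_NFDBITS);
}

/* test the bit for descriptor fd: has it been marked as monitored/ready? */
int my_fd_isset(int fd, const my_fd_set *s) {
    return (s->bits[fd / MY_NFDBITS] & (1UL << (fd % MY_NFDBITS))) != 0;
}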

Next, let’s understand how to use the select mechanism to achieve network communication.

First, before calling the select function, we create the descriptor set that will be passed to select, and then create the listening socket. In order for select to monitor the listening socket, we need to add its descriptor to the descriptor set.

Then, we can call the select function and pass the created descriptor sets as parameters to the select function. When the select function detects that descriptors are ready, it will end blocking and return the number of ready file descriptors.

At this point, we can search the descriptor set to find out which descriptors are ready. Then, we can handle the ready sockets. For example, if there are descriptors ready in the readfds set, it means there are read events on these ready descriptors’ corresponding sockets, and in this case, we can read data from these sockets.

Because the select function can monitor the state of 1024 file descriptors at a time, the select function may return multiple ready file descriptors at once. In this case, we can use a loop to sequentially perform read, write, or exception handling operations on the ready descriptors’ corresponding sockets.

I have also drawn a diagram showing the basic process of using the select function for network communication; you can take a look.

The following code shows the key steps and main function calls for concurrent client processing using the select function:

int sock_fd, conn_fd; // Variables for the listening socket and the connected socket
sock_fd = socket(); // Create a socket
bind(sock_fd); // Bind the socket
listen(sock_fd); // Listen on the socket and convert it to a listening socket

fd_set rset; // Descriptor set to be listened to, focusing on read events on descriptors

int max_fd = sock_fd;

// Initialize the rset array, set each element to 0 with the FD_ZERO macro
FD_ZERO(&rset);
// Set the descriptor corresponding to sock_fd in the rset array to 1 using the FD_SET macro, indicating that this descriptor needs to be monitored
FD_SET(sock_fd, &rset);

// Set the timeout
struct timeval timeout;
timeout.tv_sec = 3;
timeout.tv_usec = 0;

while (1) {
   // Note: select modifies rset in place to report ready descriptors, so real code would rebuild the set from a saved copy before each call; that step is omitted here for brevity
   // Call the select function to check whether the file descriptors saved in the rset array have read events ready; it returns the number of ready descriptors
   int n = select(max_fd + 1, &rset, NULL, NULL, &timeout);

   // Use the FD_ISSET macro to check whether the file descriptor corresponding to sock_fd in the rset array is ready
   if (FD_ISSET(sock_fd, &rset)) {
       // If sock_fd is ready, it means a client has connected; call the accept function to establish the connection
       conn_fd = accept();
       // Set the descriptor corresponding to conn_fd in the rset array to 1, indicating that this descriptor needs to be monitored
       FD_SET(conn_fd, &rset);
       if (conn_fd > max_fd)
           max_fd = conn_fd; // Keep track of the largest descriptor for the next select call
   }

   // Check the file descriptors of the connected sockets one by one
   for (int i = 0; i <= max_fd; i++) {
       // Use the FD_ISSET macro to check whether the file descriptor is ready in the rset array
       if (i != sock_fd && FD_ISSET(i, &rset)) {
         // Data is ready to be read; perform read data processing
       }
   }
}

However, from the introduction just now, you may find that the select function has two design drawbacks:

  • First, the select function limits the number of file descriptors that a single process can monitor; this limit is determined by __FD_SETSIZE, which defaults to 1024.
  • Second, after the select function returns, we still need to iterate through the descriptor sets to find out which descriptors are ready. This traversal incurs overhead and reduces program performance.

To remove select's limit of 1024 file descriptors, the poll function made improvements.

Mechanism and Usage of poll #

The main function of the poll mechanism is the poll function. Let’s first look at its prototype definition, as shown below:

int poll(struct pollfd *__fds, nfds_t __nfds, int __timeout);

Where the parameter __fds is an array of pollfd structures, the parameter __nfds represents the number of elements in the __fds array, and __timeout represents the timeout for which the poll function blocks.

The pollfd structure contains the descriptor to be listened for and the event types to be listened for on that descriptor. We can see this from the definition of the pollfd structure, as shown below. The pollfd structure contains three member variables: fd, events, and revents, which respectively represent the file descriptor to be listened to, the event types to be listened for, and the actual events that occurred.

struct pollfd {
    int fd;         // the file descriptor to listen to
    short int events;       // the event type to listen for
    short int revents;      // the actual event type that occurred
};

The event types to be listened to and the actual event types that occurred in the pollfd structure are represented by the following three macro definitions: POLLRDNORM, POLLWRNORM, and POLLERR, which respectively represent the readable, writable, and error events.

#define POLLRDNORM  0x040       // readable event
#define POLLWRNORM  0x100       // writable event
#define POLLERR     0x008       // error event

Okay, now that we understand the parameters of the poll function, let’s see how to use the poll function to accomplish network communication. This process can be divided into three steps:

  • Step 1: Create a pollfd array and a listening socket, and bind the listening socket to the server's IP address and port.
  • Step 2: Add the listening socket to the pollfd array and set it to listen for read events, which are the client’s connection requests.
  • Step 3: Loop through the poll function to check if there are any ready file descriptors in the pollfd array.

During the looping process in Step 3, the processing logic can be divided into two cases:

  • If the listening socket is ready, it means a client connection request has arrived. We can call the accept function to accept the connection, create a connected socket, add it to the pollfd array, and listen for read events on it.
  • If the connected socket is ready, this means the client has read/write requests. We can call the recv/send function to handle the read/write requests.

I’ve created the following diagram to illustrate the process of using the poll function, which you can refer to.

Also, to help you understand the use of the poll function in code, I have written a sample code as shown below:

int sock_fd, conn_fd; // variables for the listening socket and the connected socket
sock_fd = socket(); // create a socket
bind(sock_fd);   // bind the socket
listen(sock_fd); // listen on the socket and convert it to a listening socket

// the number of file descriptors that the poll function can listen to, can be greater than 1024
#define MAX_OPEN 2048

// pollfd structure array, corresponding to file descriptors
struct pollfd client[MAX_OPEN];

// add the created listening socket to the pollfd array and listen for its readable event
client[0].fd = sock_fd;
client[0].events = POLLRDNORM;
int maxfd = 0;

// initialize other elements of the client array to -1
for (int i = 1; i < MAX_OPEN; i++)
    client[i].fd = -1;

while (1) {
    // call the poll function to check whether any file descriptors in the client array are ready; it returns the number of ready descriptors
    int n = poll(client, maxfd + 1, -1); // -1 means block until at least one descriptor is ready
    // if the file descriptor of the listening socket has a readable event, then process it
    if (client[0].revents & POLLRDNORM) {
        // there is a client connection; call the accept function to establish the connection
        conn_fd = accept();

        // save the newly established connected socket
        for (int i = 1; i < MAX_OPEN; i++) {
            if (client[i].fd < 0) {
                client[i].fd = conn_fd; // save the file descriptor of the newly established connection in the client array
                client[i].events = POLLRDNORM; // monitor readable events on this descriptor
                if (i > maxfd)
                    maxfd = i; // update the highest array index that poll needs to scan
                break;
            }
        }
    }

    // check the file descriptors of the connected sockets one by one
    for (int i = 1; i < MAX_OPEN; i++) {
        if (client[i].revents & (POLLRDNORM | POLLERR)) {
            // data is readable or an error occurred, process the read data or handle the error
        }
    }
}

In fact, compared to the select function, the improvement of the poll function mainly lies in its ability to listen to more than 1024 file descriptors at a time. However, after calling the poll function, we still need to traverse each file descriptor to check if it is ready and then process it.

So, is there a way to avoid traversing each descriptor? This is what I will introduce to you next, the epoll mechanism.

Using epoll Mechanism for IO Multiplexing #

First of all, the epoll mechanism uses the epoll_event structure to record the file descriptors to be monitored and their corresponding event types. This is similar to how the poll mechanism uses the pollfd structure.

For the epoll_event structure, it contains the epoll_data_t union variable and an integer variable called events. The epoll_data_t union has a member variable called fd that records the file descriptor, and the events variable takes different macro-defined values to indicate the event types the file descriptor is interested in. Some common event types include:

  • EPOLLIN: read event, indicates that the file descriptor corresponding to the socket has data available to read.
  • EPOLLOUT: write event, indicates that the socket corresponding to the file descriptor is ready for writing, i.e., data can be written to it.
  • EPOLLERR: error event, indicates an error occurred on the file descriptor associated with the socket.

The following code shows the definition of the epoll_event structure and the epoll_data union:

typedef union epoll_data
{
  ...
  int fd;  // record the file descriptor
  ...
} epoll_data_t;


struct epoll_event
{
  uint32_t events;  // event types the epoll listens to
  epoll_data_t data; // application data
};

As we saw earlier, when using the select or poll function, we create a file descriptor set or a pollfd array and add the file descriptors we want to monitor to it.

For the epoll mechanism, however, we first need to call the epoll_create function to create an epoll instance. This epoll instance internally maintains two structures: one records the file descriptors to be monitored, and the other records the file descriptors that have become ready. The ready file descriptors are returned to the user program for processing.

Therefore, when using the epoll mechanism, we don’t need to iterate through and check which file descriptors are ready for use, as we do with select and poll. This way, epoll improves the efficiency compared to select and poll.

After creating the epoll instance, we need to use the epoll_ctl function to add the file descriptors to be monitored and their corresponding event types, and use the epoll_wait function to obtain the ready file descriptors.

I’ve created a diagram showing the process of network communication using epoll. You can take a look:

The following code demonstrates the process of using the epoll function:

int sock_fd, conn_fd; // variables for the listening socket and connected socket
sock_fd = socket(); // create a socket
bind(sock_fd);   // bind the socket
listen(sock_fd); // listen on the socket, turning it into a listening socket

int epfd = epoll_create(EPOLL_SIZE); // create an epoll instance; EPOLL_SIZE is a constant defined elsewhere
// create an array of epoll_event structures to receive the ready file descriptors and their event types
struct epoll_event *ep_events = (struct epoll_event *)malloc(sizeof(struct epoll_event) * EPOLL_SIZE);

// create an epoll_event variable
struct epoll_event ee;
// listen for read events
ee.events = EPOLLIN;
// the file descriptor to listen to is the newly created listening socket
ee.data.fd = sock_fd;

// add the listening socket to the monitored list
epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ee);

while (1) {
   // wait for ready descriptors to be returned
   int n = epoll_wait(epfd, ep_events, EPOLL_SIZE, -1); // -1 means block until at least one descriptor is ready
   // iterate through all ready descriptors
   for (int i = 0; i < n; i++) {
       // if the listening socket descriptor is ready, it means a new client connection has arrived
       if (ep_events[i].data.fd == sock_fd) {
          conn_fd = accept(sock_fd); // call accept() to establish the connection
          ee.events = EPOLLIN;
          ee.data.fd = conn_fd;
          // add the newly created connected socket descriptor to be monitored for subsequent read events
          epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fd, &ee);

       } else { // if the connected socket descriptor is ready, we can read data
           ... // read data and handle it
       }
   }
}

Alright, now you know how to use the epoll functions. In fact, it is precisely because epoll allows customizing the number of monitored descriptors and returns the ready descriptors directly that Redis, when designing and implementing its network communication framework, encapsulates epoll_create, epoll_ctl, epoll_wait, and the read/write events on top of the epoll mechanism. This is how Redis handles concurrent client access efficiently even though it runs in a single thread.
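
To give you a sense of what this encapsulation looks like, below is a heavily simplified sketch modeled on the shape of Redis's ae_epoll.c. The names aeApiState, aeApiCreate, aeApiAddEvent, and aeApiPoll follow the real source file, but the signatures and bodies here are abbreviated illustrations under that assumption, not the actual Redis implementation.

#include <stdlib.h>
#include <sys/epoll.h>

/* simplified sketch in the spirit of ae_epoll.c: a thin wrapper around epoll */
typedef struct aeApiState {
    int epfd;                    /* epoll instance descriptor returned by epoll_create */
    struct epoll_event *events;  /* buffer that epoll_wait fills with ready events */
} aeApiState;

static int aeApiCreate(aeApiState *state, int setsize) {
    state->events = malloc(sizeof(struct epoll_event) * setsize);
    state->epfd = epoll_create(1024); /* the size argument is only a hint on modern kernels */
    return (state->events != NULL && state->epfd != -1) ? 0 : -1;
}

static int aeApiAddEvent(aeApiState *state, int fd, int readable) {
    struct epoll_event ee = {0};
    ee.events = readable ? EPOLLIN : EPOLLOUT; /* map the framework's event type to epoll flags */
    ee.data.fd = fd;
    return epoll_ctl(state->epfd, EPOLL_CTL_ADD, fd, &ee);
}

static int aeApiPoll(aeApiState *state, int setsize, int timeout_ms) {
    /* returns the number of ready descriptors; the event loop then dispatches their read/write handlers */
    return epoll_wait(state->epfd, state->events, setsize, timeout_ms);
}

In each iteration of its event loop, the framework calls the polling wrapper and then invokes the read/write handlers registered for the descriptors it returns.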

Summary #

In today’s lesson, I introduced you to the operating system’s underlying mechanism for network communication in Redis, which is the IO multiplexing mechanism.

As Redis is a single-threaded program, using the basic Socket programming model would only let it wait on a single socket at a time, either the listening socket or one connected socket. When a Redis instance faces many concurrent clients, this approach becomes inefficient.

Therefore, compared to basic Socket communication, using the IO multiplexing mechanism allows us to obtain multiple ready sockets at once, avoiding the overhead of checking each socket individually.

In this lesson, I used the most commonly used Linux operating system as an example to introduce you to the three IO multiplexing mechanisms provided by the Linux system: select, poll, and epoll. These three mechanisms differ in terms of the number of descriptors they can listen to and the method of finding ready descriptors. You can refer to the following diagram to understand their differences. These differences also determine that epoll is more efficient and widely used compared to select and poll.
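
In brief, the comparison comes down to the following points, all covered in this lesson:

  • select: monitors descriptors through fixed-size fd_set sets, is limited to 1024 descriptors per set, and requires traversing the sets after it returns to find the ready descriptors.
  • poll: monitors descriptors through a pollfd array whose size you choose (so it is not bound by the 1024 limit), but still requires traversing the array to find the ready descriptors.
  • epoll: monitors descriptors through an epoll instance whose capacity you choose, and epoll_wait returns the ready descriptors directly, so no traversal of all monitored descriptors is needed.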

Finally, I want to mention that although I did not introduce you to the Redis source code in this lesson, learning the mechanism and workflow of IO multiplexing is actually the foundation of understanding the Redis event-driven framework. The ae_select.c and ae_epoll.c files in Redis implement IO multiplexing using the select and epoll mechanisms respectively. In the upcoming lessons 10 and 11, I will introduce how the Redis event-driven framework is developed and run based on epoll, as well as the event types and handling methods in the Redis event-driven framework. With this, you will have a comprehensive understanding of the underlying support, framework operation, and event types and handling in the Redis event-driven framework.

One Question per Lesson #

In the Redis event-driven framework code, both the select and epoll mechanisms, which are available on the Linux system, are used. Do you know why Redis did not use the poll mechanism?