
35 Fundamentals Overview of C10K and C1000K #

Hello, I’m Ni Pengfei.

In the previous content, we learned the basic principles of Linux networking and the corresponding performance monitoring methods. Let’s briefly review. Linux builds its network protocol stack on the TCP/IP model, which divides complex network functions into four layers: the application layer, transport layer, network layer, and network interface layer. This layering addresses device heterogeneity in the network environment and decouples the complexity of the network protocols.

Based on the TCP/IP model, we also outlined the sending and receiving process of network packets in Linux networking and the corresponding performance metrics. When an application program sends or receives network packets through socket interfaces, these packets need to be processed layer by layer in the protocol stack. We usually use metrics such as bandwidth, throughput, latency, and PPS to measure network performance.

Today, let’s primarily review the classic C10K and C1000K problems to better understand the working principles of Linux networking and further analyze how to achieve support for C10M on a single machine.

Note that the letter “C” in C10K and C1000K stands for “Client”. C10K means solving the problem of a single machine handling 10,000 requests simultaneously (concurrent connections), while C1000K means a single machine supporting the handling of 1 million requests (concurrent connections).

C10K #

The C10K problem was first raised by Dan Kegel in 1999. At that time, servers were still 32-bit systems running Linux 2.2 (later upgraded to 2.4 and 2.6, with 2.6 adding x86_64 support), equipped with only a small amount of memory (2GB) and a gigabit network card.

How to support 10,000 concurrent requests on such a system?

In terms of resources, for a server with 2GB of memory and a gigabit network card, as long as each request takes less than 200KB of memory (2GB / 10,000) and less than 100Kbit/s of network bandwidth (1,000Mbit/s / 10,000), the physical resources are sufficient. The problem therefore lies in the software, especially in the network I/O model.

Speaking of I/O models, I have previously introduced file I/O in the principles of the file system. In fact, the network I/O model is similar. Before C10K, network processing in Linux was done using synchronous blocking mode, which means that each request was allocated a process or thread. This approach works fine when the number of requests is only 100, but when it increases to 10,000, scheduling and context switching among 10,000 processes or threads, as well as the memory they occupy, will become bottlenecks.

Since allocating one thread per request is not suitable, two questions need to be addressed in order to support 10,000 concurrent requests.

First, how to handle multiple requests within a single thread, i.e., respond to multiple network I/O operations within a single thread. Under the previous synchronous blocking mode, a thread could only handle one request, which is no longer applicable here. So, can non-blocking I/O or asynchronous I/O be used to handle multiple network requests?

Second, how to handle client requests more resource-efficiently, i.e., serve these requests with fewer threads. Can we still use the original 100 or fewer threads to serve the current 10,000 requests?

Of course, in reality, the C10K problem has long been solved. Before continuing to learn the following content, you can first think about these two questions on your own. Combining the knowledge you have learned earlier, do you already have some ideas on how to solve them?

Optimizing the I/O Model #

You have probably heard of the asynchronous, non-blocking I/O approach: in network programming it usually takes the form of I/O multiplexing. So what exactly does I/O multiplexing mean?

But before we dive into the details, let me first explain the two types of I/O event notification commonly used with socket file descriptors: level-triggered and edge-triggered.

  • Level-triggered: As long as the file descriptor can perform I/O in a non-blocking manner, a notification will be triggered. In other words, the application can check the status of the file descriptor at any time and then perform I/O operations based on its status.

  • Edge-triggered: A notification is sent only once, when the status of the file descriptor changes (that is, when an I/O event arrives). At that point, the application needs to perform as much I/O as possible and keep going until it can no longer read or write. If the I/O is not performed, or cannot be handled in time for some reason, that notification is not repeated and the remaining data has to wait for the next event.

Now, let’s take a look at the methods of I/O multiplexing. There are actually many ways to implement it. Let’s analyze them one by one.

The first method is to use non-blocking I/O and level-triggered notifications, such as select or poll.

Following the level-triggered principle, select and poll need to find, from the list of file descriptors, which ones are ready for I/O, and then perform the actual network I/O on those. Since the I/O is non-blocking, a single thread can monitor a whole set of socket file descriptors, thereby achieving the goal of handling multiple requests in a single thread.

The biggest advantage of this approach is that it is friendly to the application: the API is very simple.

However, when applications use select and poll, they need to poll the list of file descriptors, which can be time-consuming when there are many requests. Furthermore, select and poll have some limitations.

select uses a fixed-size bitmap (fd_set) to represent the set of file descriptors, so there is a hard limit on the number of descriptors; for example, the default limit on a 32-bit system is 1024. In addition, inside select, checking the socket state is done by polling, so the processing time grows linearly with the number of descriptors, i.e., O(N).

poll improves on select’s representation by using a variable-length array instead of a fixed-size bitmap, which removes the cap on the number of descriptors (although it is still subject to the system’s file descriptor limit). However, an application using poll still has to walk the whole list of file descriptors, so the processing time is again proportional to the number of descriptors, O(N).

In addition, each time an application calls select or poll, it needs to pass the set of file descriptors from user space to kernel space, then have the kernel modify it, and finally pass it back to user space. This back and forth switching between kernel space and user space adds processing overhead.
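To make this concrete, here is a minimal sketch of a level-triggered, poll()-based server loop; serve(), MAX_CONNS, and the buffer size are illustrative names, and error handling is trimmed for brevity:

```c
#include <poll.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_CONNS 10000

int serve(int listen_fd) {
    static struct pollfd fds[MAX_CONNS];
    int nfds = 1;
    fds[0].fd = listen_fd;
    fds[0].events = POLLIN;

    for (;;) {
        /* Block until at least one descriptor is ready (level-triggered). */
        int ready = poll(fds, nfds, -1);
        if (ready < 0) {
            perror("poll");
            return -1;
        }

        /* The application still scans the whole list: O(N) per wakeup. */
        for (int i = 0; i < nfds; i++) {
            if (!(fds[i].revents & POLLIN))
                continue;

            if (fds[i].fd == listen_fd) {
                /* New connection: add it to the monitored set. */
                int conn = accept(listen_fd, NULL, NULL);
                if (conn >= 0 && nfds < MAX_CONNS) {
                    fds[nfds].fd = conn;
                    fds[nfds].events = POLLIN;
                    nfds++;
                }
            } else {
                char buf[4096];
                ssize_t n = read(fds[i].fd, buf, sizeof(buf));
                if (n <= 0) {
                    /* Peer closed or error: recycle this slot. */
                    close(fds[i].fd);
                    fds[i] = fds[nfds - 1];  /* move last entry into this slot */
                    nfds--;
                    i--;                     /* re-examine the moved entry */
                } else {
                    /* ... process the request and write a response ... */
                }
            }
        }
    }
}
```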

Is there a better way to handle this? The answer is yes.

The second method is to use non-blocking I/O and edge-triggered notifications, such as epoll.

Since select and poll have so many issues, further optimizations are needed, and epoll addresses these problems very well.

  • epoll uses a red-black tree inside the kernel to manage the set of file descriptors, so the application no longer has to pass the whole set in and out on every call.

  • epoll works in an event-driven way and only reports the file descriptors on which I/O events have actually occurred, so there is no need to scan the entire set.

However, note that epoll was only added in Linux 2.6 (it also existed in 2.4, but the functionality was incomplete). And because edge-triggering delivers just a single notification when a file descriptor becomes readable or writable, the application needs to perform as much I/O as possible each time and has more corner cases to handle itself.
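To make the difference concrete, here is a minimal sketch of an epoll-based loop using edge-triggered notifications (EPOLLET); run(), MAX_EVENTS, and the buffer size are illustrative names, and error handling is again trimmed:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 1024

static void set_nonblocking(int fd) {
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
}

int run(int listen_fd) {
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Only descriptors with pending events are returned; no full scan. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        if (n < 0) {
            perror("epoll_wait");
            return -1;
        }

        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;

            if (fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);
                if (conn < 0)
                    continue;
                set_nonblocking(conn);
                struct epoll_event cev = { .events = EPOLLIN | EPOLLET,
                                           .data.fd = conn };
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
            } else {
                /* Edge-triggered: keep reading until EAGAIN, otherwise the
                   remaining data will not trigger another notification. */
                char buf[4096];
                for (;;) {
                    ssize_t r = read(fd, buf, sizeof(buf));
                    if (r > 0)
                        continue;               /* ... process buf ... */
                    if (r < 0 && errno == EAGAIN)
                        break;                  /* drained for now */
                    close(fd);                  /* peer closed or real error;
                                                   this also removes it from epoll */
                    break;
                }
            }
        }
    }
}
```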

The third method is using asynchronous I/O (AIO). In the previous content on file system principles, I mentioned the difference between asynchronous and synchronous I/O. Asynchronous I/O allows an application to initiate multiple I/O operations simultaneously without waiting for them to complete. After the I/O operations are completed, the system notifies the application using event notifications such as signals or callback functions. At this point, the application queries the results of the I/O operations.

Asynchronous I/O is also a feature that was only supported starting from Linux 2.6, and it remained incomplete for a long time; for example, the asynchronous I/O library provided by glibc has long been criticized by the community. In addition, because asynchronous I/O does not match our intuitive, sequential way of thinking, it requires careful design and is harder to use correctly.
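For reference, a minimal sketch of the POSIX AIO interface that glibc provides might look like the following; the file name is just a placeholder, and a real application would register a signal or thread callback via aio_sigevent rather than spinning on aio_error():

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.txt", O_RDONLY);   /* "data.txt" is a placeholder */
    if (fd < 0) { perror("open"); return 1; }

    static char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    /* Initiate the read and return immediately, without waiting for it. */
    if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }

    /* ... the application can do other work here while the read is in flight ... */

    /* Check completion; real code would use a callback instead of spinning. */
    while (aio_error(&cb) == EINPROGRESS)
        usleep(1000);

    ssize_t n = aio_return(&cb);
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}
```

On older glibc versions this is linked with -lrt; note that glibc implements it with a user-space thread pool rather than true kernel asynchronous I/O, which is one reason it has drawn criticism.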

Optimization of Work Models #

Once we understand the I/O models, the optimization of request processing becomes more intuitive. After implementing I/O multiplexing, multiple requests can be handled within a single process or thread. Here, we have two different work models.

The first model is the main process and multiple worker subprocesses, which is the most commonly used model. The general workflow for this method is:

  • The main process executes bind() and listen(), followed by creating multiple worker processes.

  • Then, in each worker process, the same socket is handled using accept() or epoll_wait().

For example, the widely-used reverse proxy server Nginx works in this way. It consists of a main process and multiple worker processes. The main process is mainly responsible for initializing the socket and managing the lifecycle of the worker processes. The worker processes are responsible for actual request processing. I have created a diagram to illustrate this relationship.
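As a rough sketch of this work model (my own simplification, not Nginx’s actual implementation), the skeleton might look like the following; the port number and NUM_WORKERS are placeholders, and most error handling is omitted:

```c
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

#define NUM_WORKERS 4   /* illustrative; often one worker per CPU core */

int main(void) {
    /* Main process: create the listening socket with bind() + listen(). */
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { 0 };
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(8080);          /* placeholder port */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }
    listen(listen_fd, SOMAXCONN);

    /* Fork the worker processes; each inherits listen_fd. */
    for (int i = 0; i < NUM_WORKERS; i++) {
        if (fork() == 0) {
            /* Worker: accept and serve requests on the shared socket,
               typically inside an epoll loop as sketched earlier. */
            for (;;) {
                int conn = accept(listen_fd, NULL, NULL);
                if (conn < 0)
                    continue;
                /* ... handle the request ... */
                close(conn);
            }
        }
    }

    /* Main process: manage the worker lifecycle (restart on exit, etc.). */
    for (;;)
        wait(NULL);
}
```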

Note that there is an issue called “thundering herd” when performing accept() and epoll_wait() calls. In other words, when a network I/O event occurs, multiple processes are awakened simultaneously, but only one process actually responds to this event, while the other awakened processes go back to sleep.

  • The “thundering herd” issue for accept() has been resolved in Linux 2.6.

  • Meanwhile, the issue with epoll was resolved with EPOLLEXCLUSIVE in Linux 4.5.

To avoid the “thundering herd” problem, Nginx uses a global lock (accept_mutex): the worker processes compete for this lock, and only the one that acquires it adds the listening socket to its epoll instance, which ensures that only one worker process is woken up.

However, as you may recall from the CPU module, managing, scheduling, and context switching between processes is costly. So why does Nginx, which uses a multi-process model, still perform so well?

One of the main reasons is that these worker processes do not need to be frequently created and destroyed. Instead, they sleep when there are no tasks and wake up when there are tasks. Only when a worker process exits due to certain exceptions does the main process need to create a new process to replace it.

Of course, you can also use threads instead of processes: the main thread is responsible for socket initialization and managing the state of sub-threads, while the sub-threads are responsible for actual request processing. Since thread scheduling and context switching have relatively low costs, you can further place epoll_wait() in the main thread to ensure that each event only wakes up the main thread, while the sub-threads only need to handle subsequent request processing.

The second model is multiple processes listening on the same port. In this model, all processes listen on the same port with the SO_REUSEPORT option enabled, and the kernel load-balances incoming requests among these listening processes. The process is illustrated below.

Since the kernel ensures that only one process is awakened, there is no “thundering herd” issue. For example, Nginx has already supported this model since version 1.9.1.

(Image from Nginx official blog)

However, please note that the SO_REUSEPORT option requires Linux 3.9 or later.
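A minimal sketch of how each worker might create its own listening socket in this model is shown below; the function name is illustrative, and it assumes a Linux 3.9+ system where SO_REUSEPORT is available:

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int make_reuseport_listener(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;

    /* The kernel load-balances new connections across all sockets
       bound to this port with SO_REUSEPORT set. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, SOMAXCONN);
    return fd;
}
```

Each worker process calls this itself, so every process owns its own socket, and the kernel decides which one receives each new connection.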

C1000K #

Based on I/O multiplexing and request processing optimization, the C10K problem can be easily solved. However, with the improvement of server performance brought by Moore’s Law and the popularity of the Internet, it is not difficult to imagine that emerging services will have higher performance requirements.

Soon, the original C10K could no longer meet demand, and C100K and C1000K followed: concurrency grew from 10,000 to 100,000 and even 1,000,000 connections. Going from 10,000 to 100,000 mostly builds on the same C10K ideas (epoll combined with thread pools), plus improvements in CPU, memory, and network interface performance and capacity; in most cases, C100K comes naturally.

So, can C1000K be easily implemented? Actually, it is not that simple.

First of all, in terms of physical resource usage, 1 million requests require a large amount of system resources. For example,

  • Assuming each request requires 16KB of memory, a total of about 15GB of memory is needed.

  • In terms of bandwidth, assuming that only 20% of the connections are active and each active connection needs only 1KB/s of throughput, a total of 1.6Gb/s is required. A gigabit network card clearly cannot carry this, so a 10-gigabit card is needed, or multiple network cards can be bonded to provide even higher throughput.

Secondly, in terms of software resources, a large number of connections also consumes a lot of software resources, such as the number of file descriptors, connection state tracking (conntrack), and the sizes of the network stack buffers (socket send/receive buffers, TCP read/write buffers), and so on.
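As one illustrative piece of this tuning (an assumption about how it might be done from application code, not a complete recipe), a process can raise its own file descriptor limit with setrlimit(); system-wide knobs such as fs.file-max, the conntrack table size, and the socket buffer sizes still need to be adjusted separately via sysctl:

```c
#include <stdio.h>
#include <sys/resource.h>

/* Raise the per-process open-file limit; this needs sufficient privileges
   (or a hard limit that is already high enough). */
int raise_fd_limit(rlim_t n) {
    struct rlimit rl = { .rlim_cur = n, .rlim_max = n };
    if (setrlimit(RLIMIT_NOFILE, &rl) < 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}

/* usage: raise_fd_limit(1048576);  -- roughly one descriptor per connection */
```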

Finally, a large number of requests also means a high interrupt-processing cost. This calls for a range of hardware and software optimizations: multi-queue NICs, interrupt load balancing, CPU binding, RPS/RFS (balancing soft interrupts across multiple CPU cores), and offloading packet processing to the network hardware (TSO/GSO, LRO/GRO, VXLAN offload, and so on).

The solution to C1000K is essentially built on the non-blocking I/O model of epoll. In addition to the I/O model, deep optimization is also required from the application to the Linux kernel, and then to CPU, memory, and network at various levels, especially the use of hardware to offload the large number of functions that were previously handled by software.

C10M #

Obviously, the demand for performance is endless. Furthermore, is it possible to handle 10 million requests simultaneously on a single machine? This is the question of C10M.

In fact, in the C1000K problem, optimizations for various software and hardware have likely reached their limits. Especially after upgrading hardware (such as having sufficient memory, network cards with large enough bandwidth, and more network offloading capabilities), you may find that no matter how much you optimize application programs and various network parameters in the kernel, achieving 10 million concurrent requests is extremely difficult.

Fundamentally, it is because the Linux kernel protocol stack performs too many heavy tasks. Starting from the hard interrupt handling program triggered by the network card interrupt, through the various layers of network protocol processing in the soft interrupt, and finally reaching the application program, this path is too long. As a result, optimization of network packet processing reaches a bottleneck and cannot be further improved.

To solve this problem, the most important thing is to bypass the lengthy path of the kernel protocol stack and directly deliver network packets to the application program for processing. There are two common mechanisms for this: DPDK and XDP.

The first mechanism, DPDK, is the standard for user-space networking. It bypasses the kernel protocol stack and has user-space processes poll the network card to handle packet reception directly.

(Image source: https://blog.selectel.com/introduction-dpdk-architecture-principles/)

You may instinctively think that polling is a symbol of low efficiency. However, let’s ask ourselves, where exactly is polling inefficient? It is mainly inefficient when the query time is much longer than the actual working time. So, let’s look at it from a different perspective. If there are new network packets to process at all times, the advantage of polling becomes obvious. For example:

  • In scenarios with a very high PPS (packets per second), the query time is much shorter than the actual working time, and most of the time is spent processing network packets.

  • By bypassing the kernel protocol stack, the complicated processes of hard interrupts, soft interrupts, and layer-by-layer processing in the Linux network protocol stack are eliminated. Application programs can optimize the processing logic for network packets based on the specific application scenario without having to worry about all the details.

In addition, DPDK also improves packet-processing efficiency through mechanisms such as huge pages, CPU binding, memory alignment, and pipelined concurrency.

The second mechanism, XDP (eXpress Data Path), is a high-performance network data path provided by the Linux kernel. It allows network packets to be processed before entering the kernel protocol stack and can also achieve higher performance. XDP, like the previously used bcc-tools, is implemented based on the eBPF mechanism in the Linux kernel.

The principle of XDP is shown in the following diagram:

(Image source: https://www.iovisor.org/technology/xdp)

You can see that XDP places higher requirements on the kernel: it needs Linux 4.8 or above, and it does not provide a buffer queue. Applications based on XDP are usually specialized network programs, such as IDS (intrusion detection systems), DDoS protection, and the Cilium container network plugin.
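To give a feel for what an XDP program looks like, here is a minimal sketch of my own (not taken from the diagram above) that simply passes every packet on to the normal kernel stack; a real program would parse and filter packets instead:

```c
/* xdp_pass.c: compile with  clang -O2 -g -target bpf -c xdp_pass.c -o xdp_pass.o */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_pass_all(struct xdp_md *ctx)
{
    /* A real program would parse the packet between ctx->data and
       ctx->data_end here and return XDP_DROP (e.g. for DDoS filtering),
       XDP_TX, or XDP_REDIRECT instead. */
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```

The compiled object can then be attached to a network interface, for example with iproute2: ip link set dev eth0 xdp obj xdp_pass.o sec xdp.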

Summary #

Today, I reviewed the classic C10K problem with you and further extended it to the C1000K and C10M problems.

The root cause of the C10K problem lies, on one hand, in the limited resources of the system, and on the other hand, in the synchronous blocking I/O model and the polling socket interface, which restrict the efficiency of network event processing. Epoll, introduced in Linux 2.6, perfectly solves the C10K problem, and current high-performance network solutions are based on epoll.

From C10K to C100K, increasing the system’s physical resources may be all that is required. From C100K to C1000K, however, adding physical resources alone is no longer enough. At this point, optimization is needed at many levels: hardware interrupt handling and network function offloading, kernel network stack tuning (the number of file descriptors, connection state tracking, buffer queues, and so on), and the application’s work model.

To reach C10M, adding physical resources and optimizing the kernel and the application are no longer sufficient. At this point you need XDP to process network packets before they enter the kernel protocol stack, or DPDK to bypass the protocol stack entirely and process packets in user space through polling.

Of course, in most real-world scenarios we do not need to handle millions of concurrent requests on a single machine. Adjusting the system architecture and distributing the requests across multiple servers is usually a simpler and more scalable approach.

Reflections #

Finally, I would like to invite you to share your own understanding of the C10K and C1000K problems. What network-concurrency performance bottlenecks have you encountered, and how did you analyze them? Feel free to combine them with the networking knowledge you have learned today and share your insights.

Please feel free to discuss with me and leave your comments in the comment section. You are also welcome to share this article with your colleagues and friends. Let’s practice in real scenarios and improve through communication.