19 How to Use Coroutines to Optimize Multithreaded Businesses


Hello, I am Liu Chao.

In the past couple of years, many internet companies in China have started to use or transition to Go language. One important reason for this is the superior performance of Go language, which is closely related to the lightweight thread implementation in Go called Goroutines (coroutines). So, what are the differences between the implementation of Go coroutines and Java threads?

Thread Implementation Models #

Before understanding the difference between coroutines and threads, it’s a good idea to first understand the various ways thread implementation is achieved at the underlying level to lay a foundation for further learning.

There are three main thread implementation models: the 1:1 model, which maps each lightweight process to one kernel thread; the N:1 model, which maps many user threads onto a single kernel thread; and the N:M model, which combines user threads with lightweight processes.

1:1 Thread Model #

The kernel threads (Kernel-Level Thread, KLT) I mentioned earlier are threads supported by the operating system kernel. The kernel schedules threads through a scheduler and is responsible for thread switching.

In Linux system programming, it is common to create a child process using the fork() system call. When a process calls fork(), the system allocates resources to the new process, such as space for storing data and code, and then copies all of the original process's values into it; only a few values (such as the PID) differ from the original. This effectively duplicates the parent process.

Using fork() to create child processes for parallel execution generates a large amount of redundant data, occupying a significant amount of memory space and consuming a lot of CPU time for initializing memory space and copying data.

If the data is the same, why not share this data with the main process? This is where lightweight processes (Light Weight Process, LWP) come in.

Unlike processes created with fork(), lightweight processes are created with the clone() system call. clone() copies only selected parts of the parent process's resource data structures, and which parts to copy is configurable; any resources not copied are shared with the child through pointers. Lightweight processes are therefore smaller execution units that are faster to create and switch. Each LWP maps one-to-one to a kernel thread.

N:1 Thread Model #

The drawback of the 1:1 thread model is that thread creation and switching involve transitions between user space and kernel space, which carry significant performance overhead. It is also constrained by limited system resources, which cannot support creating a very large number of LWPs.

The N:1 thread model can effectively address these two issues of the 1:1 thread model.

In this thread model, thread creation, synchronization, destruction, and scheduling are all done in user space, without the help of the kernel. In other words, there is no user and kernel space switching during thread creation, synchronization, and destruction, making thread operations very fast and low-cost.

N:M Thread Model #

One drawback of the N:1 thread model is that the operating system is unaware of user-level threads, so when one thread blocks in a system call, the entire process blocks along with it.

The N:M thread model is a hybrid thread management model based on the above two thread models. It supports user-level threads to connect with kernel threads through LWPs, and there is an N:M mapping relationship between the number of user-level threads and the number of kernel-level LWPs.

Having understood these three thread models, you can now clearly understand the difference between the implementation of Go goroutines and Java threads.

In JDK 1.8's Thread.java, the Thread#start method is implemented as a native call to the start0 method. On Linux, a JVM thread is created via pthread_create, which in turn issues the clone() system call to create the thread.

Therefore, Java on Linux currently adopts the user-thread-plus-lightweight-process model, in which one user thread maps to one kernel thread, i.e., the 1:1 thread model. Since threads are scheduled by the kernel, switching from one thread to another involves a kernel context switch.
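Because each java.lang.Thread started here becomes one kernel thread, creating a large number of platform threads is comparatively expensive. A minimal sketch of this 1:1 mapping (the class name and printed message are illustrative):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PlatformThreadDemo {
	public static void main(String[] args) throws InterruptedException {
		final int N = 1000;
		AtomicInteger counter = new AtomicInteger();
		long start = System.currentTimeMillis();
		Thread[] threads = new Thread[N];
		for (int i = 0; i < N; i++) {
			// Each start() ultimately calls pthread_create -> clone() on Linux
			threads[i] = new Thread(counter::incrementAndGet);
			threads[i].start();
		}
		for (Thread t : threads) {
			t.join();
		}
		System.out.println("ran " + counter.get() + " tasks on " + N
				+ " kernel threads in " + (System.currentTimeMillis() - start) + " ms");
	}
}
```

Each of the 1000 tasks here occupies a full kernel thread with its own stack, which is exactly the cost the coroutine models below try to avoid.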

Go, on the other hand, uses the N:M thread model and implements its own scheduler: it multiplexes M goroutines onto N kernel threads, and goroutine context switches are handled by the coroutine scheduler in user space without entering the kernel. Their cost is therefore far smaller than a kernel thread context switch.

Implementation Principles of Coroutines #

Coroutines are not only implemented in Go, but in fact, most programming languages have their own set of coroutine implementations, including C#, Erlang, Python, Lua, JavaScript, Ruby, etc.

You may be more familiar with processes and threads compared to coroutines. Processes generally represent an application service, and multiple threads can be created within an application service. However, coroutines are different from processes and threads. We can think of coroutines as a class of functions or code within a function, and we can easily create multiple coroutines within a main thread.

The difference between calling a coroutine and calling an ordinary function is that a coroutine can suspend its execution partway through, by pausing or blocking, while other coroutines continue to run. Suspension here happens inside the program (in user space): the suspended coroutine hands execution control to another coroutine, and once the coroutine holding control has finished its work, the suspended coroutine is resumed from its suspension point. Suspending and awakening coroutines is the job of a scheduler.
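The suspend-and-resume handoff described above can be sketched in plain JDK code as a single-threaded cooperative scheduler. This is only an illustration of the idea, not a real coroutine library: each "coroutine" is a state machine that runs until it voluntarily yields, and the scheduler round-robins the ready ones (MiniCoroutine and step() are invented names):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class CooperativeDemo {
	interface MiniCoroutine {
		boolean step(); // run until the next suspension point; false = finished
	}

	public static void main(String[] args) {
		Deque<MiniCoroutine> ready = new ArrayDeque<>();
		for (char name = 'A'; name <= 'C'; name++) {
			final char n = name;
			ready.add(new MiniCoroutine() {
				int i = 0;
				public boolean step() {
					System.out.println("coroutine " + n + " step " + i);
					return ++i < 2; // suspend after each step, finish after two
				}
			});
		}
		// The "scheduler": pop a coroutine, run one step, re-queue it if not done.
		while (!ready.isEmpty()) {
			MiniCoroutine c = ready.poll();
			if (c.step()) {
				ready.add(c);
			}
		}
	}
}
```

All three coroutines interleave on one thread (A, B, C each print step 0 before any prints step 1), and control transfers happen entirely in user space, with no kernel involvement.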

With the help of the diagram below, you can have a clearer understanding of how coroutines implemented based on the N:M thread model work.

Assume the program creates two worker threads by default. Coroutines A, B, C, D, … are created in the main thread and stored in a ready queue. The scheduler first assigns worker thread A to execute coroutine A and worker thread B to execute coroutine B; the other coroutines wait in the queue to be scheduled.

[Figure: coroutines assigned from the ready queue to worker threads]

When coroutine A calls the pause method or blocks, it enters the suspended queue, and the scheduler picks another coroutine from the waiting queue to take over thread A. When coroutine A is awakened, it re-enters the ready queue and competes for a thread through the scheduler; if it wins one, it continues executing, otherwise it keeps waiting.

[Figure: a suspended coroutine re-entering the ready queue after being awakened]

Compared to threads, coroutines cause fewer CPU context switches from contention over shared resources. They are better suited to I/O-intensive applications, especially network services where much of the time is spent waiting for backend responses: coroutines keep threads from blocking on those responses, making full use of multi-core, multi-threaded capacity. For CPU-intensive applications, however, the CPU is busy most of the time anyway, so the advantage of coroutines is less significant.

Kilim Coroutine Framework #

Although many programming languages have implemented coroutines, Java does not natively support them. But don't worry: we can still use coroutines in Java through the Kilim coroutine framework.

Currently, the Kilim coroutine framework is widely used in Java, allowing developers to use coroutines in Java at a low cost.

Introducing Kilim into a Java project is different from simply importing a third-party component. Besides adding the jar package, we also need to enhance the bytecode produced by compilation using the weaving tool Kilim provides, for example to recognize which methods are suspendable and to add context handling to them. There are generally four ways to perform this weaving:

  • Use a Maven plugin during compilation.
  • Invoke the kilim.tools.Weaver tool on the compiled classes.
  • Launch the program via kilim.tools.Kilim, which weaves class files at runtime.
  • Add if (kilim.tools.Kilim.trampoline(false, args)) return; as the first line of the main function.

The Kilim framework includes four core components: Task, Fiber, Scheduler, and Mailbox.

[Figure: Kilim's four core components: Task, Fiber, Scheduler, and Mailbox]

The Task object executes the business logic and is comparable to a Thread in multithreaded code. Like the Thread class, Task has an entry method, but in Task it is named execute; we write the business logic the coroutine should perform in the execute method.

Just like threads implemented by Thread, coroutines implemented by Task also have states, including Ready, Running, Pausing, Paused, and Done. After a Task object is created, it is in the Ready state. After calling the execute() method, the coroutine enters the Running state. During runtime, the coroutine can be paused. The state during pause is Pausing, and the state after pause is Paused. A paused coroutine can be resumed. The state after a coroutine ends normally is Done.

The Fiber object is similar to the thread stack in Java and is mainly used to maintain the execution stack of the Task. Fiber is the key to implementing N:M thread mapping.

The Scheduler is the core scheduler for implementing coroutines, responsible for dispatching Task to the designated worker threads for execution. The default initialization number of worker threads is the number of CPUs on the machine.

The Mailbox object acts as a message box through which coroutines communicate and share data. The biggest difference between coroutines and threads here is that threads share data through shared memory, while coroutines share data through communication, primarily to avoid the thread-safety issues that shared mutable memory brings.
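Conceptually, a Mailbox behaves much like a bounded blocking queue: producer and consumer exchange messages instead of locking shared state. A JDK-only analogy (not Kilim's actual implementation, which pauses coroutines rather than blocking threads):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class MailboxAnalogy {
	public static void main(String[] args) throws InterruptedException {
		// Bounded "mailbox" with capacity 10, like new Mailbox<Integer>(1, 10)
		BlockingQueue<Integer> mailbox = new ArrayBlockingQueue<>(10);
		Thread producer = new Thread(() -> {
			for (int i = 0; i < 5; i++) {
				try {
					mailbox.put(i); // blocks if the mailbox is full
				} catch (InterruptedException e) {
					return;
				}
			}
		});
		producer.start();
		int sum = 0;
		for (int i = 0; i < 5; i++) {
			sum += mailbox.take(); // blocks until a message arrives
		}
		producer.join();
		System.out.println("received sum = " + sum); // 0+1+2+3+4 = 10
	}
}
```

The key difference is that put() and take() here block the whole kernel thread, whereas Kilim's mb.put() and mb.get() merely pause the coroutine, freeing the worker thread for other Tasks.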

Performance Comparison of Coroutines and Threads #

Next, let's compare the performance of coroutines and threads using a simple producer-consumer example. You can download the code from GitHub and run it locally.

Source code for Java multi-threading implementation:

import java.util.ArrayList;
import java.util.List;

public class MyThread {
	private static int count = 0;
	private static final int FULL = 10;
	// Lock on a dedicated object rather than an interned String literal
	private static final Object LOCK = new Object();

	public static void main(String[] args) {
		MyThread test1 = new MyThread();

		long start = System.currentTimeMillis();

		List<Thread> list = new ArrayList<Thread>();
		for (int i = 0; i < 1000; i++) {
			Thread thread = new Thread(test1.new Producer());
			thread.start();
			list.add(thread);
		}

		for (int i = 0; i < 1000; i++) {
			Thread thread = new Thread(test1.new Consumer());
			thread.start();
			list.add(thread);
		}

		try {
			for (Thread thread : list) {
				thread.join();
			}
		} catch (InterruptedException e) {
			e.printStackTrace();
		}

		long end = System.currentTimeMillis();
		System.out.println("Execution time for threads: " + (end - start));
	}

	class Producer implements Runnable {
		public void run() {
			for (int i = 0; i < 10; i++) {
				synchronized (LOCK) {
					while (count == FULL) {
						try {
							LOCK.wait();
						} catch (InterruptedException e) {
							e.printStackTrace();
						}
					}
					count++;
					System.out.println(Thread.currentThread().getName() + " producer produces, current count: " + count);
					LOCK.notifyAll();
				}
			}
		}
	}

	class Consumer implements Runnable {
		public void run() {
			for (int i = 0; i < 10; i++) {
				synchronized (LOCK) {
					while (count == 0) {
						try {
							LOCK.wait();
						} catch (InterruptedException e) {
							e.printStackTrace();
						}
					}
					count--;
					System.out.println(Thread.currentThread().getName() + " consumer consumes, current count: " + count);
					LOCK.notifyAll();
				}
			}
		}
	}
}

Source code for Kilim coroutine framework:

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import kilim.Mailbox;
import kilim.Pausable;
import kilim.Task;

public class Coroutine {
	static Map<Integer, Mailbox<Integer>> mailMap = new HashMap<Integer, Mailbox<Integer>>();

	public static void main(String[] args) {
		// Weave the bytecode at runtime, then re-enter main
		if (kilim.tools.Kilim.trampoline(false, args)) return;
		Properties props = new Properties();
		props.setProperty("kilim.Scheduler.numThreads", "1");
		System.setProperties(props);
		long startTime = System.currentTimeMillis();
		for (int i = 0; i < 1000; i++) {
			Mailbox<Integer> mb = new Mailbox<Integer>(1, 10);
			new Producer(i, mb).start();
			mailMap.put(i, mb);
		}

		for (int i = 0; i < 1000; i++) {
			new Consumer(mailMap.get(i)).start();
		}

		// Wait until all tasks are idle, then shut down the scheduler
		Task.idledown();

		long endTime = System.currentTimeMillis();
		System.out.println(Thread.currentThread().getName() + " Total execution time: " + (endTime - startTime));
	}
}

class Producer extends Task<Object> {

	Integer count = null;
	Mailbox<Integer> mb = null;

	public Producer(Integer count, Mailbox<Integer> mb) {
		this.count = count;
		this.mb = mb;
	}

	public void execute() throws Pausable {
		count = count * 10;
		for (int i = 0; i < 10; i++) {
			mb.put(count); // pauses the coroutine if the mailbox is full
			System.out.println(Thread.currentThread().getName() + " producer produces, current count: " + mb.size() + " produced: " + count);
			count++;
		}
	}
}

class Consumer extends Task<Object> {

	Mailbox<Integer> mb = null;

	public Consumer(Mailbox<Integer> mb) {
		this.mb = mb;
	}

	public void execute() throws Pausable {
		Integer c = null;
		for (int i = 0; i < 10000; i++) {
			c = mb.get(); // pauses the coroutine until a message arrives

			if (c == null) {
				System.out.println(" Counting ");
			} else {
				System.out.println(Thread.currentThread().getName() + " consumer consumes, current count: " + mb.size() + " consumed: " + c);
				c = null;
			}
		}
	}
}

In this example, I created 1000 producers and 1000 consumers; each producer produces 10 items, and the 1000 consumers consume them concurrently. The results of one run of each example (times in milliseconds) are as follows:

Execution time for threads: 2761

Execution time for coroutines: 1050

From the above performance comparison, we can see that coroutines perform better in scenarios with severe blocking. In fact, I/O blocking scenarios are the main application of coroutines in Java.

Summary #

Coroutines are closely related to threads. Coroutines can be considered as code blocks running on threads. The suspension operation provided by coroutines allows them to pause execution without blocking the thread.

Coroutines are also lightweight resources. Even if thousands of coroutines are created, it is not a heavy burden on the system. In contrast, if thousands of threads are created in a program, it can greatly strain the system. It can be said that the design of coroutines greatly improves the efficiency of thread utilization.

Through today’s learning, when others talk about the advantages of Go language in network programming, you will not be confused. Those of us who are learning Java should not feel that coroutines are too far away from us. Coroutines are a design concept that is not limited to a specific language. Moreover, Java can already implement coroutines with the help of coroutine frameworks.

However, coroutines are more maturely applied in Go, while coroutine support in Java is not yet very stable and lacks validation in large-scale projects. Java's coroutine design still has a long way to go.

Reflection Question #

In Java, besides the Kilim framework, do you know of any other coroutine frameworks that can help implement coroutines in Java? Have you used them before?