16 Multithreading Tuning: How to Optimize Context Switching in Multithreading #

Hello, I’m Liu Chao.

Based on the explanation from the previous lesson, I believe you now have a certain understanding of context switching. If there is only a single thread, then once it is allocated the CPU it will generally not be scheduled out. However, if the number of runnable threads is much greater than the number of CPUs, the operating system will eventually schedule out a running thread so that other threads can use the CPU, which leads to context switching.

Moreover, in multithreaded programming, when lock contention is present and a thread blocks waiting for a lock, the JVM usually suspends that thread and swaps it out. If blocking occurs frequently, a CPU-intensive program will incur even more context switching.

So here is the problem: we know that multithreading is necessary in some scenarios, but multithreaded programming introduces context switching, which adds performance overhead. How, then, do we optimize context switching in multithreading? This is the topic I want to share with you today, and I will focus on several common optimization methods.

Lock Competition Optimization #

When most people encounter performance issues in multi-threaded programming, their first reaction is often to think of locks.

Competition for lock resources in multithreading causes context switching: the more threads that are blocked by lock contention, the more frequent the context switching and the greater the system’s performance overhead. From this we can see that in multithreaded programming, locks themselves are not the root cause of performance overhead; lock contention is.

In lectures 11-13, I focused on lock optimization, and we know that lock optimization ultimately boils down to reducing competition. In this lecture, let’s summarize some ways to optimize locks.

1. Reduce lock holding time #

We know that the longer a lock is held, the more threads are waiting for the contested resource to be released. If it is a synchronized lock resource, it not only causes context switching between threads, but may also increase context switching between processes.

In lecture 12, I shared some more concrete methods, such as moving code that is unrelated to the lock out of the synchronized block, especially high-overhead operations and operations that may block, as the before/after snippets below show.

  • Before optimization

    public synchronized void mySyncMethod() {
        businesscode1();
        mutextMethod();
        businesscode2();
    }

  • After optimization

    public void mySyncMethod() {
        businesscode1();
        synchronized (this) {
            mutextMethod();
        }
        businesscode2();
    }

2. Reduce lock granularity #

Synchronized locks ensure the atomicity of operations on an object, and we can consider splitting a lock into smaller granularity to avoid excessive competition for a single lock resource. There are two specific ways to do this:

  • Lock separation

Unlike traditional locks, read-write locks implement lock separation: a read-write lock is split into a “read lock” and a “write lock”. The rule is that reads can be shared, but only one write is allowed at a time.

The benefit of this approach is that with multithreaded reading, read operations are not mutually exclusive with each other, while read-write and write-write operations are. A traditional exclusive lock, by contrast, does not distinguish reads from writes, so every combination is mutually exclusive: read-read, read-write, and write-write. Therefore, in multithreaded scenarios where reads far outnumber writes, lock separation avoids resource contention under highly concurrent reading and thus avoids context switching (see the sketch after this list).

  • Lock striping

When using locks to ensure the atomicity of collections or large objects, we can divide the lock object further. For example, ConcurrentHashMap before Java 1.8, which I mentioned earlier, uses lock striping: the map is split into segments, each guarded by its own lock, so threads operating on different segments do not contend with each other.
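
As an illustration of lock separation, here is a minimal sketch of a cache guarded by ReentrantReadWriteLock (the class and field names are hypothetical, chosen only for this example):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteCache {
    private final Map<String, Object> map = new HashMap<>();
    private final ReadWriteLock rwl = new ReentrantReadWriteLock();

    public Object get(String key) {
        rwl.readLock().lock();// read lock: concurrent readers do not block each other
        try {
            return map.get(key);
        } finally {
            rwl.readLock().unlock();
        }
    }

    public void put(String key, Object value) {
        rwl.writeLock().lock();// write lock: exclusive against both readers and writers
        try {
            map.put(key, value);
        } finally {
            rwl.writeLock().unlock();
        }
    }
}

In read-mostly workloads, this keeps reader threads from blocking one another, which is exactly what removes the context switching described above.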

3. Replace contended locks with non-blocking optimistic locks #

The volatile keyword is used to guarantee visibility and ordering. Reading and writing operations with volatile do not cause context switching, so the overhead is relatively small. However, volatile cannot guarantee the atomicity of variable operations because it lacks the exclusivity of locks.

On the other hand, CAS (Compare-And-Swap) is an atomic if-then-act operation. CAS is a lock-free algorithm that guarantees the consistency of read and write operations on a shared variable. A CAS operation has three operands: the memory value V, the expected old value A, and the new value B. The value is modified to B only when V equals A; otherwise nothing is done. The CAS algorithm itself does not cause context switching. The Atomic classes in Java use the CAS algorithm to update data, eliminating the need for extra locking.
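
To make the contrast concrete, here is a toy counter (my own example, assuming nothing beyond the JDK’s java.util.concurrent.atomic package): volatile alone cannot make an increment atomic, while AtomicInteger performs the same update with a non-blocking CAS retry loop:

import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    // volatile guarantees visibility, but ++ is still a non-atomic
    // read-modify-write, so concurrent updates can be lost.
    private volatile int unsafeCount = 0;

    private final AtomicInteger count = new AtomicInteger(0);

    public void unsafeIncrement() {
        unsafeCount++;// race: two threads may read the same old value
    }

    public void safeIncrement() {
        count.incrementAndGet();// internally a CAS retry loop, no blocking
    }

    // The same update written as an explicit CAS loop.
    public void casIncrement() {
        int expected;
        do {
            expected = count.get();// read the current value (expected value A)
        } while (!count.compareAndSet(expected, expected + 1));// succeeds only if memory value V still equals A
    }
}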

Above, we learned how to optimize lock competition at the coding level. Besides that, the JVM internally optimizes the Synchronized lock as well, as I discussed in detail in lecture 12. Let’s briefly review it here.

In JDK 1.6, the JVM divided the synchronized lock into biased locks, lightweight locks, spin locks, and heavyweight locks, and the optimization path follows this order. The JIT compiler also optimizes synchronized locks through lock elimination and lock coarsening when dynamically compiling synchronized blocks.
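
As a hedged illustration of the JIT optimizations just mentioned (whether they actually fire depends on the JVM and on escape analysis; the example is mine, not from lecture 12), consider this pattern:

public class LockElisionExample {
    // sb is a local object that never escapes this method, so escape
    // analysis allows the JIT to remove the synchronized locks inside
    // StringBuffer.append() entirely (lock elimination).
    public String concat(String a, String b, String c) {
        StringBuffer sb = new StringBuffer();
        sb.append(a);// each append() locks sb...
        sb.append(b);// ...and adjacent lock/unlock pairs on the same
        sb.append(c);// object may be merged into one (lock coarsening)
        return sb.toString();
    }
}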

Optimization of wait/notify #

In Java, we can use the wait() method and the notify() method or the notifyAll() method of the Object class to implement communication between threads.

When a thread calls the wait() method, it will block and wait for notification from other threads (which call the notify() method or the notifyAll() method). When a thread calls the notify() method or the notifyAll() method, it will notify other threads to return from the wait() method.

Here, we use wait() / notify() to implement a simple producer and consumer example. The code is as follows:

import java.util.Vector;

public class WaitNotifyTest {
    public static void main(String[] args) {
        Vector<Integer> pool=new Vector<Integer>();
        Producer producer=new Producer(pool, 10);
        Consumer consumer=new Consumer(pool);
        new Thread(producer).start();
        new Thread(consumer).start();
    }
}
/**
 * Producer
 */
class Producer implements Runnable{
    private Vector<Integer> pool;
    private Integer size;

    public Producer(Vector<Integer>  pool, Integer size) {
        this.pool = pool;
        this.size = size;
    }

    public void run() {
        for(;;){
            try {
                System.out.println("Produce a product");
                produce(1);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
    private void produce(int i) throws InterruptedException{
        synchronized (pool) {
            // Check the condition while holding the lock, so that a
            // notification cannot be missed between the check and the wait.
            while(pool.size()==size){
                System.out.println("Producer is waiting for the consumer to consume the product, the current number of products is "+pool.size());
                pool.wait();// wait for the consumer to consume
            }
            pool.add(i);
            pool.notifyAll();// notify the consumer to consume
        }
    }
}
 
/**
 * Consumer
 */
class Consumer implements Runnable{
    private Vector<Integer>  pool;
    public Consumer(Vector<Integer>  pool) {
        this.pool = pool;
    }
    
    public void run() {
        for(;;){
            try {
                System.out.println("Consume a product");
                consume();
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }
    
    private void consume() throws InterruptedException{
        synchronized (pool) {
            // Same pattern: test the condition inside the synchronized block.
            while(pool.isEmpty()){
                System.out.println("Consumer is waiting for the producer to produce the product, the current number of products is "+pool.size());
                pool.wait();// wait for the producer to produce the product
            }
            pool.remove(0);
            pool.notifyAll();// notify the producer to produce the product
        }
    }

}

Excessive context switching caused by the use of wait/notify #

From the figure below, we can see that when the consumer thread acquires the lock for the first time and finds there is no product to consume, it calls Object.wait(), which releases the lock and suspends the thread into the waiting state. This results in one context switch.

After the producer acquires the lock and executes notifyAll(), it wakes up the waiting consumer thread, which results in another context switch.

When the awakened thread resumes, it needs to acquire the object’s internal lock again. At that point it may have to compete with other newly active threads for the lock, which may cause additional context switches.

If multiple consumer threads are waiting at the same time and notifyAll() is used, all of the blocked threads are woken up. However, there may still be nothing in stock for some of them, and prematurely waking these consumer threads only sends them back into the waiting state, resulting in unnecessary context switches.

[Figure: context switches during the wait/notify producer-consumer interaction]

Optimize the use of wait/notify to reduce context switching #

First, in scenarios with multiple different consumers, Object.notify() can be used instead of Object.notifyAll(). Because Object.notify() wakes up only a single waiting thread, it does not prematurely wake the other blocked threads whose conditions have not been met, and thus reduces the corresponding context switching.

Second, after the producer calls Object.notify() / notifyAll() to wake up other threads, it should release the internal lock as soon as possible, so that the awakened threads are not blocked again waiting for the lock to be released when they try to reacquire it.

Finally, to avoid waiting for too long, we often use Object.wait(long) to set a wait timeout. However, a thread returning from wait(long) cannot distinguish whether it returned because the timeout expired or because it was notified, which may cause the thread to attempt to acquire the lock again and increase context switching.

Here, I suggest using a Lock together with the Condition interface to replace wait / notify on the synchronized internal lock for waiting and notification. This not only solves the problem that Object.wait(long) cannot distinguish its return reason, but also solves the problem of threads being woken up prematurely.

The await method, signal method, and signalAll method defined in the Condition interface are equivalent to Object.wait(), Object.notify(), and Object.notifyAll() respectively.
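
Here is a minimal sketch of the producer-consumer idea rewritten with ReentrantLock and two Condition queues (the class and field names are mine, not from this lecture). Separate notFull / notEmpty conditions mean a producer only ever wakes consumers and vice versa, avoiding premature wakeups, and await(long, TimeUnit) returns false on timeout, so the two return reasons can be told apart:

import java.util.LinkedList;
import java.util.Queue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class ConditionPool {
    private final Queue<Integer> pool = new LinkedList<>();
    private final int size;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition notFull = lock.newCondition();// producers wait here
    private final Condition notEmpty = lock.newCondition();// consumers wait here

    public ConditionPool(int size) { this.size = size; }

    public void produce(int i) throws InterruptedException {
        lock.lock();
        try {
            while (pool.size() == size) {
                notFull.await();// only producers are parked on this condition
            }
            pool.add(i);
            notEmpty.signal();// wake exactly one consumer, never another producer
        } finally {
            lock.unlock();
        }
    }

    public Integer consume(long timeout, TimeUnit unit) throws InterruptedException {
        lock.lock();
        try {
            while (pool.isEmpty()) {
                // Unlike Object.wait(long), the boolean result tells us
                // whether we timed out or were signalled.
                if (!notEmpty.await(timeout, unit)) {
                    return null;// timed out
                }
            }
            Integer v = pool.remove();
            notFull.signal();
            return v;
        } finally {
            lock.unlock();
        }
    }
}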

Setting the Thread Pool Size Reasonably to Avoid Creating Too Many Threads #

The number of threads in a thread pool should not be set too large, because once the total number of working threads in the thread pool exceeds the number of processors the system has, it will result in excessive context switching. More details on how to set the thread pool size reasonably will be discussed in Lecture 18.

There is another situation: some methods used to create thread pools do not directly expose the pool size to us. For example, a thread pool created with Executors.newCachedThreadPool() reuses its idle internal threads to process newly submitted tasks; if no idle thread is available, it creates a new one, and the total number of threads is bounded only by Integer.MAX_VALUE, which is effectively no limit at all. In scenarios with a large number of time-consuming tasks, this type of thread pool creates a huge number of working threads, leading to frequent context switching. Therefore, this kind of thread pool is only suitable for handling large numbers of short, non-blocking tasks.
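
A safer alternative, shown here as a hedged sketch (the sizing below is a common rule of thumb and my own assumption, not this course’s recipe, which comes in Lecture 18), is to construct the pool explicitly with a bounded thread count and a bounded queue:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPool {
    public static ThreadPoolExecutor create() {
        // Tie the thread count to the CPU count, so that runnable
        // threads do not greatly outnumber the processors.
        int cpus = Runtime.getRuntime().availableProcessors();
        return new ThreadPoolExecutor(
                cpus,               // core pool size
                cpus * 2,           // maximum pool size (assumed headroom for mildly blocking tasks)
                60L, TimeUnit.SECONDS,         // idle timeout for non-core threads
                new ArrayBlockingQueue<>(1000) // bounded queue instead of unbounded thread creation
        );
    }
}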

Implementing Non-Blocking Waiting with Coroutines #

When many people hear the term “coroutines,” they immediately think of the Go programming language. For most Java programmers, coroutines may still be unfamiliar, but their usage in Go is quite mature.

Coroutines are a lighter-weight alternative to threads. Unlike processes and threads, whose switching is managed by the operating-system kernel, coroutines are scheduled entirely by the program itself and run in user mode. Because switching between coroutines never enters the kernel, it avoids the context-switching overhead associated with thread switches, which brings a significant performance improvement. The application of coroutines in multithreaded scenarios will be discussed in detail in Lesson 18.

Reducing Java Virtual Machine Garbage Collection #

In the previous lecture on the causes of context switching, we mentioned that “garbage collection can lead to context switching.”

Many JVM garbage collectors (such as the Serial collector and the ParNew collector) produce memory fragmentation when reclaiming old-generation objects, so the memory must be compacted during collection. Moving a surviving object changes its memory address, which requires suspending the application threads before the move and waking them up again after it completes. Therefore, reducing the frequency of JVM garbage collection can effectively reduce context switching.
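
As a small, hedged sketch of how to observe that frequency from inside an application (these are standard java.lang.management APIs, not something from this lecture), you can poll the garbage-collector MXBeans periodically:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    // Print cumulative GC counts and times; sampling this periodically
    // shows how often collections (and their pauses) occur.
    public static void print() {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " collections=" + gc.getCollectionCount()
                    + " totalTimeMs=" + gc.getCollectionTime());
        }
    }
}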

Summary #

Context switching is one of the reasons for the performance overhead in multi-threaded programming. Operations such as competing for locks, inter-thread communication, and excessive thread creation can all lead to context switching in a system. In addition, I/O blocking and JVM garbage collection can also increase context switching.

In general, frequent context switching can impact the performance of a system, so we should try to avoid it. Furthermore, we can also consider context switching as a performance metric for the system and incorporate it into service performance monitoring to prevent potential issues.

Reflection Question #

In addition to the factors I mentioned in my previous analysis that lead to thread context switching, do you know of any other factors that contribute to it? What are the corresponding optimization methods?