
12 Multithreading Lock Optimization: Dive Deeper into the Optimization Methods of the Synchronized Synchronization Lock #

Hello, I’m Liu Chao. Starting from this lesson, we will officially enter the third module: Multithreading Performance Optimization.

In concurrent programming, when multiple threads access the same shared resource, we must consider how to maintain the atomicity of the data. Before JDK 1.5, Java relied on the Synchronized keyword for locking. Synchronized is an intrinsic lock implemented by the JVM, and lock acquisition and release are handled implicitly by the JVM.

In JDK 1.5 and later, the concurrent package introduced the Lock interface for locking. It provides synchronization features similar to those of the Synchronized keyword, but requires explicit lock acquisition and release.
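To make the contrast concrete, here is a minimal sketch of explicit locking with the JDK's ReentrantLock (the class and field names are mine, for illustration only):

import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class ExplicitLockDemo {
    private final Lock lock = new ReentrantLock();
    private int count = 0;

    public void increment() {
        lock.lock();          // explicit acquisition
        try {
            count++;
        } finally {
            lock.unlock();    // explicit release, always in a finally block
        }
    }
}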

The Lock synchronization lock is implemented in Java, while Synchronized is implemented on top of the underlying operating system's Mutex Lock. Every Mutex acquisition and release incurs a context switch between user mode and kernel mode, increasing system performance overhead. Under intense lock contention, therefore, Synchronized performs poorly, and it is often referred to as a heavyweight lock.

Especially when a single thread repeatedly applies for a lock, the Synchronized lock in JDK 1.5 performs much worse than Lock. For example, in Dubbo, which builds its communication on Netty, the consumer needs a thread to asynchronously poll and listen for messages returned by the server, and a lock is needed to guarantee the atomicity of each request session. If we used a Synchronized lock here, every time the same thread requested the lock resource it would trigger a context switch between user mode and kernel mode.

Since JDK 1.6, Java has made significant optimizations to the Synchronized lock, and in some scenarios its performance even surpasses that of the Lock synchronization lock. In this lesson, let's look at the optimizations the Synchronized lock has undergone and how they improve performance.

Implementation Principle of Synchronized Lock #

Before understanding the optimization of Synchronized lock, let’s first take a look at its underlying implementation principle, which will help us better understand the following content.

Usually, there are two ways to create a synchronized lock with the Synchronized keyword: one modifies a method, the other modifies a code block. Here are the two ways to lock with synchronized:

  • When the keyword is used to modify an instance method, the lock is the current instance.
public synchronized void method1() {
    // code
}
  • When the keyword is used to modify a code block, the lock is the object inside the parentheses.
public void method2() {
    Object o = new Object();
    synchronized (o) {
        // code
    }
}

Now we can inspect the concrete bytecode. Run the commands below: first compile the class, then disassemble it with javap to print the bytecode:

javac -encoding UTF-8 SyncTest.java  // Compile class file first

javap -v SyncTest.class // Print bytecode with javap

By examining the output bytecode, you will find that when Synchronized modifies a code block, synchronization is implemented with the monitorenter and monitorexit instructions: on executing monitorenter the thread acquires the Monitor object, and on executing monitorexit it releases the Monitor object.

public void method2();
    descriptor: ()V
    flags: ACC_PUBLIC
    Code:
      stack=2, locals=4, args_size=1
         0: new           #2                  // class java/lang/Object
         3: dup
         4: invokespecial #1                  // Method java/lang/Object."<init>":()V
         7: astore_1
         8: aload_1
         9: dup
        10: astore_2
        11: monitorenter // monitorenter instruction
        12: aload_2
        13: monitorexit  // monitorexit instruction
        14: goto          22
        17: astore_3
        18: aload_2
        19: monitorexit
        20: aload_3
        21: athrow
        22: return
      Exception table:
         from    to  target type
            12    14    17   any
            17    20    17   any
      LineNumberTable:
        line 18: 0
        line 19: 8
        line 21: 12
        line 22: 22
      StackMapTable: number_of_entries = 2
        frame_type = 255 /* full_frame */
          offset_delta = 17
          locals = [ class com/demo/io/SyncTest, class java/lang/Object, class java/lang/Object ]
          stack = [ class java/lang/Throwable ]
        frame_type = 250 /* chop */
          offset_delta = 4

Now let’s take a look at the bytecode of synchronized methods. You will notice that when Synchronized modifies a synchronized method, there are no monitorenter and monitorexit instructions, but an ACC_SYNCHRONIZED flag appears.

This is because the JVM uses the ACC_SYNCHRONIZED access flag to mark a method as synchronized. When the method is called, the invocation instruction checks whether the ACC_SYNCHRONIZED flag is set. If it is, the executing thread first acquires the Monitor object and then executes the method body; while the method executes, no other thread can acquire the same Monitor object. When the method completes, the Monitor object is released.

public synchronized void method1();
    descriptor: ()V
    flags: ACC_PUBLIC, ACC_SYNCHRONIZED // ACC_SYNCHRONIZED flag
    Code:
      stack=0, locals=1, args_size=1
         0: return
      LineNumberTable:
        line 8: 0

Based on the bytecode above, let's look at how synchronized methods implement locking underneath.

Synchronization in the JVM is implemented by entering and exiting a Monitor object. Each object instance has a Monitor, which is created and destroyed along with the object. The Monitor is implemented by ObjectMonitor, which is defined in the HotSpot C++ source file "ObjectMonitor.hpp", as shown below:

ObjectMonitor() {
   _header = NULL;
   _count = 0;          // number of threads contending for this monitor
   _waiters = 0,        // number of waiting threads (comma as in the original source)
   _recursions = 0;     // reentrancy count of the owner thread
   _object = NULL;
   _owner = NULL;       // thread that currently owns the monitor
   _WaitSet = NULL;     // threads that called wait(), waiting to be notified
   _WaitSetLock = 0 ;
   _Responsible = NULL ;
   _succ = NULL ;
   _cxq = NULL ;
   FreeNext = NULL ;
   _EntryList = NULL ;  // threads blocked while trying to enter the monitor
   _SpinFreq = 0 ;
   _SpinClock = 0 ;
   OwnerIsThread = 0 ;
}

When multiple threads access a synchronized section at the same time, the threads that do not get the lock are first stored in the _EntryList collection, where they sit in the blocked state. When a thread does acquire the object's Monitor, the Monitor relies on the underlying operating system's Mutex Lock to achieve mutual exclusion: the thread that successfully acquires the Mutex holds it, and other threads cannot acquire it.

If a thread calls the wait() method, it releases the Mutex it currently holds and enters the _WaitSet collection, waiting to be awakened. If the thread runs the method to completion, it likewise releases the Mutex.

[Figure: how threads flow through the Monitor's _EntryList, owner, and _WaitSet]
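To connect this to code, here is a minimal sketch (all names here are mine) of how wait() and notify() map onto the _WaitSet and _EntryList described above:

public class MonitorDemo {
    private final Object lock = new Object();
    private boolean ready = false;

    public void consumer() throws InterruptedException {
        synchronized (lock) {            // contending threads block in the _EntryList
            while (!ready) {
                lock.wait();             // releases the Mutex and enters the _WaitSet
            }
            // continues here after being notified and re-acquiring the Monitor
        }
    }

    public void producer() {
        synchronized (lock) {
            ready = true;
            lock.notify();               // wakes one thread from the _WaitSet
        }
    }
}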

After reading the above explanation, you should have a solid grasp of how synchronized locks are implemented. In summary, in this implementation the Monitor relies on the operating system, which involves switching between user mode and kernel mode and therefore increases performance overhead.

Lock Upgrade Optimization #

To improve performance, JDK 1.6 introduced the concepts of biased locks, lightweight locks, and heavyweight locks to reduce the context switching caused by lock contention, and the redesigned Java object header is what implements the lock upgrade functionality.

When a Java object is used by the Synchronized keyword as a synchronization lock, the whole series of upgrade operations around that lock revolves around the Java object header.

Java Object Header #

In the JDK 1.6 JVM, an object instance in heap memory is divided into three parts: the object header, instance data, and alignment padding. The Java object header consists of the mark word, a pointer to the class, and, for array objects, the array length.

The mark word records information about the object and its lock. On a 64-bit JVM the mark word is 64 bits long. Let's look at the storage structure on a 64-bit JVM, as shown in the following figure:

[Figure: 64-bit mark word layout under different lock states]

The lock upgrade functionality mainly relies on the lock flag and the biased-lock flag in the mark word. A Synchronized synchronization lock starts out as a biased lock; as competition grows more intense, it is upgraded to a lightweight lock and ultimately to a heavyweight lock. Let's follow this optimization path and look at each stage.
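If you want to observe these mark-word states yourself, the OpenJDK JOL tool can print an object's header. The sketch below assumes the org.openjdk.jol:jol-core dependency is on the classpath; the exact output depends on your JVM version and flags:

import org.openjdk.jol.info.ClassLayout;

public class MarkWordDemo {
    public static void main(String[] args) {
        Object o = new Object();
        // Before locking: the mark word is typically in the unlocked (or biasable) state
        System.out.println(ClassLayout.parseInstance(o).toPrintable());
        synchronized (o) {
            // While this thread holds the lock: the lock bits in the mark word change
            System.out.println(ClassLayout.parseInstance(o).toPrintable());
        }
    }
}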

1. Biased Lock #

Biased locks mainly optimize the case where the same thread competes for the same lock repeatedly. In some scenarios, most of the locking work comes from a single thread: for example, a thread that loops and listens after being created, or a single thread operating a thread-safe collection. In those cases the same thread must acquire and release the lock on every operation, which, without optimization, costs context switches between user mode and kernel mode.
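For example, the sketch below shows the single-threaded pattern that biased locking targets; Vector is used only because its methods are synchronized (whether the lock actually stays biased depends on your JVM version and flags):

import java.util.Vector;

public class BiasedLockDemo {
    public static void main(String[] args) {
        Vector<Integer> list = new Vector<>();  // Vector's methods are synchronized
        for (int i = 0; i < 1_000_000; i++) {
            list.add(i);  // the same thread acquires and releases the same lock every time
        }
        System.out.println(list.size());
    }
}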

The idea behind biased locks is that when a thread accesses the same synchronized code or method again, it only needs to check whether the biased thread ID in the object header's mark word points to itself; if so, there is no need to enter the Monitor to compete for the lock. When an object is used as a synchronization lock and a thread grabs it, the lock flag stays 01, the "is biased" flag is set to 1, and the ID of the thread that grabbed the lock is recorded, indicating that the lock has entered the biased state.

Once another thread competes for the lock resource, the biased lock is revoked. Revocation requires waiting for a global safepoint, pausing the thread that holds the lock, and checking whether that thread is still executing inside the synchronized region. If it is, the lock is upgraded; if not, the lock is preempted by the other thread.

The red line part in the following figure shows the process of biased lock acquisition and revocation:

[Figure: biased lock acquisition and revocation flow (red path)]

Therefore, in high-concurrency scenarios where a large number of threads compete for the same lock resource, biased locks keep getting revoked, and having them enabled only adds performance overhead. In this case, we can disable biased locks with a JVM parameter to optimize system performance:

-XX:-UseBiasedLocking // Disable biased locks (enabled by default)

or

-XX:+UseHeavyMonitors // Set heavyweight locks

2. Lightweight Lock #

When another thread competes for this lock, since the lock is a biased lock, the newcomer discovers that the thread ID in the object header's mark word is not its own and performs a CAS operation to try to acquire the lock. If the CAS succeeds, it simply replaces the thread ID in the mark word with its own, and the lock stays in the biased state. If the CAS fails, it means there is some degree of contention on the lock, and the biased lock is upgraded to a lightweight lock.

Lightweight locks suit scenarios where threads execute synchronized blocks alternately and there is little sustained contention over the whole synchronization period.
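As a rough illustration (all names are mine, and the actual lock state depends on timing and the JVM), the sketch below shows two threads that mostly take turns entering the same lock:

public class AlternatingDemo {
    private static final Object lock = new Object();
    private static int counter = 0;

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                synchronized (lock) {  // usually uncontended at any given instant
                    counter++;
                }
                Thread.yield();        // encourage alternation rather than head-on contention
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println(counter);   // 200000: increments are protected by the lock
    }
}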

The red line part in the following figure shows the process of upgrading to a lightweight lock and its operations:

[Figure: upgrade from biased lock to lightweight lock (red path)]

3. Spin Lock and Heavyweight Lock #

When the CAS operation of a lightweight lock fails, the thread would be suspended and enter the blocked state. If the thread holding the lock then releases it within a very short time, the blocked thread has to be woken up just to reapply for the lock, which is clearly wasteful.

So the JVM provides spin locks: a thread keeps trying to acquire the lock by spinning, avoiding suspension and blocking. This is based on the observation that in most cases a lock is not held for long, so suspending and blocking the thread may not be worth the cost.
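To illustrate the idea of spinning, here is a toy user-level spin lock of my own, not the JVM's internal implementation: instead of blocking, the thread keeps retrying a CAS, betting that the lock will be freed soon:

import java.util.concurrent.atomic.AtomicBoolean;

public class SpinLock {
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        // spin: retry the CAS until it succeeds, without suspending the thread
        while (!locked.compareAndSet(false, true)) {
            Thread.onSpinWait();  // busy-wait hint to the CPU (available since JDK 9)
        }
    }

    public void unlock() {
        locked.set(false);
    }
}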

Since JDK 1.7, spinning has been enabled by default, with the number of spins decided by the JVM. I do not recommend configuring a large retry count, because prolonged CAS retries keep the CPU occupied.

If spinning still fails to acquire the lock, the synchronization lock is upgraded to a heavyweight lock and the lock flag changes to 10. In this state, threads that fail to acquire the lock enter the Monitor and are blocked in the _EntryList queue.

The red line part in the following figure shows the process of upgrading to a heavyweight lock after spinning:

[Figure: spinning and the upgrade to a heavyweight lock (red path)]

In scenarios where lock contention is mild and locks are held only very briefly, spinning improves system performance. But once contention is intense or a lock is held too long, spinning leaves many threads stuck in CAS retries, hogging CPU resources and increasing system overhead instead. Therefore, whether to rely on spin locks or heavyweight locks should be decided by the actual scenario.

In high-load, high-concurrency scenarios, we can disable spin-lock optimization by setting JVM parameters. The sample parameters are as follows:

-XX:-UseSpinning // Disable spin lock optimization (enabled by default)
-XX:PreBlockSpin // Modify the default spin count. Removed since JDK 1.7 and controlled by the JVM

Lock Elimination and Lock Coarsening via Dynamic Compilation #

In addition to lock upgrading, Java also optimizes locks in the compiler. When the JIT compiler dynamically compiles a synchronized block, it uses a technique called escape analysis to determine whether the lock object used by the block can only ever be accessed by one thread and has not been published to any other thread.

If that is confirmed, the JIT compiler does not generate the machine code for the lock acquisition and release that synchronized represents, thereby eliminating the lock entirely. Since Java 7, no manual configuration is required; this happens automatically.
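A classic illustration: in the sketch below (the method name is mine), sb never escapes the method, so the JIT may elide the locking performed by StringBuffer's synchronized append() calls:

public String concat(String s1, String s2) {
    StringBuffer sb = new StringBuffer();  // thread-local: never published to other threads
    sb.append(s1);                         // append() is synchronized, but the lock can be eliminated
    sb.append(s2);
    return sb.toString();
}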

Similarly, lock coarsening happens when the JIT compiler, during dynamic compilation, finds several adjacent synchronized blocks using the same lock instance: it merges them into one larger synchronized block, sparing the thread the cost of repeatedly acquiring and releasing the same lock.
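For example, in the sketch below (the method name is mine), each append() acquires sb's lock, and the JIT may coarsen the three adjacent acquire/release pairs into a single one around all three calls:

public void appendAll(StringBuffer sb) {
    sb.append("a");  // each append() locks sb...
    sb.append("b");  // ...so the JIT may merge these three lock regions
    sb.append("c");  // into one wider acquire/release pair
}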

Reducing Lock Granularity #

Besides the JVM's internal lock optimization and compiler optimization, we can also optimize locks at the code level, and reducing lock granularity is a common method.

When the lock object is an array or a queue, contention concentrates on that single object and can escalate the lock into a heavyweight lock. We can consider splitting the array or queue into several smaller objects, reducing lock contention and improving parallelism.

The classic example of reducing lock granularity is the ConcurrentHashMap implementation used before JDK 1.8. As we know, HashTable is built on an array plus linked lists and synchronizes all of its read and write methods on the single table object, so concurrent reads and writes contend fiercely for that one lock, creating a performance bottleneck. ConcurrentHashMap cleverly uses segmented locks (Segment) to reduce lock contention, as shown in the following image:

[Figure: ConcurrentHashMap's Segment-based segmented locking before JDK 1.8]
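In the same spirit, here is a hedged sketch of lock striping (all names are mine; it illustrates the idea, not ConcurrentHashMap's actual code): one logical counter is split into several stripes so that different keys contend on different locks:

public class StripedCounter {
    private static final int STRIPES = 16;
    private final Object[] locks = new Object[STRIPES];
    private final long[] counts = new long[STRIPES];

    public StripedCounter() {
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new Object();
        }
    }

    public void increment(Object key) {
        int stripe = (key.hashCode() & 0x7fffffff) % STRIPES;
        synchronized (locks[stripe]) {  // only keys mapped to the same stripe contend
            counts[stripe]++;
        }
    }

    public long total() {
        long sum = 0;
        for (int i = 0; i < STRIPES; i++) {
            synchronized (locks[i]) {
                sum += counts[i];
            }
        }
        return sum;
    }
}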

Summary #

The JVM introduced a tiered locking mechanism in JDK 1.6 to optimize Synchronized. When a thread first acquires the lock, the object lock becomes a biased lock, which eliminates the user-mode/kernel-mode context switching caused by the same thread repeatedly re-acquiring the same lock. Next, if multiple threads compete for the lock resource, the lock is upgraded to a lightweight lock, which suits scenarios where the lock is held briefly and threads alternate. Lightweight locks also use spinning to keep threads from frequently switching between user mode and kernel mode, greatly improving system performance. However, if lock competition grows too intense, the synchronization lock is upgraded to a heavyweight lock.

Reducing lock contention is the key to optimizing the Synchronized lock. We should try to keep the Synchronized lock in the lightweight or biased state to improve its performance. Reducing lock granularity is another commonly used optimization. In addition, shortening how long the lock is held improves the chance that spinning threads acquire the lock, avoiding the upgrade of the Synchronized lock to a heavyweight lock.

In this lesson, we focused on understanding the optimization of the Synchronized lock. Due to word limitations and for a better understanding of the content, I have split the content of Lecture 12 in the table of contents into two lessons. In the next lesson, I will focus on explaining the optimization methods of the Lock synchronization lock.

Thought Exercise #

What is the difference between using the synchronized keyword to modify an ordinary method versus a static method?

// Modifying an ordinary method
public synchronized void method1() {
    // code
}

// Modifying a static method
public synchronized static void method2() {
    // code
}

Answer:

When the synchronized keyword is used to modify an ordinary method, it means that only one thread can execute that method at a time for a specific instance of the class. This ensures that multiple threads cannot concurrently access or modify the object’s state through that method, preventing potential data inconsistency or race conditions.

On the other hand, when the synchronized keyword is used to modify a static method, it means that only one thread can execute that method at a time for the entire class. This ensures that multiple threads cannot concurrently access or modify any static variables or shared resources associated with the class, preventing the same potential issues as with ordinary methods.

In summary, the difference lies in the lock object: synchronized on an instance method locks the current instance (this), while synchronized on a static method locks the Class object of that class. Because the two locks are different objects, the two kinds of methods do not block each other.
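A small sketch (the class name is mine) making the two lock objects explicit:

public class LockScopeDemo {
    public synchronized void method1() {
        // equivalent to: synchronized (this) { ... }  -- locks this instance
    }

    public static synchronized void method2() {
        // equivalent to: synchronized (LockScopeDemo.class) { ... } -- locks the Class object
    }
}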