14 Multithreading Lock Optimization: Using Optimistic Locks to Optimize Concurrent Operations #

Hello, I’m Liu Chao.

In the previous two lectures, we discussed the synchronization mechanisms implemented by Synchronized and Lock. Both of these synchronization locks are pessimistic locks, which are the most intuitive way to ensure thread safety.

We know that in highly concurrent scenarios, intense competition for a pessimistic lock can leave threads blocked, and a large number of blocked threads causes context switching and increases the system's performance overhead. Is it possible to implement a non-blocking locking mechanism that still guarantees thread safety? The answer is yes. Today I will walk you through the optimization methods for optimistic locking and show you how to get the most value out of it.

What is Optimistic Locking #

Before we start optimizing, let’s briefly review the definition of optimistic locking.

Optimistic locking, as the name suggests, takes an optimistic attitude when operating on a shared resource: it assumes the operation will succeed. In reality, when multiple threads operate on a shared resource at the same time, only one of them will succeed. What about the threads that fail? Unlike with pessimistic locking, they are not suspended by the operating system; they simply return from the operation, and the system lets them either retry or give up and exit.

Therefore, compared with pessimistic locking, optimistic locking does not introduce problems such as deadlock and starvation, and threads interfere with each other far less. More importantly, optimistic locking avoids the system overhead caused by lock contention, so it performs better.

Implementation Principle of Optimistic Locking #

I believe you have some understanding of the content above. Now let’s take a look at the implementation principle of optimistic locking, which will help us summarize optimization methods from the fundamental level.

CAS (Compare-and-Swap) is the core algorithm behind optimistic locking. It involves three operands: V (the variable to be updated), E (the expected value), and N (the new value).

The variable V is set to the new value N only when its current value equals the expected value E. If the current value differs from the expected value, it means another thread has already updated the variable; in that case the current thread does nothing, and the CAS operation simply returns the actual current value of V.
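To make the CAS semantics concrete, here is a minimal illustrative sketch in plain Java. Real CAS is a single atomic CPU instruction; the synchronized method below only simulates its compare-then-set behavior, and the class and method names are invented for illustration.

public class SimulatedCAS {
    private int value; // V: the variable to be updated

    // Returns the value actually observed. The CAS succeeded
    // if and only if the returned value equals expectedValue (E).
    public synchronized int compareAndSwap(int expectedValue, int newValue) {
        int oldValue = value;
        if (oldValue == expectedValue) {
            value = newValue; // set V to N only when V == E
        }
        return oldValue; // otherwise V is left untouched and its real value is returned
    }
}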

1. How CAS Implements Atomic Operations #

In the JDK's concurrent package, the classes under java.util.concurrent.atomic are all implemented based on CAS. AtomicInteger is a thread-safe integer class built on CAS. Let's look at its source code to understand how CAS implements atomic operations.

We can see that AtomicInteger's increment method getAndIncrement delegates to the getAndAddInt method of Unsafe. Clearly, AtomicInteger relies on the Unsafe class, whose methods are native methods that invoke the CPU's underlying instructions to implement atomic operations.

// Update the value based on CAS operation
public final boolean compareAndSet(int expect, int update) {
    return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
}

// Increment by 1 based on CAS operation
public final int getAndIncrement() {
    return unsafe.getAndAddInt(this, valueOffset, 1);
}

// Decrement by 1 based on CAS operation
public final int getAndDecrement() {
    return unsafe.getAndAddInt(this, valueOffset, -1);
}
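As a quick usage sketch (my own example, not from the lecture's code), the snippet below has four threads increment one AtomicInteger concurrently; because getAndIncrement is a CAS-based atomic operation, the final count is always exact:

import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger(0);
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) {
                    counter.getAndIncrement(); // lock-free CAS-based increment
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join(); // wait for all writers to finish
        }
        System.out.println(counter.get()); // always prints 40000
    }
}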

2. How Processors Implement Atomic Operations #

CAS calls the underlying instructions of the processor to implement atomic operations, but how do processors implement atomic operations at the underlying level?

Communication between the processor and physical memory is much slower than the processor's own processing speed, so processors have internal caches. As shown in the figure below, frequently used memory data is cached in the processor's L1, L2, and L3 caches to speed up repeated reads.

[Figure: processor cache hierarchy, with L1, L2, and L3 caches between the cores and main memory]

In general, a single-core processor can guarantee the atomicity of basic memory operations: all threads run on the same core and see the same cache, so while one thread is reading a byte, no other thread can access that byte's memory address.

But today's servers are usually multi-processor machines, and each processor has multiple cores, each maintaining its own cache. Under multi-threaded concurrency, these caches can become inconsistent, which leads to data inconsistency.

At this point, the processor provides two mechanisms, bus locking and cache locking, to guarantee the atomicity of complex memory operations.

With bus locking, when a processor needs to operate on a shared variable, it asserts a LOCK signal on the bus. Other processors can then no longer operate on the shared variable, and the locking processor has exclusive access to that variable in shared memory. However, bus locking blocks other processors' memory requests while the lock is held, causing a large amount of blocking and increasing the system's performance overhead.

Later processors therefore provide a cache locking mechanism: when a processor operates on a shared variable in its cache, it notifies the other processors to discard their cached copy of the variable or re-read it from memory. Most modern processors support cache locking.

Optimizing CAS Optimistic Locking #

Although optimistic locking outperforms pessimistic locking under concurrency, the likelihood of CAS failure rises in write-heavy scenarios. If the operation is not abandoned on failure, the CAS must be retried in a loop, which can occupy the CPU for a long time.

In Java 7, as the code below shows, AtomicInteger's getAndSet method retries the CAS operation in a for loop; if it keeps failing for a long time, it imposes significant execution overhead on the CPU. In Java 8 the for loop is gone from AtomicInteger itself, but decompiling shows that the loop has simply moved into the Unsafe class, so the CPU overhead still exists.

public final int getAndSet(int newValue) {
    for (;;) {                                 // spin until the CAS succeeds
        int current = get();                   // read the current value
        if (compareAndSet(current, newValue))  // CAS: replace only if unchanged
            return current;                    // return the old value
    }
}
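For reference, in Java 8 that loop lives inside Unsafe: the getAndAddInt method of sun.misc.Unsafe is essentially the following do-while retry loop.

public final int getAndAddInt(Object o, long offset, int delta) {
    int v;
    do {
        v = getIntVolatile(o, offset);                      // read the current value
    } while (!compareAndSwapInt(o, offset, v, v + delta));  // retry until the CAS succeeds
    return v;
}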

In JDK 1.8, Java provides a new atomic class called LongAdder. LongAdder performs better than AtomicInteger and AtomicLong in highly concurrent scenarios, at the cost of consuming more memory space.

The principle of LongAdder is to lower the contention on the shared variable, that is, to spread the update pressure on a single shared variable across multiple variables. The writes from competing threads are scattered across an array, with different threads hitting different slots, and each thread performs CAS only on the value in its own slot. When the value is read, the shared base variable is added to the scattered values in the array, yielding an approximately accurate sum.

Internally, LongAdder consists of a base variable and a cell[] array. When there is a single write thread and no contention, LongAdder uses base directly as the atomic variable and modifies it via CAS. When multiple write threads are competing, apart from the one thread that wins the base variable, the others write their updates into their own slots of the cell[] array. The final result is given by the following formula:

value = base + Σ cell[i]   (summed over every slot in the cell[] array)

We can see that the value LongAdder returns while writes are still in progress is only approximately accurate; only after all updates have completed does it return an exact value. Therefore, in scenarios requiring high real-time accuracy, LongAdder cannot replace AtomicInteger or AtomicLong.
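A short usage sketch (my own example, with invented names) shows the API. Each thread calls increment(), and sum() is read only after all writers have joined, so the result here is exact:

import java.util.concurrent.atomic.LongAdder;

public class LongAdderDemo {
    public static void main(String[] args) throws InterruptedException {
        LongAdder adder = new LongAdder();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 100_000; j++) {
                    adder.increment(); // under contention, updates spread across cell[] slots
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        // sum() returns base plus all cell values; exact here because all writers finished
        System.out.println(adder.sum()); // prints 800000
    }
}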

Summary #

In daily development, the most common scenario for optimistic locking is database update operations. To ensure that updates are atomic, we often define a version number for each row and read it before updating; when we update the row, we check whether the version number has changed in the meantime, and proceed only if it has not.
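As an illustration, a version-checked update over JDBC might look like the sketch below. The table name (product), the columns (stock, version), and the helper method are all hypothetical, not from the lecture:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class VersionedUpdate {
    // Returns true if the optimistic update succeeded.
    static boolean updateStock(Connection conn, long productId,
                               int newStock, int expectedVersion) throws SQLException {
        String sql = "UPDATE product SET stock = ?, version = version + 1 "
                   + "WHERE id = ? AND version = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, newStock);
            ps.setLong(2, productId);
            ps.setInt(3, expectedVersion); // the version we read before updating
            // 0 rows updated means another transaction changed the row first,
            // just like a failed CAS: the caller can retry or give up.
            return ps.executeUpdate() == 1;
        }
    }
}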

CAS optimistic locks are limited in everyday use: they can only guarantee the atomicity of operations on a single variable, and are powerless when multiple variables are involved. Pessimistic locks, as discussed in the previous two lectures, can handle this by locking an entire code block.

In highly concurrent, write-heavy scenarios, most threads' atomic operations will fail, and the failed threads keep retrying the CAS operation, so a large number of threads occupy CPU resources for long periods and impose a significant performance cost on the system. In JDK 1.8, Java addressed this problem with the new atomic class LongAdder, which trades space for time.

In Lectures 11 through 13, I explained in detail the synchronized lock implemented by the JVM, the Lock implementations based on AQS, and the optimistic lock implemented with CAS. You may well be curious which lock performs best. Let's now compare the performance of these three kinds of locks.

Given that performance comparisons without a real business scenario are meaningless, we can test in three scenarios: more reads than writes, more writes than reads, and roughly equal reads and writes. And since lock performance also depends on the intensity of contention, we will test the three kinds of locks under different levels of contention as well.

Considering the above conditions, I will test five locks in four different modes: Synchronized, ReentrantLock, ReentrantReadWriteLock, StampedLock, and the optimistic lock LongAdder.

A brief explanation: I created four groups of tests using different combinations of read and write threads under different levels of contention. The test code uses a concurrent counter: read threads read the counter's value while write threads update it. The runtime environment is a 4-core i7 processor. The results are shown below, and you can view and download the full test code on GitHub.

[Figure: performance test results for the five locks across the four read/write scenarios]

From these results we can see that: in read-heavy scenarios, ReentrantReadWriteLock, StampedLock, and the optimistic lock offer the best read and write performance; in write-heavy scenarios, the optimistic lock performs best and the other four locks perform similarly; when reads and writes are roughly equal, the two read-write locks and the optimistic lock outperform Synchronized and ReentrantLock.
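The author's actual benchmark is in the GitHub repository linked above. Purely to make the methodology concrete, here is a hypothetical sketch of the test shape for one of the five locks (ReentrantReadWriteLock), with the reader/writer mix as the knob; all names are my own:

import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockBenchSketch {
    private static final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    private static long counter = 0;
    private static final LongAdder reads = new LongAdder();
    private static final LongAdder writes = new LongAdder();

    public static void main(String[] args) throws InterruptedException {
        int readerCount = 3, writerCount = 1; // vary this mix for each scenario
        long deadline = System.currentTimeMillis() + 1_000;
        Runnable reader = () -> {
            while (System.currentTimeMillis() < deadline) {
                rwLock.readLock().lock();
                try { long ignored = counter; } finally { rwLock.readLock().unlock(); }
                reads.increment(); // count completed read operations
            }
        };
        Runnable writer = () -> {
            while (System.currentTimeMillis() < deadline) {
                rwLock.writeLock().lock();
                try { counter++; } finally { rwLock.writeLock().unlock(); }
                writes.increment(); // count completed write operations
            }
        };
        Thread[] all = new Thread[readerCount + writerCount];
        for (int i = 0; i < readerCount; i++) all[i] = new Thread(reader);
        for (int i = 0; i < writerCount; i++) all[readerCount + i] = new Thread(writer);
        for (Thread t : all) t.start();
        for (Thread t : all) t.join();
        System.out.println("reads=" + reads.sum() + " writes=" + writes.sum());
    }
}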

Thought Question #

What is the ABA problem that can arise when using CAS operations?