20 Technique Series, Where Exactly Is Netty's Fast Thread Local Faster Than Thread Local

20 Technique series, where exactly is Netty’s FastThreadLocal faster than ThreadLocal #

In the previous few lessons on source code analysis, we have come across FastThreadLocal in the source code. As the name suggests, FastThreadLocal is a higher-performance communication framework compared to JDK’s ThreadLocal, in the context of Netty being a high-performance network communication framework. But what exactly makes FastThreadLocal faster than ThreadLocal? In this lesson, let’s explore the secrets behind the high performance of FastThreadLocal.

Note: This article refers to Netty version 4.1.42.Final.

Basic principles of JDK ThreadLocal #

JDK ThreadLocal is not only a frequently asked interview question, but also a commonly used tool in everyday work. Therefore, let’s first understand the implementation principles of Java’s native ThreadLocal. This will help us better compare and understand Netty’s FastThreadLocal.

If you need to isolate variables between multiple threads, or share them among classes and methods within the same thread, ThreadLocal comes into play. ThreadLocal can be understood as a thread-local variable. It is a very important class in Java concurrent programming. ThreadLocal creates a copy of the variable in each thread, which can only be accessed by the current thread. The copies are isolated between multiple threads, and the variable cannot be shared among them. This way, when each thread modifies its own copy of the variable, it does not affect other threads.

Next, let’s see how ThreadLocal is used through an example:

public class ThreadLocalTest {

    private static final ThreadLocal<String> THREAD_NAME_LOCAL = ThreadLocal.withInitial(() -> Thread.currentThread().getName());

    private static final ThreadLocal<TradeOrder> TRADE_THREAD_LOCAL = new ThreadLocal<>();

    public static void main(String[] args) {

        for (int i = 0; i < 2; i++) {

            int tradeId = i;

            new Thread(() -> {

                TradeOrder tradeOrder = new TradeOrder(tradeId, tradeId % 2 == 0 ? "Paid" : "Unpaid");

                TRADE_THREAD_LOCAL.set(tradeOrder);

                System.out.println("Thread name: " + THREAD_NAME_LOCAL.get());

                System.out.println("Trade order info: " + TRADE_THREAD_LOCAL.get());

            }, "thread-" + i).start();

        }

    }

    static class TradeOrder {
        long id;
        String status;

        public TradeOrder(int id, String status) {
            this.id = id;
            this.status = status;
        }

        @Override
        public String toString() {
            return "id=" + id + ", status=" + status;
        }
    }
}

In the above example, two ThreadLocal variables, THREAD_NAME_LOCAL and TRADE_THREAD_LOCAL, are created to record the current thread name and order transaction information, respectively. ThreadLocal supports generics, so THREAD_NAME_LOCAL and TRADE_THREAD_LOCAL store data of types String and TradeOrder, respectively. You can set and retrieve instances of ThreadLocal using the set()/get() methods. Now, let’s see the output of the example code:

Thread name: thread-0
Trade order info: id=0, status=Paid
Thread name: thread-1
Trade order info: id=1, status=Unpaid
tradeOrder info: id=1, status=unpaid

tradeOrder info: id=0, status=paid

From the code, we can see that even though thread-1 and thread-2 are operating on the same ThreadLocal object, they obtained different thread names and order transaction information. So how can multiple ThreadLocal objects exist within a thread, and how are they stored and retrieved by each ThreadLocal object?

Next, let’s take a look at the implementation principle of ThreadLocal. Since each thread accessing a ThreadLocal variable has its own independent instance copy, it is easy to think of a solution to maintain a Map in ThreadLocal to record the mapping relationship between threads and instances. When adding or destroying threads, the mapping relationship in the Map needs to be updated. However, since there may be concurrent modifications by multiple threads, the Map needs to be thread-safe. Is this how JDK implements ThreadLocal? The answer is NO. In order to avoid locking during concurrent modification of the Map in high-concurrency scenarios, JDK adopts the opposite design approach. Starting with the Thread, a Map is maintained in the Thread to record the mapping relationship between ThreadLocal and the instance. Therefore, within the same thread, the Map does not need to be locked. The relationship between the thread Thread and ThreadLocal and ThreadLocalMap in the example code can be represented by the following figure:

Drawing 0.png

So how is the Map that maintains the mapping relationship implemented inside the Thread? From the source code, we can see that the Thread uses the internal class ThreadLocalMap of ThreadLocal, so the relationship between Thread, ThreadLocal, and ThreadLocalMap can be represented by the following figure:

Drawing 1.png

In order to better understand ThreadLocal, it is necessary to understand the internal implementation of ThreadLocalMap. ThreadLocalMap is a hash table implemented using linear probing and uses an array to store data. As shown in the figure below, ThreadLocalMap initializes an Entry array with a length of 16, and each Entry object is used to store the key-value pair. Different from HashMap, the key of Entry is the ThreadLocal object itself, and the value is the specific value that the user needs to store.

Drawing 2.png

How does it solve hash collisions when calling ThreadLocal.set() to add Entry objects? This requires an understanding of the implementation principle of linear probing. When initializing, each ThreadLocal has a hash value threadLocalHashCode. For each additional ThreadLocal, the hash value increases by a fixed magic number HASH_INCREMENT = 0x61c88647. Why is the magic number 0x61c88647 used? Experimental results have shown that by accumulating threadLocalHashCode generated by adding 0x61c88647 and taking the result modulo 2 to the power of N, the results can be evenly distributed in an array of size 2 to the power of N. With the basis of threadLocalHashCode, let’s explain how linear probing is implemented through the following table.

图片2.png

For easier understanding, we use a set of simple data to simulate how ThreadLocal.set() resolves hash collisions.

  1. threadLocalHashCode = 4, threadLocalHashCode & 15 = 4; At this time, the data should be placed at index 4 of the array. There is no data at index 4, so it can be stored.
  2. threadLocalHashCode = 19, threadLocalHashCode & 15 = 4; But there is already data at index 4. If the Entry being added and the Entry already present at index 4 have the same key, the value of the Entry at that position will be overwritten with the new value. Let’s assume that the keys are different, so we need to move one position forward. There is no conflict at index 5, so it can be stored.
  3. threadLocalHashCode = 33, threadLocalHashCode & 15 = 3; There is already data at index 3. Move one position forward, there is still data at index 4, so continue to search forward, and find that there is no data at index 6, so it can be stored.

The process of ThreadLocal.get() is similar. It also locates the array index based on the value of threadLocalHashCode, and then determines whether the current position Entry object and the key of the Entry object being queried are the same. If they are different, it continues to search downwards. It can be seen that when ThreadLocal.set()/get() is used in data-intensive scenarios, hash collisions are prone to occur, and it takes O(n) time complexity to resolve the collision, which is inefficient.

Next, let’s talk about the design principle of Entry in ThreadLocalMap. Entry inherits WeakReference, and the key of Entry is a weak reference, while the value is a strong reference. During JVM garbage collection, once the weakly referenced object is discovered, it will be collected regardless of whether the memory is sufficient or not. So why does the key of Entry need to be designed as a weak reference? Let’s imagine, if the key is a strong reference, even when ThreadLocal is no longer used, there is still a strong reference to ThreadLocal in ThreadLocalMap, so GC cannot collect it, resulting in memory leaks.

Although the key of Entry is designed as a weak reference, when ThreadLocal is no longer used and is GC’d, the key of Entry in ThreadLocalMap may become NULL, so the value of Entry will always hold a strong reference to the data and cannot be released until the thread is destroyed. How can we avoid memory leaks in ThreadLocalMap? ThreadLocal has already taken some protective measures for us. When executing ThreadLocal.set()/get(), ThreadLocal will clear the Entry objects with a NULL key in ThreadLocalMap so that they can be collected by GC. In addition, when a ThreadLocal object in a thread is no longer used, immediately call the remove() method to delete the corresponding Entry object. In the case of exceptions, remember to clean up in the finally block and maintain a good coding habit. We have already introduced the basic principles of ThreadLocal in JDK. Since ThreadLocal is already mature and widely used in daily development, why does Netty need to implement its own FastThreadLocal? Is it really much faster than ThreadLocal? Let’s explore it together.

Why is FastThreadLocal fast? #

The implementation of FastThreadLocal is very similar to ThreadLocal. Netty has tailored two important classes, FastThreadLocalThread and InternalThreadLocalMap, for FastThreadLocal. Let’s take a look at how these two classes are implemented.

FastThreadLocalThread is a wrapper for the Thread class. Each thread corresponds to an instance of InternalThreadLocalMap. Only when FastThreadLocal and FastThreadLocalThread are used together can the performance advantages of FastThreadLocal be realized. Let’s first look at the source code definition of FastThreadLocalThread:

public class FastThreadLocalThread extends Thread {

    private InternalThreadLocalMap threadLocalMap;

    // other code omitted

}

It can be seen that FastThreadLocalThread mainly extends the InternalThreadLocalMap field. We can guess that FastThreadLocalThread mainly uses InternalThreadLocalMap to store data, instead of ThreadLocalMap in Thread. So to understand the high performance of FastThreadLocalThread, we must understand the design principles of InternalThreadLocalMap.

In the previous section, we mentioned an important drawback of ThreadLocal, which is that ThreadLocalMap uses linear probe to solve hash conflicts, resulting in slower performance. So how does InternalThreadLocalMap optimize it? Let’s take a look at the internal construction of InternalThreadLocalMap:

public final class InternalThreadLocalMap extends UnpaddedInternalThreadLocalMap {

    private static final int DEFAULT_ARRAY_LIST_INITIAL_CAPACITY = 8;

    private static final int STRING_BUILDER_INITIAL_SIZE;

    private static final int STRING_BUILDER_MAX_SIZE;

    public static final Object UNSET = new Object();

    private BitSet cleanerFlags;
    private InternalThreadLocalMap() {

        super(newIndexedVariableTable());

    }
```java
private static Object[] newIndexedVariableTable() {

    Object[] array = new Object[32];

    Arrays.fill(array, UNSET);

    return array;

}
public static int nextVariableIndex() {

    int index = nextIndex.getAndIncrement();

    if (index < 0) {

        nextIndex.decrementAndGet();

        throw new IllegalStateException("too many thread-local indexed variables");

    }

    return index;

}

// Omitted other code

}

class UnpaddedInternalThreadLocalMap {

    static final ThreadLocal<InternalThreadLocalMap> slowThreadLocalMap = new ThreadLocal<InternalThreadLocalMap>();

    static final AtomicInteger nextIndex = new AtomicInteger();
    Object[] indexedVariables;

    UnpaddedInternalThreadLocalMap(Object[] indexedVariables) {

        this.indexedVariables = indexedVariables;

    }

    // Omitted other code

}

From the internal implementation of InternalThreadLocalMap, it can be seen that it uses an array as storage, just like ThreadLocalMap. However, InternalThreadLocalMap does not use linear probing to solve hash conflicts. Instead, it assigns an array index "index" when FastThreadLocal is initialized. The value of "index" is incrementing in order and is obtained by calling InternalThreadLocalMap.nextVariableIndex() method. When reading or writing data, it directly locates the position of FastThreadLocal through the array index "index", with a time complexity of O(1). If the array index becomes very large, the array will also become larger. Therefore, FastThreadLocal improves performance by trading space for time. The relationship between InternalThreadLocalMap, index, and FastThreadLocal is described in the following diagram:

![Drawing 3.png](../images/Ciqc1F_qw1KAUXO0AAMZJ_Hk4dQ099.png)

From the internal structure diagram of FastThreadLocal we saw above, what are the differences between it and ThreadLocal? FastThreadLocal uses an Object array to replace the Entry array. Object[0] stores a Set<Map.Entry<FastThreadLocal<?>, Object>> collection. Starting from array index 1, it stores values directly instead of using the key-value pair form to store, as ThreadLocal does.

Assuming we have a set of data to be added to the array, which are value1, value2, value3, and value4, and the corresponding array indexes initialized by FastThreadLocal are 1, 2, 3, and 4. The diagram below shows this:

![Drawing 4.png](../images/Ciqc1F_qw1qAYzsdAAEbsTk70Is389.png)

So far, we have a basic understanding of FastThreadLocal. Next, we will analyze the implementation principles of FastThreadLocal in detail through the specific source code.

### Source Code Analysis of FastThreadLocal

Before explaining the source code, let's look back at the ThreadLocal example in the previous section. If we replace ThreadLocal with FastThread, how should we use it?
```java
public class FastThreadLocalTest {

    private static final FastThreadLocal<String> THREAD_NAME_LOCAL = new FastThreadLocal<>();

    private static final FastThreadLocal<TradeOrder> TRADE_THREAD_LOCAL = new FastThreadLocal<>();

    public static void main(String[] args) {

        for (int i = 0; i < 2; i++) {

            int tradeId = i;

            String threadName = "thread-" + i;

            new FastThreadLocalThread(() -> {

                THREAD_NAME_LOCAL.set(threadName);

                TradeOrder tradeOrder = new TradeOrder(tradeId, tradeId % 2 == 0 ? "已支付" : "未支付");

                TRADE_THREAD_LOCAL.set(tradeOrder);

                System.out.println("threadName: " + THREAD_NAME_LOCAL.get());

                System.out.println("tradeOrder info:" + TRADE_THREAD_LOCAL.get());

            }, threadName).start();

        }

    }

}

As you can see, the usage of FastThreadLocal is almost the same as ThreadLocal. You just need to replace Thread and ThreadLocal with FastThreadLocalThread and FastThreadLocal respectively. Netty has done an excellent job in terms of ease of use. Let’s focus on the set()/get() methods in the example and analyze it in depth.

First, let’s look at the source code of FastThreadLocal.set():

public final void set(V value) {

    if (value != InternalThreadLocalMap.UNSET) { // 1. Check if the value is the default value

        InternalThreadLocalMap threadLocalMap = InternalThreadLocalMap.get(); // 2. Get the InternalThreadLocalMap of the current thread

        setKnownNotUnset(threadLocalMap, value); // 3. Replace the data in InternalThreadLocalMap with the new value

    } else {

        remove();

    }

}

Although the set() method of FastThreadLocal has only a few lines of code in its entry, the internal logic is quite complex. Let’s start by breaking down the code step by step. The set() process mainly consists of the following three steps:

  1. Check if the value is the default value. If it is, call the remove() method directly. At this point, we don’t know the relationship between the default value and the remove() method, so let’s analyze remove() later.
  2. If the value is not the default value, the next step is to get the InternalThreadLocalMap of the current thread.
  3. Then replace the corresponding data in InternalThreadLocalMap with the new value.

First, let’s look at the InternalThreadLocalMap.get() method, the source code is as follows:

public static InternalThreadLocalMap get() {

    Thread thread = Thread.currentThread();

    if (thread instanceof FastThreadLocalThread) { // Check if the current thread is of type FastThreadLocalThread

        return fastGet((FastThreadLocalThread) thread);

    } else {

        return slowGet();

    }

}

private static InternalThreadLocalMap fastGet(FastThreadLocalThread thread) {

    InternalThreadLocalMap threadLocalMap = thread.threadLocalMap(); // Get the threadLocalMap attribute of FastThreadLocalThread

    if (threadLocalMap == null) {

        thread.setThreadLocalMap(threadLocalMap = new InternalThreadLocalMap());
        variablesToRemove = Collections.newSetFromMap(new IdentityHashMap<FastThreadLocal<?>, Boolean>());

Collections.newSetFromMap() 方法创建了一个基于 IdentityHashMap 的 Set 集合。IdentityHashMap 是一种特殊的 Map,它通过对象的引用地址来判断两个键是否相等,而不是通过 equals() 方法。这样做的目的是为了避免不同的 FastThreadLocal 对象在引用地址不相同的情况下被误删。

继续看 addToVariablesToRemove() 的代码:

        boolean added = variablesToRemove.add(variable);
        
        if (added) {
            threadLocalMap.setIndexedVariable(variablesToRemoveIndex, variablesToRemove);
        }
}

通过 variablesToRemove.add(variable) 将 FastThreadLocal 对象添加到 Set 中,并返回添加是否成功的结果。如果添加成功,则将 Set 对象存放到 InternalThreadLocalMap 数组下标为 0 的位置。

至此,FastThreadLocal 的 set() 方法的流程就完整地分析完了。总结一下 set() 的主要步骤:

  1. 获取当前线程的 InternalThreadLocalMap。
  2. 根据 FastThreadLocal 的索引,设置数据值。
  3. 将 FastThreadLocal 对象保存到待清理的 Set 中。

接下来,我们将分析 FastThreadLocal 的 get() 方法的源码实现。

// Set the Set collection into the position of index 0 in the array

threadLocalMap.setIndexedVariable(variablesToRemoveIndex, variablesToRemove);

Otherwise, if the element is not UNSET or does not exist, directly cast and obtain the Set collection

variablesToRemove = (Set<FastThreadLocal<?>>) v;

Add the FastThreadLocal to the Set collection

variablesToRemove.add(variable);

The variablesToRemoveIndex is a variable that is declared as static final. When FastThreadLocal is initialized, variablesToRemoveIndex is assigned a value of 0. In InternalThreadLocalMap, it first finds the element at index 0 in the array. If the element is the default object UNSET or does not exist, it creates a Set collection of type FastThreadLocal and fills it into the position of index 0 in the array. If the first element of the array is not the default object UNSET, it means that the Set collection has already been filled, so it can be directly cast to obtain the Set collection. This explains why the value of InternalThreadLocalMap starts storing from the position of index 1, because the position 0 is already occupied by the Set collection.

So why does InternalThreadLocalMap store a Set collection of type FastThreadLocal at position 0 in the array? Now let’s take a look at the remove() method again.

public final void remove() {
    remove(InternalThreadLocalMap.getIfSet());
}

public static InternalThreadLocalMap getIfSet() {
    Thread thread = Thread.currentThread();
    if (thread instanceof FastThreadLocalThread) {
        return ((FastThreadLocalThread) thread).threadLocalMap();
    }
    return slowThreadLocalMap.get();
}

public final void remove(InternalThreadLocalMap threadLocalMap) {
    if (threadLocalMap == null) {
        return;
    }
    Object v = threadLocalMap.removeIndexedVariable(index); // Remove the value at position index in the array
    removeFromVariablesToRemove(threadLocalMap, this); // Retrieve the Set collection from position 0 in the array and remove the current FastThreadLocal
    if (v != InternalThreadLocalMap.UNSET) {
        try {
            onRemoval((V) v); // Empty method, users can inherit and implement it
        } catch (Exception e) {
            PlatformDependent.throwException(e);
        }
    }
}

Before performing the remove operation, the getIfSet() method is called to obtain the current InternalThreadLocalMap. With the knowledge we have gained before, understanding the getIfSet() method is straightforward. If it is of type FastThreadLocalThread, it directly retrieves the threadLocalMap attribute from FastThreadLocalThread. If it is a regular Thread, it gets it from the slowThreadLocalMap of type ThreadLocal. Once the InternalThreadLocalMap is found, it will locate the element at position index in the array and replace it with the default object UNSET. Next, the current FastThreadLocal object needs to be cleaned up, and this is where the Set collection comes in handy. The InternalThreadLocalMap will retrieve the Set collection at position 0 in the array, and then remove the current FastThreadLocal. Finally, what does the onRemoval() method do? Netty has left it as an extension point and not implemented it. Users need to inherit from FastThreadLocal and implement this method to perform any post-removal operations.

So far, the process of the set() method of FastThreadLocal has been explained. Next, let’s move on to the implementation of the get() method, which is straightforward. The source code for get() method of FastThreadLocal is as follows:

public final V get() {
    InternalThreadLocalMap threadLocalMap = InternalThreadLocalMap.get();
    Object v = threadLocalMap.indexedVariable(index); // Retrieve the element at position index in the array
if (v != InternalThreadLocalMap.UNSET) {
    return (V) v;
}

return initialize(threadLocalMap); // If the obtained array element is the default object, perform initialization
public Object indexedVariable(int index) {
    Object[] lookup = indexedVariables;
    return index < lookup.length? lookup[index] : UNSET;
}
private V initialize(InternalThreadLocalMap threadLocalMap) {
    V v = null;
    try {
        v = initialValue();
    } catch (Exception e) {
        PlatformDependent.throwException(e);
    }
    
    threadLocalMap.setIndexedVariable(index, v);
    addToVariablesToRemove(threadLocalMap, this);
    
    return v;
}

First, based on whether the current thread is of type FastThreadLocalThread, find InternalThreadLocalMap and retrieve the element at index from the array. If the element at index is not the default object UNSET, it means that the position has already been filled with data, so it can be directly retrieved and returned. If the element at index is the default object UNSET, then the initialization operation needs to be performed. As seen below, the initialize() method will call the user-defined initialValue method to construct the object data to be stored.

private final FastThreadLocal<String> threadLocal = new FastThreadLocal<String>() {
    @Override
    protected String initialValue() {
        return "hello world";
    }
};

After constructing the user object data, it will be filled into the position at index in the array, and then the current FastThreadLocal object will be saved to the Set of variables to be removed. The entire process has already been introduced when analyzing FastThreadLocal.set(), so it will not be repeated here.

So far, we have finished analyzing the two core methods of FastThreadLocal, set() and get(). Now let’s delve into two additional questions.

Is FastThreadLocal really faster than ThreadLocal? The answer is not necessarily. Only threads of type FastThreadLocalThread will be faster. If it is a regular thread, it may actually be slower.

Does FastThreadLocal waste a lot of space? Although FastThreadLocal utilizes the trade-off between space and time, it was initially designed with the belief that there would not be a large number of FastThreadLocal objects, and the unused elements in the data only store references to the same default object, so it does not occupy too much memory space.

Summary #

In this lesson, we compared and introduced ThreadLocal and FastThreadLocal. Let’s summarize the advantages of FastThreadLocal.

  • Effective lookup: When locating data, FastThreadLocal can directly retrieve based on the array index, with a time complexity of O(1). On the other hand, the JDK’s native ThreadLocal is prone to hash conflicts when there is a large amount of data, and the linear probing method used to resolve hash conflicts requires continuous searching, resulting in lower efficiency. In addition, FastThreadLocal is more efficient and simple in data expansion compared to ThreadLocal. FastThreadLocal rounds up the index to the next power of 2 as the expanded capacity, and then copies the original data to the new array. In contrast, ThreadLocal needs to rehash after expansion because it uses a hash table.
  • Higher security: Improper use of the JDK’s native ThreadLocal may cause memory leaks, and it can only be resolved when the thread is destroyed. In the case of using a thread pool, ThreadLocal can only prevent memory leaks through active detection, resulting in certain overhead. However, FastThreadLocal not only provides the remove() method to actively remove objects, but in the thread pool scenario, Netty also encapsulates FastThreadLocalRunnable. FastThreadLocalRunnable will ultimately execute FastThreadLocal.removeAll() to clear all FastThreadLocal objects in the Set collection.

FastThreadLocal embodies Netty’s spirit of continuous improvement in high performance design. FastThreadLocal is just the tip of the iceberg. In the next lesson, we will continue exploring other efficient data structure techniques in Netty.