31 Why Convert to Red Black Trees When More Than 8 Items in a Map Bucket

31 Why Convert to Red-Black Trees When More Than 8 Items in a Map Bucket #

In this lesson, we will mainly explain why buckets in Map are converted to red-black trees only when there are more than 8 elements.

Both HashMap and ConcurrentHashMap in JDK 1.8 have the following feature: the initial Map is empty, meaning there are no elements in it. When elements are added, the hash value is calculated, and the first value will occupy a bucket (also known as a slot point). If subsequent elements need to be placed in the same bucket, they will be linked together in a linked list, commonly known as “chaining”, as shown in the figure:

img

In the figure, some buckets are empty, such as the 4th one; some have only one element, such as 1, 3, and 6; and some use the chaining method, such as the 2nd and 5th buckets.

When the length of the linked list is greater than or equal to the threshold (default is 8), if the capacity is also greater than or equal to MIN_TREEIFY_CAPACITY (default is 64), the linked list will be transformed into a red-black tree. Similarly, if the size is adjusted due to deletion or other reasons and the number of nodes in the red-black tree is less than or equal to 6, it will be converted back to a linked list.

Let’s take a look at the structure diagram of HashMap again:

img

In the diagram, we can see that some slots are empty, some are linked lists, and some are red-black trees.

Most of the time, we care about why the conversion to red-black tree is necessary and the characteristics of red-black trees. However, why is the threshold for conversion set to 8 by default? To understand why it is set to 8, we first need to know why the conversion is necessary because it is the first step.

When traversing a linked list, the average time complexity of searching is O(n), where n is the length of the linked list. Red-black trees have different search performance compared to linked lists. Due to the self-balancing characteristics of red-black trees, they can prevent imbalance situations from occurring, thereby always controlling the search time complexity at O(log(n)). Initially, when the linked list is not very long, the difference between O(n) and O(log(n)) may not be significant. However, if the linked list becomes longer and longer, this difference will be reflected. Therefore, in order to improve the search performance, we need to convert the linked list into the form of a red-black tree.

Why not use a red-black tree from the beginning and instead go through a conversion process? In fact, the JDK source code comments have already explained this issue:

Because TreeNodes are about twice the size of regular nodes,
use them only when bins contain enough nodes to warrant use
(see TREEIFY_THRESHOLD). And when they become too small (due
removal or resizing) they are converted back to plain bins.

This means that a single TreeNode occupies approximately twice the space of a regular Node, so TreeNodes are only used when there are enough Nodes in the bins, and whether there are enough Nodes is determined by the value of TREEIFY_THRESHOLD. And when the number of nodes in a bucket becomes smaller due to removal or resizing, it will be converted back to ordinary linked list form in order to save space.

By checking the source code, we can find that the linked list is converted to a red-black tree when the length reaches 8 by default, and it is converted back when the length drops to 6. This reflects the idea of balancing time and space. At the beginning, when using a linked list, the space occupation is relatively small, and the query time is not a big problem due to the short length of the linked list. However, as the linked list becomes longer and longer, a red-black tree is needed to ensure efficient queries. To determine when to convert from a linked list to a red-black tree, a threshold needs to be determined, and this threshold is set to 8 by default. The source code also explains why the number 8 was chosen, the original text is as follows:

In usages with well-distributed user hashCodes, tree bins are rarely used.
Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution
(http://en.wikipedia.org/wiki/Poisson_distribution) with a parameter of about 0.5 on average for the default resizing
threshold of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected
occurrences of list size k are (exp(-0.5) * pow(0.5, k) / factorial(k)). The first values are:

 0:    0.60653066

 1:    0.30326533

 2:    0.07581633

 3:    0.01263606

 4:    0.00157952

The meaning of the above passage is that if the hashCode distribution is good, meaning the results of the hash computation are well-dispersed, the red-black tree form is rarely used because values are evenly distributed and it is rare to have a very long linked list. In an ideal situation, the length of the linked list follows the Poisson distribution, and the probability of hitting each length decreases successively. When the length is 8, the probability is only 0.00000006. This is a probability less than one in ten million. Typically, we don’t store so much data in our Map, so in most cases, there will be no conversion from a linked list to a red-black tree.

However, HashMap determines which bucket an element falls into based on its hashCode, and the JDK does not prevent users from implementing their own hash algorithms. If we intentionally make the hash algorithm uneven, for example:

@Override
public int hashCode() {
    return 1;
}

Here, the calculated hashCode value is always 1, which can easily lead to a very long linked list in the HashMap. Let’s take a look at the following code:

public class HashMapDemo {
    public static void main(String[] args) {
        HashMap map = new HashMap<HashMapDemo,Integer>(1);
        for (int i = 0; i < 1000; i++) {
            HashMapDemo hashMapDemo1 = new HashMapDemo();
            map.put(hashMapDemo1, null);
        }
        System.out.println("运行结束");
    }
    @Override
    public int hashCode() {
        return 1;
    }
}

In this example, we create a HashMap and continuously put values into it. The key objects being put into the map have their hashCode method overridden and always return 1. When running this code and pausing it at the System.out.println(“运行结束”) statement, we can observe that the nodes inside the map have become TreeNodes instead of the usual Nodes, indicating that it has been converted to a red-black tree.

The truth is that the design of converting to a red-black tree when the length of the linked list exceeds 8 is more of a fallback strategy to prevent poor hash algorithms implemented by users from causing long linked lists and thereby resulting in low query efficiency. Converting to a red-black tree in such cases is mainly intended to ensure query efficiency under extreme circumstances.

Usually, if the hash algorithm is normal, the length of the linked list will not be very long, and the red-black tree will not bring significant advantages in query time, but will increase the space burden. Therefore, in most cases, there is no need to convert to a red-black tree, so a very small probability, less than one in ten million, namely the probability of length 8, is chosen as the default threshold for conversion.

So, if during development, HashMap or ConcurrentHashMap internally shows a red-black tree structure, it often means that our hash algorithm has a problem. We need to check if we have implemented a poorly performing hashCode method and improve it to reduce collisions.