07 Dive Into the Design and Optimization of HashMap #

Hello, I’m Liu Chao.

In the previous lecture, I mentioned the Collection interface. Apart from this interface, there is another important interface defined in Java container classes, which is the Map interface. It is mainly used to store key-value pair data.

HashMap is one of the most commonly used containers in everyday development, so I believe you are already familiar with it. Today, let’s start with the underlying implementation of HashMap and gain a deeper understanding of its design and optimization.

Common Data Structures #

In the 5th lecture, when I shared the List collection classes, I mentioned that ArrayList is implemented on top of an array and LinkedList on top of a linked list. HashMap, today’s topic, is implemented on top of a hash table. Let’s first review the commonly used data structures together; this will also help you better understand the content that follows.

Array: It stores data in a contiguous block of memory. Accessing an element by index takes O(1) time, but inserting data in the middle or at the beginning of an array requires copying and shifting all subsequent elements.

Linked List: It is a non-continuous and non-sequential storage structure in physical memory units. The logical order of data elements is implemented by linking pointers in the linked list.

A linked list consists of a series of nodes, where each node (element in the linked list) can be dynamically generated at runtime. Each node contains two parts: the “data field of the storage data unit” and the “pointer field of the storage address of the next node”.

Since a linked list does not require its elements to be stored contiguously, inserting an element at a known position is O(1). However, finding a node, or accessing the node at a specific position, takes O(n) time.

Hash Table: It is a data structure that allows direct access based on the key value. By mapping the key value to a position in the table, the access speed is accelerated. This mapping function is called a hash function, and the array that stores the records is called a hash table.

Tree: It is a collection of n (n ≥ 1) finite nodes organized in a hierarchical relationship, just like an upside-down tree.

Implementation Structure of HashMap #

After understanding the data structure, let’s take a look at the implementation structure of HashMap. As the most commonly used Map class, it is implemented based on a hash table, inheriting AbstractMap and implementing the Map interface.

A hash table maps the hash value of a key to a memory address: it computes the address from the key and stores the corresponding value there. In other words, HashMap decides where a value is stored based on the hash value of its key, and it is this indexing scheme that lets HashMap retrieve data so quickly.

For example, when storing a key-value pair (x, “aa”), the hash table will use the hash function f(x) to determine the storage location of “aa”.

However, there can be a new problem. If another pair (y, “bb”) comes in, and the hash value of hash function f(y) is the same as the previous f(x), then the storage addresses of the two objects will conflict. This phenomenon is called hash collision. So how does the hash table solve this problem? There are many ways, such as open addressing, double hashing, and separate chaining.

Open addressing is simple: when a hash collision occurs, as long as the hash table is not full there must still be empty slots, so the key can be stored in an empty slot found by probing from the conflicting position. This method has quite a few drawbacks, such as slower lookups and more complicated expansion, so I do not recommend it as the first choice for resolving hash collisions.

Double hashing, as the name suggests, computes another hash address with a different hash function when a collision occurs, repeating until no collision remains. This approach is not prone to “clustering,” but it increases computation time. If we do not mind the extra cost of adding elements and have very high requirements for query performance, we can consider this design.

HashMap considers all factors comprehensively and solves the hash collision problem using separate chaining. This method uses a data structure of an array (hash table) + linked list. When a hash collision occurs, it stores the data with the same hash value in a linked list structure.
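To make the chaining idea concrete, here is a minimal sketch of an array-plus-linked-list bucket structure, written for illustration only: it is not the JDK implementation, and the names ChainedHashMap and Entry, the fixed capacity of 16, and the omission of null-key handling and resizing are all simplifying assumptions.

// A minimal sketch of separate chaining, for illustration only (not the JDK code).
public class ChainedHashMap<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;   // link to the next entry that hashed to the same bucket

        Entry(K key, V value, Entry<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];

    public void put(K key, V value) {
        int index = (table.length - 1) & key.hashCode();      // map the hash to a bucket
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (e.key.equals(key)) {                           // key already present: overwrite
                e.value = value;
                return;
            }
        }
        table[index] = new Entry<>(key, value, table[index]);  // collision: link into the chain
    }

    public V get(K key) {
        int index = (table.length - 1) & key.hashCode();
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }
}

In this sketch a new colliding entry is linked at the head of the chain for brevity; the JDK 8 HashMap appends at the tail, as the putVal code later in this lecture shows.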

Important Properties of HashMap #

From the source code of HashMap, we can see that HashMap is composed of an array of Nodes, where each Node contains a key-value pair.

transient Node<K,V>[] table;

The Node class is an inner class of HashMap. Besides the key and value fields, it also defines a next pointer. When a hash collision occurs, the Node already stored at that array slot uses next to hold a reference to the newly added Node with the same hash value.

static class Node<K,V> implements Map.Entry<K,V> {
    final int hash;
    final K key;
    V value;
    Node<K,V> next;
    
    Node(int hash, K key, V value, Node<K,V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
}

HashMap also has two important properties: load factor and threshold. These two key initialization parameters are involved when initializing HashMap.

int threshold;
   
final float loadFactor;

The loadFactor property indirectly determines the memory space used by the table array (the hash table). When a HashMap is initialized without any parameters, the default loadFactor value is 0.75. Why 0.75?

This is because, for a hash table that resolves collisions by chaining, the average time to find an element is O(1 + n), where n is the length of the linked list. So the larger the load factor, the better the space utilization, but the longer the linked lists become and the lower the lookup efficiency. If the load factor is set too small, the data in the hash table will be too sparse and space will be seriously wasted.

Is there any way to solve the problem of high query time complexity caused by a long linked list? Think about it, I will talk about it in the later content.

The threshold of the table array is computed from the initial capacity and the loadFactor (threshold = capacity × loadFactor). When a HashMap is initialized without any parameters, the default threshold is 12 (16 × 0.75). Once the number of Nodes in the HashMap exceeds the threshold, HashMap calls the resize() method to reallocate the table array. This copies the array and migrates the data to a new block of memory, which affects the efficiency of the HashMap.
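As a quick illustration of how capacity and load factor interact, here is a small sketch; the class name ThresholdDemo is made up, and the thresholds noted in the comments simply follow from threshold = capacity × loadFactor rather than from any public API.

import java.util.HashMap;
import java.util.Map;

public class ThresholdDemo {
    public static void main(String[] args) {
        // Default construction: capacity 16, load factor 0.75,
        // so the expected threshold is 16 * 0.75 = 12 and the first resize
        // should happen when the 13th entry is added.
        Map<String, String> defaults = new HashMap<>();

        // Explicit construction: capacity 64, load factor 0.5,
        // so the expected threshold is 64 * 0.5 = 32.
        Map<String, String> custom = new HashMap<>(64, 0.5f);

        defaults.put("a", "1");
        custom.put("a", "1");
        System.out.println(defaults.size() + ", " + custom.size());
    }
}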

Optimization of Adding Elements to HashMap #

After initialization, key-value pairs can be added to HashMap with the put() method. From the source code below, you can see that when a key-value pair is added, the program first takes the key’s hashCode() value, further mixes it through the hash() method, and then determines the position of the Node via (n - 1) & hash in the putVal() method.

public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

// Fragment of putVal():
if ((tab = table) == null || (n = tab.length) == 0)
    n = (tab = resize()).length;
// (n - 1) & hash determines the storage position of the Node in the putVal method
if ((p = tab[i = (n - 1) & hash]) == null)
    tab[i] = newNode(hash, key, value, null);

If you are not familiar with the algorithm of hash() and (n-1)&hash, please read the detailed description below.

Let’s first understand the algorithm in the hash() method. What would happen if we didn’t use the hash() method to calculate the hashCode, but directly used the hashCode value of the object?

Suppose we want to add two objects a and b, and the length of the array is 16. Objects a and b would be placed using the formula (n - 1) & hash, i.e., (16 - 1) & a.hashCode and (16 - 1) & b.hashCode. The 32-bit binary representation of 15 is 00000000000000000000000000001111. Suppose the hashCode of object a is 01000010001110001000001111000000 and the hashCode of object b is 00111011100111000101000010100000. You will find that both bitwise AND results are 0, so a and b end up at the same index. Such a hashing result is disappointing and is clearly not a good hashing algorithm.

But if we first shift the hashCode 16 bits to the right (h >>> 16 is an unsigned right shift by 16 bits), taking the high half of the 32-bit int, and then XOR it with the original value (XOR yields 1 where the corresponding bits differ and 0 where they are the same), the high 16 bits get mixed into the low 16 bits and the situation above is avoided. This is exactly what the hash() method does. In short, it shuffles as much as possible the low 16 bits of the hashCode, which are the bits that actually take part in the index computation.
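The following small sketch reproduces this mixing step on an arbitrary sample hashCode, so you can see how the high 16 bits change which bucket is chosen; the class name HashSpreadDemo and the sample value are made up for illustration.

public class HashSpreadDemo {
    // The same spreading logic as HashMap.hash(): XOR the high 16 bits into the low 16 bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int h = 0b01000010_00111000_10000011_11000000;   // an arbitrary sample hashCode
        int mixed = spread(h);
        System.out.println(Integer.toBinaryString(h));
        System.out.println(Integer.toBinaryString(mixed));
        // With a table of length 16, only the low 4 bits decide the bucket,
        // so mixing in the high bits changes the chosen index.
        System.out.println("bucket before mixing: " + ((16 - 1) & h));       // 0
        System.out.println("bucket after mixing:  " + ((16 - 1) & mixed));   // 8
    }
}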

Now let me explain how (n - 1) & hash is designed. Here, n is the length of the hash table, and by convention the length is always a power of 2, which guarantees that the index computed by (n - 1) & hash always falls within the bounds of the table array. For example, when hash = 15 and n = 16 the result is 15; when hash = 17 and n = 16 the result is 1.
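A tiny sketch that verifies the two example calculations above; the class name IndexDemo is made up.

public class IndexDemo {
    public static void main(String[] args) {
        int n = 16;                          // table length, always a power of 2
        System.out.println((n - 1) & 15);    // 15: already within [0, 15]
        System.out.println((n - 1) & 17);    // 1: 17 wraps around, just like 17 % 16
    }
}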

After obtaining the storage position of the Node, if it is determined that the Node is not in the hash table, a new Node is added to the hash table. I will illustrate the entire process with a diagram:

img

From the diagram, we can see that: In JDK 1.8, HashMap introduces the red-black tree data structure to improve the query efficiency of the linked list.

This is because once the length of a linked list exceeds 8, a red-black tree offers better query efficiency than the list, so when a list grows past 8 nodes HashMap converts it into a red-black tree. It is worth noting that, when this happens, insertion becomes somewhat slower because of the left and right rotations the tree performs to stay balanced. This also answers the question I raised earlier about the high query cost caused by long linked lists.
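For reference, the constants that govern this conversion in the JDK 8 HashMap source are listed below; note in particular that a bin is only treeified when the table itself has reached MIN_TREEIFY_CAPACITY, otherwise HashMap simply resizes instead.

// Constants from the JDK 8 java.util.HashMap source.
static final int TREEIFY_THRESHOLD = 8;      // convert a bin to a red-black tree at this chain length
static final int UNTREEIFY_THRESHOLD = 6;    // convert a tree bin back to a list when it shrinks to this size
static final int MIN_TREEIFY_CAPACITY = 64;  // below this table length, resize instead of treeifying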

Here is the implementation code for put:

 final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
                   boolean evict) {
        Node<K,V>[] tab; Node<K,V> p; int n, i;
        if ((tab = table) == null || (n = tab.length) == 0)
// 1. If table is null or the length of tab is 0, i.e., table has not been initialized yet, initialize the table by calling resize()
            n = (tab = resize()).length;
        if ((p = tab[i = (n - 1) & hash]) == null)
// 1.1. Calculate the value obtained by (n - 1) & hash as the index i of tab, and let p represent tab[i], which is the position of the first node in the linked list. Check if p is null
            tab[i] = newNode(hash, key, value, null);
// 1.1.1 p is null, meaning there is no element at tab[i] yet, so create the first Node by calling newNode and assign the returned node to tab[i]
        else {
// 2.1 Here p is not null. There are three cases: p is a linked-list node; p is a red-black-tree node; p is a linked-list node whose length has reached the critical length TREEIFY_THRESHOLD, so inserting one more element will turn it into a red-black tree
            Node<K,V> e; K k;
            if (p.hash == hash &&
                ((k = p.key) == key || (key != null && key.equals(k))))
// 2.1.1 In HashMap, two keys are considered the same only if they have the same hash and satisfy equals(). Here we check whether p.key is equal to the inserted key; if so, assign the reference of p to e
                e = p;
            else if (p instanceof TreeNode)
// 2.1.2 Now we start the first case, where p is a red-black tree node, so it must still be a red-black tree node after insertion, so we directly use the type conversion p and call the TreeNode.putTreeVal method, and assign the reference returned to e
                e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
            else {
// 2.1.3 Next is the case where p is a linked list node, which is the other two cases mentioned above: still a linked list after insertion / turning into a red-black tree after insertion. In addition, the type conversion code above also indicates that TreeNode is a subclass of Node
                for (int binCount = 0; ; ++binCount) {
// We need a counter to count the number of elements in the current linked list, and traverse the linked list. binCount is this counter
 
                    if ((e = p.next) == null) {
                        p.next = newNode(hash, key, value, null);
                        if (binCount >= TREEIFY_THRESHOLD - 1) 
// After successful insertion, it needs to be determined whether it needs to be converted into a red-black tree, because the length of the linked list is increased by 1 after insertion, and binCount does not include the new node, so when judging, the critical threshold needs to be reduced by 1
                            treeifyBin(tab, hash);
// When the new length meets the conversion conditions, call the treeifyBin method to convert the linked list to a red-black tree
                        break;
                    }
                    if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                        break;
                    p = e;
                }
            }
            if (e != null) { // existing mapping for key
                V oldValue = e.value;
                if (!onlyIfAbsent || oldValue == null)
                    e.value = value;
                afterNodeAccess(e);
                return oldValue;
            }
        }
        ++modCount;
        if (++size > threshold)
            resize();
        afterNodeInsertion(evict);
        return null;
    }

Optimization of Element Retrieval in HashMap #

When there are only arrays in the HashMap and no Node linked list, it is the best performance scenario for data retrieval in HashMap. Once a significant number of hash collisions occur, Node linked lists are generated, and every element retrieval may require traversing the Node linked list, thus reducing the efficiency of data retrieval.

Especially when the length of the linked list becomes long, the performance will significantly decrease. The use of a red-black tree effectively solves this problem, reducing the average complexity of element retrieval to O(log(n)). As the linked list grows longer, the efficiency improvement of using a red-black tree replacement becomes more significant.

In our own coding practice, we can also tune HashMap for better performance. For example, by overriding the hashCode() method of the key class (keeping it consistent with equals()) so that hash values are well distributed, we can reduce hash collisions and thus minimize the creation of linked lists, letting us use HashMap efficiently and get better performance.
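As a sketch of that advice, here is a hypothetical key class (the name OrderKey and its fields are invented for this example) whose hashCode() is overridden consistently with equals(), so that equal keys always land in the same bucket while distinct keys are spread reasonably well:

import java.util.Objects;

// Hypothetical key class, used only to illustrate a well-behaved hashCode()/equals() pair.
public final class OrderKey {
    private final String customerId;
    private final long orderNo;

    public OrderKey(String customerId, long orderNo) {
        this.customerId = customerId;
        this.orderNo = orderNo;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof OrderKey)) return false;
        OrderKey other = (OrderKey) o;
        return orderNo == other.orderNo && customerId.equals(other.customerId);
    }

    @Override
    public int hashCode() {
        // Combine both fields so that different keys rarely share a hash value.
        return Objects.hash(customerId, orderNo);
    }
}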

HashMap Optimization for Capacity Expansion #

HashMap is also an array-based data structure, so it also has the possibility of capacity expansion.

In JDK 1.7, HashMap’s expansion works like this: each bucket of the old array is taken in turn, and the element obtained first is usually the one that was inserted into that linked list last. Starting from that element, the linked list is traversed; for each element, its index in the new array is recomputed from its hash value, and the element is then inserted at the head of the corresponding new bucket. As a result, the tail of an original colliding linked list becomes the head of the list after expansion.

In JDK 1.8, HashMap optimizes the expansion process. Since the expanded array is twice the original length, for example growing tableSize from 4 to 8 is the change from 0100 to 1000 (a left shift by one bit doubles the value), during expansion the hash value only needs to be bitwise ANDed with the old capacity (the newly significant bit of the new table size) to decide whether the element keeps its original index or moves to the original index plus the length of the array before expansion.

The reason this single bitwise AND can be used to reassign indexes is that hash values are essentially random. When a hash is ANDed with the old capacity, the newly significant bit is randomly either 0 (the element keeps its original index) or 1 (the element moves to its original index plus the length of the array before expansion). The expansion process therefore redistributes previously colliding elements across different indexes in a roughly random way.
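Here is a small sketch of this redistribution rule; the class name ResizeDemo and the sample hash values (which all map to index 5 before expansion) are made up for illustration.

public class ResizeDemo {
    public static void main(String[] args) {
        int oldCap = 16;
        int newCap = oldCap << 1;                  // 32: the capacity doubles on resize

        int[] sampleHashes = {5, 21, 37, 53};      // arbitrary hashes that all map to index 5 in the old table
        for (int hash : sampleHashes) {
            int oldIndex = (oldCap - 1) & hash;
            int newIndex = ((hash & oldCap) == 0)  // test the newly significant bit
                    ? oldIndex                     // bit is 0: the index stays the same
                    : oldIndex + oldCap;           // bit is 1: index = old index + old capacity
            System.out.println(hash + ": " + oldIndex + " -> " + newIndex
                    + " (direct recomputation: " + ((newCap - 1) & hash) + ")");
        }
    }
}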

Summary #

HashMap stores key-value pairs in the form of a hash table data structure, which has the advantage of high efficiency in querying key-value pairs.

When using HashMap, we can set the initial capacity and load factor parameters based on our specific scenario. If the query operation is frequent, we can reduce the load factor appropriately; if the requirement for memory utilization is high, we can increase the load factor.

We can also set the initial capacity in advance when we know the expected data size (initial capacity = expected data size / load factor). This can reduce the resize() operation and improve the efficiency of HashMap.
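For example, if we expect to store about 1,000 entries with the default load factor of 0.75, presizing the map roughly along these lines should avoid intermediate resize() calls; the class name PresizeDemo is made up, and the exact rounding up to a power of 2 is left to HashMap itself.

import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    public static void main(String[] args) {
        int expectedSize = 1000;
        float loadFactor = 0.75f;
        // initial capacity = expected data size / load factor, so the threshold
        // is not crossed while the map is being filled
        int initialCapacity = (int) (expectedSize / loadFactor) + 1;

        Map<Integer, String> map = new HashMap<>(initialCapacity, loadFactor);
        for (int i = 0; i < expectedSize; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println(map.size());
    }
}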

HashMap also uses a combination of array and linked list data structures to implement the chaining method. When there is a hash value collision, the conflicting key-value pairs can be linked as a linked list.

However, this method has a performance problem. If the linked list is too long, the time complexity of querying data will increase. HashMap solves this issue by using red-black trees in Java 8 to address the performance degradation caused by long linked lists. The following is a diagram of the HashMap data structure:

img

Thought Question #

In practical applications, when we set the initial capacity, it is generally a power of 2. Do you know why?

The reason for this is that many hash-based data structures, such as hash tables and hash maps, use the initial capacity to determine the number of buckets or slots they should have. The underlying implementation often utilizes bit manipulation operations for better performance.

By setting the initial capacity to a power of 2, the implementation can use bitwise AND operations instead of modulo operations to calculate the bucket index. Bitwise operations are generally faster and more efficient than modulo operations.
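A quick sketch showing that, when the table length n is a power of 2, (n - 1) & hash yields the same bucket index as hash % n for non-negative hashes while using only a cheap bit operation; the class name PowerOfTwoDemo is made up.

public class PowerOfTwoDemo {
    public static void main(String[] args) {
        int n = 16;                                          // a power-of-2 table length
        for (int hash : new int[]{15, 16, 17, 100, 12345}) {
            System.out.println(hash + ": " + ((n - 1) & hash) + " == " + (hash % n));
        }
    }
}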

In addition, choosing a power of 2 for the initial capacity ensures that the number of buckets remains a power of 2 after every resize. This makes rehashing efficient in the way described earlier: each entry either stays at its old index or moves by exactly the old capacity, which helps maintain the overall performance of the data structure.

Therefore, setting the initial capacity to a power of 2 is a common practice in order to optimize the performance and efficiency of hash-based data structures in practical applications.