09 Comparing the Differences Between Hashtable, HashMap, and TreeMap

Keep in mind that Hashtable is the legacy Map implementation, HashMap is the commonly used hash-table-based implementation, and TreeMap is implemented on top of a red-black tree.

  1. Hashtable: a thread-safe class that can be accessed by multiple threads simultaneously. It achieves thread safety with the synchronized keyword, which also means that using Hashtable in a multi-threaded environment can hurt performance. In addition, Hashtable does not allow null keys or values.

  2. HashMap: the non-thread-safe counterpart of Hashtable. Because it is not thread-safe, extra synchronization is required when HashMap is used across multiple threads; in return, it performs better than Hashtable. HashMap allows null keys and values.

  3. TreeMap: it implements the SortedMap interface and therefore keeps its entries sorted by key. Keys must either implement the Comparable interface or be ordered by a Comparator supplied at construction time. Unlike Hashtable and HashMap, TreeMap is based on a red-black tree, so its basic operations run in O(log n) time, which still gives good performance when manipulating Map entries.

For HashMap, you need to understand how the hashing works and how hash collisions (two keys with the same hash code) are handled. You also need to know how to choose an appropriate initial capacity and load factor to obtain better performance.

I hope this comparison and the introduction to HashMap are helpful.


Hashtable, HashMap, and TreeMap are some of the most common implementations of Map, which are container types used to store and manipulate data in the form of key-value pairs.

Hashtable is an early implementation of a hash table provided by the Java library. It is synchronized and does not support null keys or values. Due to the performance overhead caused by synchronization, it is rarely recommended for use.

HashMap is a more widely used implementation of a hash table. Its behavior is similar to Hashtable, with the main difference being that HashMap is not synchronized and supports null keys and values. In general, HashMap offers constant-time performance for put and get operations, making it the preferred choice for most key-value storage scenarios, such as implementing a runtime storage structure for user IDs and user information.

TreeMap, on the other hand, is a Map implementation based on a red-black tree. Unlike HashMap, its get, put, and remove operations have a time complexity of O(log(n)). The specific order can be determined by a specified Comparator or based on the natural order of the keys.
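To make these differences concrete, here is a minimal sketch (the class and key names are mine, not from the original text) contrasting iteration order and null-key handling in HashMap and TreeMap:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class MapOrderDemo {
    public static void main(String[] args) {
        Map<String, Integer> hashMap = new HashMap<>();
        // TreeMap sorts by the keys' natural order here (String implements Comparable);
        // a Comparator such as Comparator.reverseOrder() could be supplied instead.
        Map<String, Integer> treeMap = new TreeMap<>();

        for (String key : new String[]{"banana", "apple", "cherry"}) {
            hashMap.put(key, key.length());
            treeMap.put(key, key.length());
        }

        System.out.println("HashMap (no guaranteed order): " + hashMap);
        System.out.println("TreeMap (sorted by key):       " + treeMap);

        hashMap.put(null, 0);   // allowed: HashMap accepts one null key
        // treeMap.put(null, 0) would throw NullPointerException with natural ordering
    }
}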

Examination Focus Analysis #

The answer above is just a brief summary of basic features. Many questions can be expanded from Map, from the underlying data structures and typical application scenarios to the engineering considerations behind their design and implementation; in particular, HashMap itself changed significantly in Java 8, and these changes are frequently tested.

Many friends have provided feedback that interviewers seem to prefer examining the design and implementation details of HashMap. So today, I will add relevant source code explanations, focusing mainly on the following aspects:

  • Understanding the overall structure of Map-related classes, especially the key points of ordered data structures.
  • Analyzing the design and implementation points of HashMap from the source code, understanding parameters such as capacity and load factor, why these parameters are needed, how they affect the performance of Map, and how to make trade-offs in practical use.
  • Understanding the principles and improvement reasons related to tree transformation.

In addition to typical code analysis, some interesting concurrency-related issues are also often brought up, such as HashMap possibly entering an infinite loop and burning CPU in a concurrent environment, or reporting an inaccurate size.

I think this is a typical usage error because HashMap explicitly states that it is not a thread-safe data structure. If you ignore this point and simply use it in a multi-threaded scenario, problems are bound to occur.
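If a map really has to be shared between threads, the usual remedies are a synchronized wrapper or ConcurrentHashMap rather than a bare HashMap. A minimal sketch of both options:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SafeMapChoices {
    public static void main(String[] args) {
        // Wraps every method in a synchronized block: simple, but coarse-grained locking.
        Map<String, String> synced = Collections.synchronizedMap(new HashMap<>());

        // Designed for concurrent access; usually the better choice under contention.
        Map<String, String> concurrent = new ConcurrentHashMap<>();

        synced.put("user-1", "Alice");
        concurrent.put("user-2", "Bob");
        System.out.println(synced + " " + concurrent);
    }
}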

Understanding the reasons for these errors is also a good way to deepen your understanding of running concurrent programs. For specific details on what happens, you can refer to this analysis from a long time ago. It even provides a diagram. I won’t repeat the content that others have already written.

Knowledge Expansion #

  1. Overall structure of Map

First, let’s have a general understanding of the different types of Map. Although Map is usually included in the Java collection framework, it is not strictly a collection type itself. You can refer to the following class diagram for more details.

Map class diagram

Hashtable is special. As an early collection-related type similar to Vector and Stack, it extends the Dictionary class, which is structurally different from HashMap and similar Map implementations.

Other Map implementations like HashMap extend AbstractMap, which contains common abstract methods. The purpose of different Map implementations is reflected in their different interfaces.

In most situations where a Map is used, the main operations are inserting, looking up, or removing entries, with no particular requirement on ordering; in such cases HashMap is usually the best choice. The performance of HashMap depends heavily on how well hashCode is implemented, so it is essential to understand the basic conventions of hashCode and equals (a minimal illustration follows the list below), including:

  • If two objects are equal, their hash codes must also be equal.
  • If you override equals, you must also override hashCode.
  • The hash code must be consistent: as long as the state that equals depends on has not changed, repeated calls must return the same value (in practice, a key already stored in a hash-based Map should not be mutated in a way that changes its hash code).
  • equals should possess symmetry, reflexivity, and transitivity.
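As a minimal illustration of keeping the two methods in sync (the UserId class below is invented for this example):

import java.util.Objects;

// A simple value class intended for use as a HashMap key: equals and hashCode
// are overridden together and derived from the same fields.
public final class UserId {
    private final String tenant;
    private final long id;

    public UserId(String tenant, long id) {
        this.tenant = tenant;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof UserId)) return false;
        UserId other = (UserId) o;
        return id == other.id && Objects.equals(tenant, other.tenant);
    }

    @Override
    public int hashCode() {
        return Objects.hash(tenant, id);   // equal objects yield equal hash codes
    }
}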

There are many resources available online about this topic, so I won’t go into detail here.

The answer above only touched briefly on ordered Maps, so let me add some details. Although LinkedHashMap and TreeMap both provide a certain ordering, they are quite different.

  • LinkedHashMap usually maintains insertion order during iteration by maintaining a doubly-linked list of entries. Note that by using a specific constructor, we can create an instance that reflects access order, where put, get, compute, etc., count as “access”.

This behavior suits certain application scenarios. For example, if we need to build a space-sensitive resource pool that automatically evicts the least recently accessed objects, we can use the mechanism provided by LinkedHashMap. Here's an example:

import java.util.LinkedHashMap;
import java.util.Map;

public class LinkedHashMapSample {
    public static void main(String[] args) {
        LinkedHashMap<String, String> accessOrderedMap = new LinkedHashMap<String, String>(16, 0.75F, true){
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > 3;
            }
        };
        accessOrderedMap.put("Project1", "Valhalla");
        accessOrderedMap.put("Project2", "Panama");
        accessOrderedMap.put("Project3", "Loom");
        accessOrderedMap.forEach( (k,v) -> {
            System.out.println(k +":" + v);
        });
        // Simulating access
        accessOrderedMap.get("Project2");
        accessOrderedMap.get("Project2");
        accessOrderedMap.get("Project3");
        System.out.println("Iterate over should be not affected:");
        accessOrderedMap.forEach( (k,v) -> {
            System.out.println(k +":" + v);
        });
        // Trigger removal
        accessOrderedMap.put("Project4", "Mission Control");
        System.out.println("Oldest entry should be removed:");
        accessOrderedMap.forEach( (k,v) -> {
            System.out.println(k +":" + v);
        });
    }
}
  • For TreeMap, the overall order is determined by the order of keys, which is specified by a Comparator or the natural order (Comparable).

The scenario I mentioned earlier, building a priority-based scheduling system, is essentially a typical priority queue scenario. The Java standard library provides PriorityQueue, implemented on top of a binary heap, while TreeMap and its close relative TreeSet rely on the same sorting mechanism. Just like the conventions for hashCode and equals, the natural ordering must follow a convention to avoid ambiguity: the result of compareTo should be consistent with equals; otherwise ambiguous behavior may occur.

Let’s analyze the put method implementation of TreeMap:

public V put(K key, V value) {
    Entry<K,V> t = root;
    // ... (simplified: the actual code walks down the tree in a loop)
    cmp = k.compareTo(t.key);
    if (cmp < 0)
        t = t.left;
    else if (cmp > 0)
        t = t.right;
    else
        return t.setValue(value);
    // ...
}

What can we observe from this code? When the convention is violated, two objects that are not equal according to equals but whose compareTo returns 0 will be treated as the same key, leading to ambiguous behavior.
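A minimal sketch of that ambiguity (the Task class is invented for illustration): two distinct objects whose compareTo returns 0 collapse into a single TreeMap entry, while a HashMap keeps both.

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class CompareToVsEquals {
    // Ordered only by priority, so compareTo can return 0 for objects that are not equals().
    static final class Task implements Comparable<Task> {
        final String name;
        final int priority;
        Task(String name, int priority) { this.name = name; this.priority = priority; }
        @Override public int compareTo(Task o) { return Integer.compare(priority, o.priority); }
        // equals/hashCode deliberately not overridden: the two tasks below are distinct objects.
    }

    public static void main(String[] args) {
        Task backup = new Task("backup", 1);
        Task cleanup = new Task("cleanup", 1);   // same priority, different task

        Map<Task, String> tree = new TreeMap<>();
        tree.put(backup, "first");
        tree.put(cleanup, "second");             // compareTo == 0: treated as the same key

        Map<Task, String> hash = new HashMap<>();
        hash.put(backup, "first");
        hash.put(cleanup, "second");             // distinct keys: both kept

        System.out.println("TreeMap size = " + tree.size());  // prints 1
        System.out.println("HashMap size = " + hash.size());  // prints 2
    }
}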

  2. HashMap Source Code Analysis

As mentioned earlier, analyzing the design and implementation of HashMap is a common interview question, so I will provide a relatively detailed explanation here, focusing on:

  • The key points of the internal implementation of HashMap.
  • Capacity and load factor.
  • Treeify.

First, let’s take a look at the internal structure of HashMap. It can be seen as a composite structure of an array (Node[] table) and linked lists. The array is divided into buckets, and the hash value determines the address of the key-value pair in the array. Key-value pairs with the same hash value are stored in the form of linked lists. Note that if the size of the linked list exceeds a threshold (TREEIFY_THRESHOLD, 8), the linked list will be transformed into a tree-like structure.

HashMap internal structure

Looking at the constructors (other than the copy constructor), the table array is not allocated up front; only a few initial parameter values are set.

public HashMap(int initialCapacity, float loadFactor) {
    // ...
    this.loadFactor = loadFactor;
    this.threshold = tableSizeFor(initialCapacity);
}

So we can infer that HashMap follows a lazy-load approach: the table is only created when it is first used (the copy constructor is an exception that I will not cover here). That being the case, let's look at the put method implementation; it appears to do nothing but call putVal:

public V put(K key, V value) {
    return putVal(hash(key), key, value, false, true);
}

It seems that the main secret lies in the putVal method. To save space, I will only show the key parts of the putVal method.

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // ...
}

What can we see from this code? If the table is null or its length is 0, resize is called to allocate (or grow) the table array. From these first few lines of putVal, we can note several interesting points:

  • If the table is null, the resize method will initialize it, as seen from tab = resize().
  • The resize method has two responsibilities: creating the initial storage table or resizing it when the capacity is not sufficient.
  • During the process of putting a new key-value pair, resizing occurs if the following condition is met:
if (++size > threshold)
    resize();
  • The specific position (array index) of the key-value pair in the hash table depends on the following bitwise operation:
i = (n - 1) & hash
  • Upon careful observation of where the hash value comes from, we can see that it is not the key's own hashCode, but the result of another hash method inside HashMap. Why XOR the high bits down into the low bits? Because bucket addressing only uses the low-order bits of the hash ((n - 1) & hash ignores everything above the capacity), keys that differ mainly in their high bits would otherwise never influence the bucket index; spreading the high bits down therefore reduces collisions in such cases (see the small sketch after this list):
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
  • As mentioned earlier, when a bucket's linked list (referred to as a “bin” here) reaches a certain threshold, it is treeified. I will explain later why HashMap treats bins this way.
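Below is a small sketch of that spreading step (the hash values are arbitrary numbers chosen for illustration). With 16 buckets only the low four bits select the index, so two hash codes that differ only around bit 16 would collide unless the high bits are folded down:

public class HashSpreadDemo {
    // Same spreading idea as HashMap's hash(): XOR the high 16 bits into the low 16 bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        int n = 16;                    // table capacity, always a power of two
        int h1 = 0x00010001;           // these two hash codes differ only in bits above 15
        int h2 = 0x00020001;

        System.out.println(((n - 1) & h1) + ", " + ((n - 1) & h2));                 // 1, 1 -> same bucket
        System.out.println(((n - 1) & spread(h1)) + ", " + ((n - 1) & spread(h2))); // 0, 3 -> different buckets
    }
}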

As you can see, the putVal method packs a lot of logic into one place: it is responsible for lazy initialization, resizing, and treeification. When reading the source code, I recommend focusing on the main paths outlined above.

Let’s further analyze the multi-functional resize method, which many people have reported being frequently asked about by interviewers.

final Node<K,V>[] resize() {
    // ...
    else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
             oldCap >= DEFAULT_INITIAL_CAPACITY)
        newThr = oldThr << 1; // double the threshold
    // ...
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else {
        // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThr == 0) {
        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    // ... move existing entries to the new array
}

According to the `resize` source code, without considering extreme situations (the maximum capacity is determined by `MAXIMUM_CAPACITY`, which is 1<<30, i.e., 2 to the power of 30), we can summarize as follows:

- The threshold is equal to (load factor) multiplied by (capacity), and if they are not specified when creating the HashMap, they are based on the corresponding default constant values.
- When the table grows, the threshold is typically doubled along with it (`newThr = oldThr << 1`). As mentioned earlier, based on the logic in `putVal`, the Map is resized when the number of elements exceeds the threshold.
- After resizing, the elements in the old array need to be repositioned in the new array, which is the main cost of resizing.

  3. Capacity, Load Factor, and Treeification

We have briefly discussed the relevant logic of HashMap from creation to insertion of key-value pairs. Now let's think about why we need to consider capacity and load factor.

This is because the capacity and load factor determine the number of available buckets. Having too many empty buckets will waste space, while having too many elements in each bucket will severely affect the performance of operations. In extreme cases, if we assume there is only one bucket, it degenerates into a linked list and cannot provide the so-called constant-time storage performance at all.

Since capacity and load factor are so important, how should we choose them in practice?

If we know the number of key-value pairs to be stored in the HashMap, we can consider setting an appropriate capacity in advance. We can estimate the capacity based on the conditions for resizing. Based on the code analysis above, we know that it needs to satisfy the calculation condition:

load factor * capacity > number of elements

Therefore, the pre-set capacity needs to be larger than "estimated number of elements / load factor", and it should be a power of 2. The conclusion is already very clear.
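For example, here is a minimal pre-sizing sketch (the helper name is mine, not a library method): choose a capacity above expected / loadFactor, and HashMap will round it up to the next power of two internally.

import java.util.HashMap;

public class PresizedMap {
    // Pick a capacity so that `expected` entries stay below the resize threshold
    // (threshold = capacity * loadFactor); HashMap rounds the capacity up to a power of two.
    static <K, V> HashMap<K, V> withExpectedSize(int expected) {
        float loadFactor = 0.75f;                                   // the JDK default
        int capacity = (int) Math.ceil(expected / (double) loadFactor) + 1;
        return new HashMap<>(capacity, loadFactor);
    }

    public static void main(String[] args) {
        HashMap<Integer, String> map = withExpectedSize(1000);      // should hold 1000 entries without resizing
        for (int i = 0; i < 1000; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println(map.size());
    }
}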

As for the load factor, I suggest:
- If there are no special requirements, do not change it easily, because the default load factor of the JDK itself is very suitable for general scenarios.
- If you do need to adjust it, it is recommended not to set a value greater than 0.75, as it will significantly increase conflicts and decrease the performance of HashMap.
- If you do use a smaller load factor, adjust the pre-set capacity accordingly using the same formula above; otherwise resizing may happen more frequently, adding unnecessary overhead and hurting access performance.

We mentioned treeification, which corresponds to the logic in `putVal` and `treeifyBin`.

final void treeifyBin(Node<K,V>[] tab, int hash) {
    int n, index; Node<K,V> e;
    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize();
    else if ((e = tab[index = (n - 1) & hash]) != null) {
        // Logic for treeification
    }
}


The above is a simplified representation of `treeifyBin`. Combining these two methods, the treeification logic is clear. When the number of nodes in a single bin exceeds `TREEIFY_THRESHOLD`:

- If the capacity is less than `MIN_TREEIFY_CAPACITY`, it only performs simple resizing.
- If the capacity is at least `MIN_TREEIFY_CAPACITY`, it treeifies the bin, converting its linked list into a red-black tree.

So why does HashMap need to be treeified?

**Essentially, this is a security issue.** During the process of placing elements, if multiple objects have the same hash and are placed in the same bucket, it forms a linked list. As we know, searching in a linked list is linear and severely affects the performance of storing and retrieving data.

In the real world, it is not very difficult to construct data that causes hash collisions. Malicious code can use this data to interact with the server in large quantities, causing the server's CPU to be heavily occupied. This constitutes a hash collision denial of service attack. Leading internet companies in China have experienced similar attacks.
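As a quick illustration of how easy collisions are to produce (a well-known String example, not taken from the original article): "Aa" and "BB" have identical hash codes, and longer colliding keys can be built by concatenating such pairs.

public class CollisionDemo {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode() == "BB".hashCode());       // true: both are 2112
        System.out.println("AaAa".hashCode() == "BBBB".hashCode());   // true: concatenations also collide
    }
}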

Today, I have compared several Map implementations, analyzed the points that are easy to confuse about ordered collection types, and walked through the basic structure of HashMap at the source-code level. I hope this helps you.

Practice Exercises #

Do you have a clear understanding of the topic we discussed today? I'll leave you with a question to think about: What are some typical methods for resolving hash conflicts?

Please write your thoughts on this question in the comments section. I will select thoughtful comments and reward you with a learning encouragement bonus. Feel free to join me in discussing this.

Are your friends also preparing for interviews? You can "invite a friend to read" and share today's question with them. Perhaps you can help them out.