
17 Data structures: Core component of software systems - hash table and memory layout #

Hello, I’m Chen Tian.

In the last session, we took a deep dive into slices, comparing arrays, lists, and strings, their slices, and the relationship between slices and references. Today, let’s continue with another very important collection container in Rust: HashMap, also known as a hash table.

When it comes to the most important and most frequently used data structures in software development, hash tables definitely rank among them. Many programming languages even build hash tables into the core of the language as an intrinsic data structure. Examples include PHP’s associative arrays, Python’s dictionaries, and JavaScript’s objects and Maps.

Google engineer Matt Kulukundis said in his CppCon 2017 talk that 1% of the CPU time on Google’s servers worldwide is spent computing hashes for hash tables, and more than 4% of the memory is used to store them. That alone proves how important hash tables are.

We know that hash tables, like lists, serve data access patterns that require random access. If a data structure’s inputs can be mapped one-to-one onto list indices, a list is enough; if they can’t, we need a hash table.

  • An illustrative diagram of hierarchical memory with tree structures and linked elements
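To make the distinction concrete, here is a minimal sketch (the data is made up for illustration): when inputs are dense integer indices, a list (Vec) already gives direct random access; when inputs are arbitrary keys such as strings, a hash table is the natural fit:

use std::collections::HashMap;

fn main() {
    // Inputs are dense indices 0..n: a Vec maps input (index) to output directly
    let scores = vec![90, 80, 70];
    assert_eq!(scores[1], 80);

    // Inputs are arbitrary strings: a hash table maps them to values
    let mut scores_by_name = HashMap::new();
    scores_by_name.insert("alice", 90);
    assert_eq!(scores_by_name.get("alice"), Some(&90));
}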

Rust’s HashMap #

So, what kind of hash table does Rust provide? What does it look like, and how does it perform? Let’s start with the official documentation.

If you open the HashMap documentation page, you’ll see this line:

A hash map implemented with quadratic probing and SIMD lookup.

That line gets the adrenaline going: two advanced terms appear, quadratic probing and SIMD lookup. What do they mean? They are the core of the algorithm design behind Rust’s hash table, and today’s study will revolve around them. So don’t worry; by the end of this session, you’ll understand both.

Let’s cover the basic theory first. The defining challenge of a hash table is that an enormous space of potential inputs has to be mapped into a finite table capacity. This produces hash collisions, where two or more inputs hash to the same location, so a hash table must be able to handle them.

To resolve collisions, we can first alleviate the issue with better, more evenly distributed hash functions, and by using larger hash tables. But collisions can’t be entirely avoided, so we need collision resolution mechanisms.

How do we resolve collisions? #

Theoretically, the main collision resolution mechanisms are chaining and open addressing.

Chaining, which we are more familiar with, involves connecting data that lands on the same hash with a singly linked list or a doubly linked list. Therefore, when searching, you first find the corresponding hash bucket and then compare each element on the collision chain until you find the matching entry.

  • An illustrative diagram of a hash table with chained collisions

Chaining handles hash collisions in a very intuitive way and is easy to understand and code. However, the downside is that the hash table and the collision chain use different memory, which is not cache-friendly.
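To make chaining concrete, here is a minimal toy sketch (purely illustrative, not how any real library implements it), where each bucket holds a Vec standing in for the collision chain, and the key doubles as its own hash:

// A toy chained hash table: each bucket's Vec acts as the collision chain
struct ChainedMap {
    buckets: Vec<Vec<(u64, i32)>>,
}

impl ChainedMap {
    fn new(n: usize) -> Self {
        Self { buckets: vec![Vec::new(); n] }
    }

    fn insert(&mut self, key: u64, value: i32) {
        // For simplicity the key itself serves as the hash
        let idx = (key % self.buckets.len() as u64) as usize;
        self.buckets[idx].push((key, value));
    }

    fn get(&self, key: u64) -> Option<i32> {
        let idx = (key % self.buckets.len() as u64) as usize;
        // Find the bucket, then walk the chain comparing each entry's key
        self.buckets[idx].iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
    }
}

fn main() {
    let mut m = ChainedMap::new(4);
    m.insert(1, 10);
    m.insert(5, 50); // 5 % 4 == 1: collides with key 1 and joins its chain
    assert_eq!(m.get(5), Some(50));
}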

Open addressing treats the whole hash table as one large array and introduces no extra memory. When a collision occurs, the data is inserted into another free slot according to certain rules. With linear probing, for instance, the table keeps probing forward from the collision point until a free slot is found.

Quadratic probing, in essence, looks for a free slot at offsets of plus or minus n² (n = 1, 2, 3, ...) from the original hash position when a collision occurs. The following diagram makes it easier to understand:

  • An illustrative diagram of quadratic probing in a hash table layout
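Here is a minimal sketch of the textbook variant of quadratic probing (illustrative only; as we’ll see shortly, Rust’s actual implementation differs): on a collision, probe at offsets 1², 2², 3², ... from the original slot, wrapping around the table:

// A toy open-addressing table using quadratic probing
fn insert(slots: &mut [Option<(u64, i32)>], key: u64, value: i32) {
    let n = slots.len() as u64;
    let base = key % n; // the key itself serves as the hash for simplicity
    for i in 0..n {
        // Probe at base, base + 1^2, base + 2^2, ... (mod table size)
        let idx = ((base + i * i) % n) as usize;
        if slots[idx].is_none() {
            slots[idx] = Some((key, value));
            return;
        }
    }
    panic!("table full: a real implementation would grow and rehash here");
}

fn main() {
    let mut slots: Vec<Option<(u64, i32)>> = vec![None; 8];
    insert(&mut slots, 1, 10);
    insert(&mut slots, 9, 90); // 9 % 8 == 1 collides; lands at (1 + 1^2) % 8 = 2
    assert_eq!(slots[2], Some((9, 90)));
}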

There are other open addressing schemes, such as double hashing, but we will not go into detail about them today.

Alright, with the theory of quadratic probing clear, we can infer that Rust does not use chaining to resolve hash collisions; it uses open addressing with quadratic probing. Of course, as we will see later, Rust’s implementation of quadratic probing differs somewhat from the textbook method.

The other keyword, SIMD lookup, meaning single instruction, multiple data, is closely tied to the clever memory layout of Rust’s hash table.
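To give a feel for what SIMD lookup means here, below is a simplified scalar sketch of the idea (the real implementation in hashbrown uses SSE2 intrinsics): the table keeps one control byte per bucket, and a lookup compares a whole group of 16 control bytes against a 7-bit fragment of the hash at once, yielding a bitmask of candidate buckets:

// Scalar sketch of a SIMD group match: find which of 16 control bytes equal
// the 7-bit hash fragment h2; SSE2 does all 16 comparisons in parallel
fn match_group(group: &[u8; 16], h2: u8) -> u16 {
    let mut mask = 0u16;
    for (i, &ctrl) in group.iter().enumerate() {
        if ctrl == h2 {
            mask |= 1 << i; // set bit i for each candidate bucket
        }
    }
    mask
}

fn main() {
    // 0xff marks an empty bucket, as the memory dumps later will show
    let mut group = [0xffu8; 16];
    group[3] = 0x2a;
    group[9] = 0x2a;
    // Buckets 3 and 9 are candidates whose keys must then be compared
    assert_eq!(match_group(&group, 0x2a), (1 << 3) | (1 << 9));
}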

HashMap’s Data Structure #

Let’s get to the heart of the matter and take a look at the data structure of Rust’s hash table by inspecting the standard library’s source code:

use hashbrown::hash_map as base;

#[derive(Clone)]
pub struct RandomState {
    k0: u64,
    k1: u64,
}

pub struct HashMap<K, V, S = RandomState> {
    base: base::HashMap<K, V, S>,
}

We can see that HashMap takes three generic parameters: K and V are the key and value types, and S is the hasher state, which defaults to RandomState, occupying two u64s. RandomState uses SipHash as the default hash algorithm, a keyed hash function designed to resist hash-flooding (HashDoS) attacks.
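Because S is a generic parameter, the hashing algorithm can be swapped out. A small sketch using the standard library’s API: with_hasher lets us supply the hasher state explicitly, here the default RandomState, though any BuildHasher implementation (for example a faster but non-DoS-resistant third-party hasher) would work:

use std::collections::hash_map::RandomState;
use std::collections::HashMap;

fn main() {
    // Explicitly supply the default SipHash-based hasher state
    let state = RandomState::new();
    let mut map: HashMap<char, i32, RandomState> = HashMap::with_hasher(state);
    map.insert('a', 1);
    assert_eq!(map.get(&'a'), Some(&1));
}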

From the definition, we also see that Rust’s HashMap reuses the HashMap from hashbrown. Hashbrown is an improved implementation of Google’s Swiss Table. If we open hashbrown’s code, we’ll see its structure:

pub struct HashMap<K, V, S = DefaultHashBuilder, A: Allocator + Clone = Global> {
    pub(crate) hash_builder: S,
    pub(crate) table: RawTable<(K, V), A>,
}

We can see that this HashMap has two fields: hash_builder, which is the RandomState we just saw in the standard library, and table, the concrete RawTable:

pub struct RawTable<T, A: Allocator + Clone = Global> {
    table: RawTableInner<A>,
    // Tell dropck that we own instances of T.
    marker: PhantomData<T>,
}

struct RawTableInner<A> {
    // Mask to get an index from a hash value. The value is one less than the
    // number of buckets in the table.
    bucket_mask: usize,

    // [Padding], T1, T2, ..., Tlast, C1, C2, ...
    //                                ^ points here
    ctrl: NonNull<u8>,

    // Number of elements that can be inserted before we need to grow the table
    growth_left: usize,

    // Number of elements in the table, only really used by len()
    items: usize,

    alloc: A,
}

In RawTable, the data structure with actual significance is RawTableInner. Its first four fields are critical, and we will come back to all of them when discussing HashMap’s memory layout:

  • A usize bucket_mask, which is the number of hash buckets in the hash table minus one;
  • A pointer called ctrl, which points to the ctrl area at the end of the hash table’s heap memory;
  • A usize field growth_left, indicating how many data items can be stored before the hash table needs to grow automatically;
  • A usize items, showing how much data is currently in the hash table.

The final field, alloc, like RawTable’s marker, can be treated as a placeholder here; all we need to know is that it is the allocator used to allocate the table’s heap memory.
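Why store the bucket count minus one? Because bucket counts are always powers of two, ANDing a hash with the mask replaces the more expensive modulo when computing a bucket index. A minimal sketch of the trick:

// With a power-of-two bucket count, hash & mask equals hash % buckets
fn bucket_index(hash: u64, bucket_mask: usize) -> usize {
    (hash as usize) & bucket_mask
}

fn main() {
    // 8 buckets -> bucket_mask 0x7, the same value we will print out later
    assert_eq!(bucket_index(0x1234_5678, 0x7), 0x1234_5678usize % 8);
}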

Basic Usage of HashMap #

Having understood the data structure, let’s look at how to use it. Rust’s hash table is very straightforward to use and provides a series of convenient methods similar to those in other languages. If you take a look at the documentation, you’ll find it easy to pick up. Let’s write some code and give it a try (code):

use std::collections::HashMap;

fn main() {
    let mut map = HashMap::new();
    explain("empty", &map);

    map.insert('a', 1);
    explain("added 1", &map);

    map.insert('b', 2);
    map.insert('c', 3);
    explain("added 3", &map);

    map.insert('d', 4);
    explain("added 4", &map);

    // When using get, references are needed, and it returns a reference as well
    assert_eq!(map.get(&'a'), Some(&1));
    assert_eq!(map.get_key_value(&'b'), Some((&'b', &2)));

    map.remove(&'a');
    // After deletion, it can no longer be found
    assert_eq!(map.contains_key(&'a'), false);
    assert_eq!(map.get(&'a'), None);
    explain("removed", &map);
    // After shrink_to_fit, the hash table becomes smaller
    map.shrink_to_fit();
    explain("shrinked", &map);
}

fn explain<K, V>(name: &str, map: &HashMap<K, V>) {
    println!("{}: len: {}, cap: {}", name, map.len(), map.capacity());
}

Running this code, we observe the following output:

empty: len: 0, cap: 0
added 1: len: 1, cap: 3
added 3: len: 3, cap: 3
added 4: len: 4, cap: 7
removed: len: 3, cap: 7
shrunk: len: 3, cap: 3

As we can see, HashMap::new() does not allocate any space, and the capacity is zero. As more data is inserted, the hash table grows, its capacity following powers of 2 minus one, with a minimum of 3. When data is deleted, the capacity remains unchanged; only when shrink_to_fit is explicitly called does the hash table shrink.
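One practical consequence: each growth step allocates a new table and rehashes every entry. If you know roughly how many entries you will insert, with_capacity lets you pre-allocate and skip the intermediate growth steps:

use std::collections::HashMap;

fn main() {
    // Reserve room for at least 4 entries up front: no rehashing during these inserts
    let mut map = HashMap::with_capacity(4);
    for (k, v) in [('a', 1), ('b', 2), ('c', 3), ('d', 4)] {
        map.insert(k, v);
    }
    assert!(map.capacity() >= 4);
}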

HashMap’s Memory Layout #

However, HashMap’s public interface gives us no view of how it is laid out in memory. We still need std::mem::transmute to print out the data structure’s fields. Let’s make some changes to the previous code (code):

use std::collections::HashMap;

fn main() {
    let map = HashMap::new();
    let mut map = explain("empty", map);

    map.insert('a', 1);
    let mut map = explain("added 1", map);
    map.insert('b', 2);
    map.insert('c', 3);

    let mut map = explain("added 3", map);

    map.insert('d', 4);

    let mut map = explain("added 4", map);

    map.remove(&'a');

    explain("final", map);
}

// On this 64-bit build, the HashMap struct happens to be laid out as two u64s
// for RandomState followed by four usizes (bucket_mask, ctrl, growth_left, items);
// struct layout is not guaranteed, so this transmute trick is for exploration only.
// After printing the fields, we transmute it back
fn explain<K, V>(name: &str, map: HashMap<K, V>) -> HashMap<K, V> {
    let arr: [usize; 6] = unsafe { std::mem::transmute(map) };
    println!(
        "{}: bucket_mask 0x{:x}, ctrl 0x{:x}, growth_left: {}, items: {}",
        name, arr[2], arr[3], arr[4], arr[5]
    );
    unsafe { std::mem::transmute(arr) }
}

After running, we can observe:

empty: bucket_mask 0x0, ctrl 0x1056df820, growth_left: 0, items: 0
added 1: bucket_mask 0x3, ctrl 0x7fa0d1405e30, growth_left: 2, items: 1
added 3: bucket_mask 0x3, ctrl 0x7fa0d1405e30, growth_left: 0, items: 3
added 4: bucket_mask 0x7, ctrl 0x7fa0d1405e90, growth_left: 3, items: 4
final: bucket_mask 0x7, ctrl 0x7fa0d1405e90, growth_left: 4, items: 3

Interesting, we find that the heap address corresponding to ctrl changes during the run.

On my macOS machine, the hash table is initially empty, and the ctrl address appears to be a TEXT/RODATA segment address, probably pointing to a default empty table. After the first piece of data is inserted, the hash table allocates 4 buckets and the ctrl address changes. After three pieces of data are inserted, growth_left drops to zero. When one more insertion occurs, the hash table reallocates and the ctrl address changes again.

We mentioned earlier, when exploring HashMap’s data structure, that ctrl points to the ctrl area sitting at the end of the hash table’s heap memory, so from this address we can work out the starting address of that heap memory.

Because the hash table has 8 buckets (0x7 + 1) and each bucket holds a key (char) plus a value (i32), 8 bytes per bucket, the buckets take up 64 bytes in total. So in this example, subtracting 64 from the ctrl address gives the starting address of the heap memory. We can then use rust-gdb/rust-lldb to print that memory (if you’re interested in rust-gdb/rust-lldb, see the recommended readings at the end of this article).
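The arithmetic is simple enough to write down directly. A small sketch of the computation on a 64-bit target, using the ctrl address from the run above:

use std::mem::size_of;

fn main() {
    let bucket_mask: usize = 0x7;
    let ctrl: usize = 0x7fa0_d140_5e90; // the ctrl address printed in the last run
    let buckets = bucket_mask + 1;
    // Each bucket holds a (char, i32) pair: 4 + 4 = 8 bytes
    let start = ctrl - buckets * size_of::<(char, i32)>();
    println!("heap memory starts at 0x{:x}", start); // 64 bytes below ctrl
}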

Here I use rust-gdb under Linux to set breakpoints and examine the hash table’s state after adding 1, 3, 4 values, and deleting one value:

❯ rust-gdb ~/.target/debug/hashmap2
GNU gdb (Ubuntu 9.2-0ubuntu2) 9.2
...
(gdb) b hashmap2.rs:32
Breakpoint 1 at 0xa43e: file src/hashmap2.rs, line 32.
(gdb) r
Starting program: /home/tchen/.target/debug/hashmap2
...
# Initial state, the hash table is empty
empty: bucket_mask 0x0, ctrl 0x555555597be0, growth_left: 0, items: 0

Breakpoint 1, hashmap2::explain (name=..., map=...) at src/hashmap2.rs:32
32      unsafe { std::mem::transmute(arr) }
(gdb) c
Continuing.
# After inserting one element, there are 4 buckets (0x3+1), heap start address is 0x5555555a7af0 - 4*8(0x20)
added 1: bucket_mask 0x3, ctrl 0x5555555a7af0, growth_left: 2, items: 1

Breakpoint 1, hashmap2::explain (name=..., map=...) at src/hashmap2.rs:32
32      unsafe { std::mem::transmute(arr) }
(gdb) x /12x 0x5555555a7ad0
0x5555555a7ad0:   0x00000061      0x00000001      0x00000000      0x00000000
0x5555555a7ae0:   0x00000000      0x00000000      0x00000000      0x00000000
0x5555555a7af0:   0x0affffff      0xffffffff      0xffffffff      0xffffffff
(gdb) c
Continuing.
# After inserting three elements, the hash table has no remaining space, heap start address remains the same 0x5555555a7af0 - 4*8(0x20)
added 3: bucket_mask 0x3, ctrl 0x5555555a7af0, growth_left: 0, items: 3

Breakpoint 1, hashmap2::explain (name=..., map=...) at src/hashmap2.rs:32
32      unsafe { std::mem::transmute(arr) }
(gdb) x /12x 0x5555555a7ad0
0x5555555a7ad0:   0x00000061      0x00000001      0x00000062      0x00000002
0x5555555a7ae0:   0x00000000      0x00000000      0x00000063      0x00000003
0x5555555a7af0:   0x0a72ff02      0xffffffff      0xffffffff      0xffffffff
(gdb) c
Continuing.
# After inserting the fourth element, the hash table expands, heap start address becomes 0x5555555a7b50 - 8*8(0x40)
added 4: bucket_mask 0x7, ctrl 0x5555555a7b50, growth_left: 3, items: 4

Breakpoint 1, hashmap2::explain (name=..., map=...) at src/hashmap2.rs:32
32      unsafe { std::mem::transmute(arr) }
(gdb) x /20x 0x5555555a7b10
0x5555555a7b10:   0x00000061      0x00000001      0x00000000      0x00000000
0x5555555a7b20:   0x00000064      0x00000004      0x00000063      0x00000003
0x5555555a7b30:   0x00000000      0x00000000      0x00000062      0x00000002
0x5555555a7b40:   0x00000000      0x00000000      0x00000000      0x00000000
0x5555555a7b50:   0xff72ffff      0x0aff6502      0xffffffff      0xffffffff
(gdb) c
Continuing.
# After deleting 'a', remaining 4 positions. Note the change of ctrl bits, and 0x61 0x1 has not been cleared
final: bucket_mask 0x7, ctrl 0x5555555a7b50, growth_left: 4, items: 3