04 Do You Really Understand Dictionary Collection

04 Do You Really Understand Dictionary Collection #

Hello, I’m Jingxiao.

In the previous lessons, we learned about lists and tuples in Python and understood their basic operations and performance comparisons. In this lesson, we will learn about two equally common and useful data structures: dictionaries (dict) and sets (set). Dictionaries and sets are widely used in Python, and their performance has been highly optimized, highlighting their importance.

Basics of Dictionaries and Sets #

So what exactly are dictionaries and sets? A dictionary is a collection of elements that consists of key-value pairs. In Python 3.7+, dictionaries are considered ordered (note: in Python 3.6, dictionary ordering is an implementation detail and only officially became a language feature in 3.7, therefore its order cannot be guaranteed in 3.6), while in versions prior to 3.6, they are unordered. Dictionaries have variable length, and elements can be freely added, deleted, and modified.

Compared to lists and tuples, dictionaries have better performance, especially for lookups, additions, and deletions, all of which can be done in constant time complexity.

Sets are similar to dictionaries, with the only difference being that sets do not have key-value pairs. Instead, sets consist of an unordered collection of unique elements.

Let’s first look at how dictionaries and sets can be created. There are several common ways to create dictionaries and sets:

d1 = {'name': 'jason', 'age': 20, 'gender': 'male'}
d2 = dict({'name': 'jason', 'age': 20, 'gender': 'male'})
d3 = dict([('name', 'jason'), ('age', 20), ('gender', 'male')])
d4 = dict(name='jason', age=20, gender='male') 
d1 == d2 == d3 ==d4
True

s1 = {1, 2, 3}
s2 = set([1, 2, 3])
s1 == s2
True

Note that in Python, both keys and values in dictionaries and sets can have mixed types. For example, in the following example, I have created a set with elements 1, 'hello', and 5.0:

s = {1, 'hello', 5.0}

Now let’s look at element access. For dictionaries, elements can be accessed directly by indexing the key, and if the key does not exist, an exception will be thrown:

d = {'name': 'jason', 'age': 20}
d['name']
'jason'
d['location']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'location'

Alternatively, the get(key, default) function can be used for indexing. If the key does not exist, calling get() function will return a default value. In the following example, it returns 'null':

d = {'name': 'jason', 'age': 20}
d.get('name')
'jason'
d.get('location', 'null')
'null'

After discussing dictionary access, let’s move on to sets.

First and foremost, sets do not support indexing operations because sets are essentially hash tables, unlike lists. Therefore, the following operation is invalid, and Python will throw an exception:

s = {1, 2, 3}
s[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object does not support indexing

To determine whether an element is in a dictionary or set, we can use value in dict/set:

s = {1, 2, 3}
1 in s
True
10 in s
False

d = {'name': 'jason', 'age': 20}
'name' in d
True
'location' in d
False

Of course, in addition to creation and access, dictionaries and sets also support addition, deletion, and updating operations.

For dictionaries, we can add elements by assigning a value to a new key, like so: d['gender'] = 'male'. We can also update the value of a key by assigning a new value to it, like d['dob'] = '1999-02-01'. To delete an element with a specific key from a dictionary, we can use the pop() function. For example, d.pop('dob') deletes the element with key 'dob'.

For sets, elements can be added using the add() function, like s.add(4), and elements can be removed using the remove() function, like s.remove(4).

However, note that the pop() operation on sets deletes the last element in the set. Since sets are unordered, you cannot predict which element will be deleted, so this operation should be used with caution.

In practical applications, there are often cases where we need to sort dictionaries or sets, such as retrieving the top 50 pairs with the highest values.

For dictionaries, we usually sort them in ascending or descending order based on the keys or values:

d = {'b': 1, 'a': 2, 'c': 10}
d_sorted_by_key = sorted(d.items(), key=lambda x: x[0]) # sort by keys in ascending order
d_sorted_by_value = sorted(d.items(), key=lambda x: x[1]) # sort by values in ascending order
d_sorted_by_key
[('a', 2), ('b', 1), ('c', 10)]
d_sorted_by_value
[('b', 1), ('a', 2), ('c', 10)]

This returns a list where each element is a tuple consisting of the original dictionary’s key-value pairs.

As for sets, sorting is similar to lists and tuples discussed earlier. You can simply call sorted(set) to obtain a sorted list.

s = {3, 4, 2, 1}
sorted(s) # sort the elements of the set in ascending order
[1, 2, 3, 4]

Dictionary and Set Performance #

As I mentioned at the beginning of the article, dictionaries and sets are highly optimized data structures, especially for lookup, insertion, and deletion operations. Now let’s take a look at their performance in specific scenarios and compare them with other data structures like lists.

For example, in the backend of an e-commerce company, we store the ID, name, and price of each product. The requirement now is to find the price of a given product by its ID.

If we use a list to store these data structures and perform the lookup, the corresponding code would be as follows:

def find_product_price(products, product_id):
    for id, price in products:
        if id == product_id:
            return price
    return None
    
products = [
    (143121312, 100),
    (432314553, 30),
    (32421912367, 150)
]

print('The price of product 432314553 is {}'.format(find_product_price(products, 432314553)))

# Output
The price of product 432314553 is 30

Assuming the list has n elements and the lookup process needs to iterate through the list, the time complexity would be O(n). Even if we sort the list first and use binary search, it would still require a time complexity of O(log n), not to mention that the sorting of a list also requires a time complexity of O(n log n).

But if we use a dictionary to store these data, the lookup becomes very convenient and efficient, with a time complexity of O(1). The reason is simple: as mentioned earlier, a dictionary is composed of a hash table, which allows you to directly find the value corresponding to a key using its hash value.

products = {
  143121312: 100,
  432314553: 30,
  32421912367: 150
}
print('The price of product 432314553 is {}'.format(products[432314553]))

# Output
The price of product 432314553 is 30

Similarly, if the requirement changes to finding out how many different prices these products have, we can compare using the same method.

If we still choose to use a list, the corresponding code would be as follows, where A and B are two levels of loops. Assuming the original list has n elements, in the worst case, it would require a time complexity of O(n^2).

# list version
def find_unique_price_using_list(products):
    unique_price_list = []
    for _, price in products: # A
        if price not in unique_price_list: #B
            unique_price_list.append(price)
    return len(unique_price_list)

products = [
    (143121312, 100),
    (432314553, 30),
    (32421912367, 150),
    (937153201, 30)
]
print('number of unique price is: {}'.format(find_unique_price_using_list(products)))

# Output
number of unique price is: 3

But if we choose to use a set data structure, due to the highly optimized nature of sets as hash tables, their elements can’t be duplicate, and both the insertion and lookup operations have a time complexity of O(1). Therefore, the overall time complexity would be only O(n).

# set version
def find_unique_price_using_set(products):
    unique_price_set = set()
    for _, price in products:
        unique_price_set.add(price)
    return len(unique_price_set)

products = [
    (143121312, 100),
    (432314553, 30),
    (32421912367, 150),
    (937153201, 30)
]
print('number of unique price is: {}'.format(find_unique_price_using_set(products)))

# Output
number of unique price is: 3

You may not have an intuitive understanding of these time complexities, so let me give you an example of a practical work scenario to help you feel it.

The following code initializes 100,000 elements of products and calculates the runtime of using a list and a set to count the number of product prices:

import time
id = [x for x in range(0, 100000)]
price = [x for x in range(200000, 300000)]
products = list(zip(id, price))

# Calculate the time for the list version
start_using_list = time.perf_counter()
find_unique_price_using_list(products)
end_using_list = time.perf_counter()
print("time elapsed using list: {}".format(end_using_list - start_using_list))
# Output
time elapsed using list: 41.61519479751587

# Calculate the time for the set version
start_using_set = time.perf_counter()
find_unique_price_using_set(products)
end_using_set = time.perf_counter()
print("time elapsed using set: {}".format(end_using_set - start_using_set))
# Output
time elapsed using set: 0.008238077163696289

As you can see, with just 100,000 data points, the difference in speed between the two approaches is already significant. In fact, the backend data of large companies often reaches billions or even tens of billions of data points. If an inappropriate data structure is used, it can easily cause server crashes, not only affecting user experience but also causing huge financial losses to the company.

How Dictionaries and Sets Work #

Through examples and comparisons with lists, we have seen the efficiency of dictionary and set operations. But why are dictionaries and sets so efficient, especially when it comes to lookup, insertion, and deletion?

This is closely related to the internal data structure of dictionaries and sets. Unlike other data structures, the internal structure of dictionaries and sets is a hash table.

For dictionaries, this table stores three elements: a hash value, a key, and a value.
For sets, the difference is that there are no key-value pairs in the hash table, only individual elements.

Let’s take a look at the hash table structure in older versions of Python:

--+-------------------------------+
  |   hash       key      value
--+-------------------------------+
0 |   hash0      key0    value0
--+-------------------------------+
1 |   hash1      key1    value1
--+-------------------------------+
2 |   hash2      key2    value2
--+-------------------------------+
. |           ...
__+_______________________________+

It’s not difficult to imagine that as the hash table expands, it becomes more and more sparse. For example, if I have a dictionary like this:

{'name': 'mike', 'dob': '1999-01-01', 'gender': 'male'}

It would be stored in a form like this:

entries = [
['--', '--', '--']
[-230273521, 'dob', '1999-01-01'],
['--', '--', '--'],
['--', '--', '--'],
[1231236123, 'name', 'mike'],
['--', '--', '--'],
[9371539127, 'gender', 'male']
]

This design structure clearly wastes a lot of storage space. In order to improve the utilization of storage space, modern hash tables, in addition to the structure of the dictionary itself, separate the indices from the hash values, keys, and values, like this:

Indices
----------------------------------------------------
None | index | None | None | index | None | index ...
----------------------------------------------------

Entries
--------------------
hash0   key0  value0
---------------------
hash1   key1  value1
---------------------
hash2   key2  value2
---------------------
        ...
---------------------

So, for the example we just looked at, the storage form in the new hash table structure would be like this:

indices = [None, 1, None, None, 0, None, 2]
entries = [
[1231236123, 'name', 'mike'],
[-230273521, 'dob', '1999-01-01'],
[9371539127, 'gender', 'male']
]

We can see clearly that the space utilization has been greatly improved.

Now that we understand the specific design structure, let’s look at how these operations work.

Insertion Operation #

Each time an element is inserted into a dictionary or set, Python first calculates the hash value of the key (hash(key)), and then performs a bitwise AND operation with mask = PyDicMinSize - 1 to calculate the position (index = hash(key) & mask) where the element should be inserted into the hash table. If this position is empty, the element is inserted there.

If this position is already occupied, Python compares the hash values and keys of the two elements.

If both are equal, it means that the element already exists. If the values are different, the value is updated.
If one of them is different, which we usually call a hash collision, it means that the keys of the two elements are different, but the hash values are equal. In this case, Python continues to search for an empty position in the table until it finds one.

It is worth mentioning that in most cases, the simplest way to handle hash collisions is linear probing, which means searching for an empty slot starting from the current position. Of course, Python has optimized this step internally (you don’t need to delve into this, but if you’re interested, you can check the source code), making it more efficient.

Lookup Operation #

Similar to the insertion operation we discussed earlier, Python finds the position based on the hash value, and then compares the hash value and key of the element in that position with the element being searched for. If they are equal, it is returned directly. If they are not equal, the search continues until an empty slot is found or an exception is raised.

Deletion Operation #

For a deletion operation, Python temporarily assigns a special value to the element at that position and removes it when the size of the hash table is adjusted.

It is not difficult to understand that hash collisions often lead to a slowdown in dictionary and set operations. Therefore, to ensure efficiency, the hash tables in dictionaries and sets usually guarantee at least 1/3 of remaining space. As elements are continuously inserted, when the remaining space is less than 1/3, Python obtains a larger memory space and expands the hash table. However, in this case, all element positions in the table are rearranged.

Although hash collisions and resizing of the hash table can slow down operations, these situations occur very rarely. Therefore, on average, this still guarantees a time complexity of O(1) for insertion, lookup, and deletion.

Summary #

In this lesson, we learned about the basic operations of dictionaries and sets, and explained their high performance and internal storage structures.

In Python 3.7+, dictionaries are ordered data structures, while sets are unordered. Their internal hash table storage structure ensures efficient operations such as lookup, insertion, and deletion. Therefore, dictionaries and sets are often used in scenarios that require efficient element lookup and deduplication.

Thought Questions #

1. Which one of the following ways to initialize a dictionary is more efficient?

# Option A
d = {'name': 'jason', 'age': 20, 'gender': 'male'}

# Option B
d = dict({'name': 'jason', 'age': 20, 'gender': 'male'})

2. Can the keys of a dictionary be a list? Is the initialization of the dictionary in the following code correct? If not, can you explain why?

d = {'name': 'jason', ['education']: ['Tsinghua University', 'Stanford University']}

Feel free to leave a comment and share with me, and also feel free to share this article with your colleagues and friends.