14 Answering How Lists and Tuples Are Implemented Internally

14 Answering How Lists and Tuples Are Implemented Internally #

Hello, I am Jingxiao.

In the blink of an eye, it has already been a month since this column went live, and we have unknowingly completed the first major chapter on the basics. I am very pleased to see that many students have been actively studying and leaving behind high-quality comments, which deserve our mutual thoughts and discussions. There are also some students who have carefully pondered and pointed out some inaccuracies or inappropriate expressions in the articles, for which I am very grateful.

I have already replied to most of the comments in the respective articles. However, there are some comments that are not convenient to reply to on mobile devices or are very valuable and typical, so I have specifically selected them for today’s Q&A session and will reply to them collectively.

Question 1: Implementation of List and Tuple #

The first question, raised by student Hu Yao, is about the internal implementation of lists and tuples. They would like to know if lists and tuples are implemented using a linked list or an array, or if arrays are linked in some way.

To answer this question, we can examine the source code.

Let’s start with the source code for lists in Python 3.7. You can read the content in the following two links:

listobject.h: https://github.com/python/cpython/blob/949fe976d5c62ae63ed505ecf729f815d0baccfc/Include/listobject.h#L23

listobject.c: https://github.com/python/cpython/blob/3d75bd15ac82575967db367c517d7e6e703a6de3/Objects/listobject.c#L33

I have included the specific structure of a list below:

As you can see, a list is essentially an array that is over-allocated. The ob_item field is a pointer list, where each pointer points to an element in the list. The allocated field stores the size of the allocated space for the list.

It is important to note the difference between allocated and the actual size of the list. The actual size of the list is the number of elements returned by len(list), which is referred to as ob_size in the code comments above. In practice, to optimize storage and avoid reallocating memory every time an element is added, the allocated space for a list is often larger than ob_size (see the example in the main text).

So, their relationship is: allocated >= len(list) = ob_size.

If the currently allocated space for a list is full (i.e., allocated == len(list)), Python requests more memory from the system and copies all the elements to the new space. The size of the memory allocated for a list follows this pattern:

0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...

Now, let’s analyze tuples. The following is the source code for tuples in Python 3.7. Again, I encourage you to read the content yourself:

tupleobject.h: https://github.com/python/cpython/blob/3d75bd15ac82575967db367c517d7e6e703a6de3/Include/tupleobject.h#L25

tupleobject.c: https://github.com/python/cpython/blob/3d75bd15ac82575967db367c517d7e6e703a6de3/Objects/tupleobject.c#L16

Similarly, here is the specific structure of a tuple:

As you can see, it is similar to a list, as it is also an array, but its size is fixed. Unlike a regular array, Python’s tuple has been optimized to improve efficiency in programs.

For example, when the size of a tuple is no more than 20, Python caches it in an internal free list. This allows Python to directly load the tuple from the cache if you need to create the same tuple again in the future, improving program execution efficiency.

Question 2: Why do elements become sparser in the old hash table? #

The second question, raised by student Hoo, is why elements become sparser in the old hash table.

Let’s first take a look at the schematic of the old hash table:

   +-------------------------------+
   | Hash    |   Key   |   Value   |
   +-------------------------------+
   | hash0   |  key0   |  value0   |
   +-------------------------------+
   | hash1   |  key1   |  value1   |
   +-------------------------------+
   | hash2   |  key2   |  value2   |
   +-------------------------------+
   |   ...   |   ...   |   ...     |
   +-------------------------------+

You will notice that it is an over-allocated array, and the index at which an element should be inserted is calculated based on the hash value of its key.

For example, if I have a dictionary like this:

{'name': 'mike', 'dob': '1999-01-01', 'gender': 'male'}

Then this dictionary will be stored in a format similar to the following:

entries = [
    ['--', '--', '--'],
    [-230273521, 'dob', '1999-01-01'],
    ['--', '--', '--'],
    ['--', '--', '--'],
    [1231236123, 'name', 'mike'],
    ['--', '--', '--'],
    [9371539127, 'gender', 'male']
]

Here, '--' indicates that there is no element at that position, but memory has been allocated.

We know that when the remaining space in the hash table is less than 1/3, in order to ensure the efficiency of relevant operations and avoid hash collisions, a larger memory allocation will be performed. Therefore, as the number of elements in the hash table increases, the number of positions with allocated memory but no elements also increases. As a result, the hash table becomes sparser.

On the other hand, the structure of the new hash table has changed this, greatly improving space utilization. The structure of the new hash table is as follows:

    Indices
----------------------------------------------------
None | index | None | None | index | None | index ...
----------------------------------------------------


Entries
--------------------
hash0   key0  value0
---------------------
hash1   key1  value1
---------------------
hash2   key2  value2
---------------------
        ...
---------------------

As you can see, the storage structure is divided into two arrays: Indices and Entries, with 'None' representing positions where memory has been allocated but there are no elements.

Using the same example as before, its storage mode in the new hash table would look like this:

indices = [None, 1, None, None, 0, None, 2]
entries = [
    [1231236123, 'name', 'mike'],
    [-230273521, 'dob', '1999-01-01'],
    [9371539127, 'gender', 'male']
]

The values in the Indices correspond to the respective indices in the entries. For example, the value 1 in indices corresponds to entries[1], which is 'dob': '1999-01-01'.

By comparison, we can clearly see that the space utilization in the new hash table is greatly improved compared to the old hash table.

Question 3: Confusion about Exceptions #

The third question is posed by the student who goes by the phrase “Don’t change the name until I lose weight to 140”. They are confused about the “NameError” exception. This is a very common error, and I will explain it here.

This question is actually a bit tricky. If you consult the official documentation, you will see the following sentence: “When an exception has been assigned using as target, it is cleared at the end of the except clause.”

This means that if you assign an exception to a variable in the “except” block of an exception handling statement, that variable will be cleared at the end of the “except” block, as shown in the following example:

e = 1
try:
    1 / 0
except ZeroDivisionError as e:
    try:
        pass
    finally:
        del e

Here, variable e initially points to the integer 1, but it is deleted (del e) at the end of the “except” block. Therefore, when the program executes, it will throw a “NameError” exception.

Therefore, this reminds us that when writing code, we must ensure that the variable assigned to the exception in the “except” block is not used in subsequent statements.

Question 4: Regarding polymorphism and modifying global variables #

The final question comes from student farFlight, who raised two questions:

  1. Is the polymorphism in Python’s type inference the same as polymorphism in subclass inheritance?
  2. In a function, you cannot directly modify global variables using +=, but you can modify global variables of type list using append, extend, etc. Why is that?

Let’s look at these two questions separately. For the first question, to understand the concept of polymorphism, it means having multiple different forms. Therefore, in essence, the polymorphism of type inference and polymorphism of subclass inheritance are the same, but you can understand them as two different manifestations of polymorphism.

Now let’s look at the second question. When the global variable points to an immutable object, such as an integer or a string, if you try to change its value inside a function without using the global keyword, an exception will be raised:

x = 1

def func():
    x += 1
func()
x

## Output
UnboundLocalError: local variable 'x' referenced before assignment

This is because by default, the x inside the function is treated as a local variable, and you are trying to reference it without assigning a value, which is obviously not possible.

However, if the global variable points to a mutable object, such as a list or a dictionary, you can modify it inside the function:

x = [1]

def func():
    x.append(2)
func()
x

## Output
[1, 2]

Of course, it’s worth noting that in this case, x.append(2) doesn’t change the variable x. x still points to the original list. In fact, this statement means accessing the list that x points to and adding 2 to the end of this list.

That’s all for today’s answers to these questions. You are welcome to continue writing your questions and thoughts in the comment section, and I will continue to answer them. I hope that every comment and Q&A can bring you new knowledge and value.