24 Guide to Python Garbage Collection Mechanism #

Hello, I’m Jingxiao.

As we all know, contemporary computers follow the von Neumann architecture. Its theoretical ancestor, the Turing machine, is in essence an infinitely long tape, which corresponds to the memory we use today. As engineering evolved, products such as registers, volatile memory (RAM), and permanent storage (hard disks) gradually emerged. This all stems from a basic trade-off: the faster the memory, the higher the price per unit of capacity. Making effective use of every inch of high-speed memory has therefore always been a core concern of system design.

Let’s go back to the Python application layer.

We know that when a Python program is running, it needs to allocate a space in memory to store temporary variables generated during runtime. After the calculation is completed, the results are output to permanent storage. If the amount of data is too large and the memory space is not managed properly, it is easy to encounter OOM (out of memory), commonly known as a memory explosion, which may cause the program to be terminated by the operating system.

For servers, which are designed to run without interruption, memory management is even more important, because poorly managed memory easily leads to memory leaks. So what exactly is a memory leak?

  • Here, “leak” does not mean that your memory has a security issue and has been exploited by malicious programs, but rather, it means that the program itself was not well-designed, resulting in the failure to release unused memory.
  • A memory leak does not mean that your memory physically disappears, but rather that the code, after allocating a certain segment of memory, loses control over that memory due to design errors, resulting in wasted memory.

So, how does Python solve these problems? In other words, how does Python collect and recycle memory spaces that are no longer being used?

Reference Counting #

We have mentioned several times before that in Python, everything is an object. Therefore, everything you see as a variable is essentially a pointer to an object.

So how do we determine that an object will never be used again?

As we mentioned in the previous lesson, a very intuitive idea is that when the reference count (number of pointers) of an object is 0, it means that the object is not reachable and becomes garbage that needs to be collected.

Let’s look at an example:

import os
import psutil

# Display the memory size used by the current Python program
def show_memory_info(hint):
    pid = os.getpid()
    p = psutil.Process(pid)

    info = p.memory_full_info()
    memory = info.uss / 1024. / 1024
    print('{} memory used: {} MB'.format(hint, memory))


def func():
    show_memory_info('initial')
    a = [i for i in range(10000000)]
    show_memory_info('after a created')

func()
show_memory_info('finished')

########## Output ##########

initial memory used: 47.19140625 MB
after a created memory used: 433.91015625 MB
finished memory used: 48.109375 MB

In this example, you can see that when the function func() is called, the memory usage quickly increases to 433 MB after the list a is created. After the function call ends, the memory returns to normal.

This is because the list a, declared inside the function, is a local variable. When the function returns, the reference to the local variable is destroyed. At this point, the reference count of the object that a points to becomes 0, and Python triggers garbage collection, reclaiming the previously occupied memory.

After understanding this principle, let’s modify the code slightly:

def func():
    show_memory_info('initial')
    global a
    a = [i for i in range(10000000)]
    show_memory_info('after a created')

func()
show_memory_info('finished')

########## Output ##########

initial memory used: 48.88671875 MB
after a created memory used: 433.94921875 MB
finished memory used: 433.94921875 MB

In this new code, global a declares a as a global variable. Therefore, even after the function returns, the reference to the list still exists, and the object will not be garbage collected, continuing to occupy a large amount of memory.

Similarly, if we return the generated list and receive it in the main program, the reference will still exist and garbage collection will not be triggered, resulting in a large amount of memory still being occupied:

def func():
    show_memory_info('initial')
    a = [i for i in range(10000000)]
    show_memory_info('after a created')
    return a

a = func()
show_memory_info('finished')

########## Output ##########

initial memory used: 47.96484375 MB
after a created memory used: 434.515625 MB
finished memory used: 434.515625 MB

These are the most common cases. Now, let’s take a deeper look at the reference counting mechanism in Python. As always, let’s start with the code:

import sys

a = []

# Two references, one from `a` and one from `getrefcount`
print(sys.getrefcount(a))

def func(a):
    # Four references, `a`, the Python function call stack, the function parameter, and `getrefcount`
    print(sys.getrefcount(a))

func(a)

# Two references, one from `a` and one from `getrefcount`, as the function `func` call no longer exists
print(sys.getrefcount(a))

########## Output ##########

2
4
2

A brief introduction to sys.getrefcount(): this function reports the number of references to an object. The code itself should be easy to follow; just don’t forget that the call to getrefcount itself adds one extra reference to the count.

Another thing to note is that when a function call occurs, an additional two references are created: one from the function call stack and another from the function parameter.

import sys

a = []

print(sys.getrefcount(a)) # Two references

b = a

print(sys.getrefcount(a)) # Three references

c = b
d = b
e = c
f = e
g = d

print(sys.getrefcount(a)) # Eight references

########## Output ##########

2
3
8

In this code, note that a, b, c, d, e, f, and g all refer to the same object, and sys.getrefcount() counts the references to that object rather than tracking any single variable. With seven variables plus the temporary reference created by the getrefcount call itself, we end up with eight references.

Once you understand the concept of references, releasing them becomes a very natural and intuitive idea. Compared with C, where you have to free memory manually with free(), Python’s garbage collection can be considered far more convenient and effortless.

However, some people may still ask, what if I want to manually release memory? What should I do?

The method is also very simple. You just need to call del a to delete the reference to the object, then force a garbage collection by calling gc.collect() to clear the objects that are no longer referenced.

import gc

show_memory_info('initial')

a = [i for i in range(10000000)]

show_memory_info('after a created')

del a
gc.collect()

show_memory_info('finish')
print(a)

########## Output ##########

initial memory used: 48.1015625 MB
after a created memory used: 434.3828125 MB
finish memory used: 48.33203125 MB

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-12-153e15063d8a> in <module>
    11 
    12 show_memory_info('finish')
---> 13 print(a)

NameError: name 'a' is not defined

At this point, doesn’t garbage collection seem simple?

By now, some of you probably feel you have it all figured out. So what if an interviewer asks: is a reference count of 0 the necessary and sufficient condition for garbage collection to be triggered? Are there any other possibilities?

Can you answer this question?

Circular Reference #

If you are also stuck, don’t worry. Let’s take small steps and consider a question first: if two objects reference each other and are no longer referenced by any other objects, should they be garbage collected?

Please observe the code below carefully:

def func():
    show_memory_info('initial')
    a = [i for i in range(10000000)]
    b = [i for i in range(10000000)]
    show_memory_info('after a, b created')
    a.append(b)
    b.append(a)

func()
show_memory_info('finished')

########## Output ##########

initial memory used: 47.984375 MB
after a, b created memory used: 822.73828125 MB
finished memory used: 821.73046875 MB

Here, a and b reference each other, and since they are local variables, their names no longer exist after func returns. Yet memory is clearly still being occupied. Why? Because the circular reference keeps their reference counts from ever reaching 0.

Imagine this code in a production environment: even if the space initially occupied by a and b is not large, over time Python’s memory usage would keep growing, eventually crashing the server, with potentially catastrophic consequences.

Of course, you might argue that mutual references like this are still fairly easy to spot, so it is not a big problem. The more insidious case, however, is a longer chain of references that eventually loops back on itself. In complex engineering code, such reference chains are not easy to spot.
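
For example, a cycle does not have to be a direct mutual reference; it can hide behind a longer chain of objects. Here is a minimal sketch (the Node class and names are purely hypothetical, just for illustration):

class Node:
    """A hypothetical container object that holds a reference to another object."""
    def __init__(self, name):
        self.name = name
        self.next = None

def build_chain():
    a = Node('a')
    b = Node('b')
    c = Node('c')
    # No two objects reference each other directly,
    # but a -> b -> c -> a forms a cycle through the chain.
    a.next = b
    b.next = c
    c.next = a

build_chain()
# After build_chain() returns, none of the three nodes is reachable from the
# program, yet each still has a reference count of 1, so reference counting
# alone cannot reclaim them; only the cycle detector can.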

So, what should we do?

In fact, Python itself can handle this situation. As we just mentioned, you can explicitly call gc.collect() to trigger garbage collection.

import gc

def func():
    show_memory_info('initial')
    a = [i for i in range(10000000)]
    b = [i for i in range(10000000)]
    show_memory_info('after a, b created')
    a.append(b)
    b.append(a)

func()
gc.collect()
show_memory_info('finished')

########## Output ##########

initial memory used: 49.51171875 MB
after a, b created memory used: 824.1328125 MB
finished memory used: 49.98046875 MB

As you can see, Python’s garbage collection mechanism is not that weak.

Python uses the mark-sweep algorithm and generational collection to enable automatic garbage collection for circular references. You may not be very familiar with these terms, so let me briefly explain them here.

First, let’s look at the mark-sweep algorithm. Let’s understand the concept of unreachable with graph theory. For a directed graph, if we start a traversal from a node and mark all the nodes it passes through, then after the traversal, the nodes that are not marked are called unreachable nodes. Obviously, the existence of these nodes is meaningless, so we need to garbage collect them.
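
To make the idea concrete, here is a minimal sketch of the “mark” step on a directed graph, assuming the graph is given as an adjacency dictionary. This only illustrates the concept; it is not CPython’s actual implementation.

def find_unreachable(graph, roots):
    """Mark every node reachable from the roots, then report the rest.

    graph: dict mapping each node to the list of nodes it references.
    roots: nodes the program can reach directly (e.g. global variables).
    """
    marked = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in marked:
            continue
        marked.add(node)
        stack.extend(graph.get(node, []))
    # Anything that was never marked is unreachable and can be swept.
    return set(graph) - marked

# 'd' and 'e' reference each other, but nothing reachable points to them.
graph = {'root': ['a'], 'a': ['b'], 'b': [], 'd': ['e'], 'e': ['d']}
print(find_unreachable(graph, roots=['root']))  # prints {'d', 'e'} (order may vary)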

Of course, traversing the entire graph every time would be a huge waste of performance for Python. Therefore, in Python’s garbage collection implementation, mark-sweep keeps the tracked objects in a doubly linked list and only considers container objects (because only container objects can form circular references). I won’t go into the specific algorithm here, as our focus is on the application.
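
You can observe the “container objects only” rule yourself with gc.is_tracked(), which reports whether the cycle detector is tracking a given object. The exact results may vary slightly across CPython versions because of internal optimizations.

import gc

# Atomic objects such as ints and strings cannot form reference cycles,
# so the collector does not track them.
print(gc.is_tracked(42))         # typically False
print(gc.is_tracked('hello'))    # typically False

# Container objects can reference other objects, so they are tracked.
print(gc.is_tracked([1, 2, 3]))  # True
print(gc.is_tracked({'a': []}))  # True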

The generational collection algorithm is another optimization technique.

Python divides all objects into three generations. Objects that have just been created are in generation 0. After garbage collection, objects that still exist are moved from the previous generation to the next one. The threshold for triggering automatic garbage collection for each generation can be specified separately. When the difference between the number of newly created objects and the number of deleted objects in the garbage collector reaches the corresponding threshold, garbage collection is triggered for that generation of objects.
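
You can inspect and tune these thresholds through the gc module. Here is a quick sketch; the default values mentioned in the comments are typical for CPython but may differ in your interpreter.

import gc

# Thresholds for generations 0, 1 and 2; the common CPython default is (700, 10, 10).
print(gc.get_threshold())

# Current collection counts for each generation.
print(gc.get_count())

# Raise the generation-0 threshold so automatic collections happen less often.
gc.set_threshold(5000, 10, 10)
print(gc.get_threshold())  # (5000, 10, 10)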

In fact, the idea behind generational collection is that newly created objects are more likely to be garbage collected, while objects that have survived longer are more likely to continue to survive. Therefore, this approach can save a lot of computational resources and improve Python’s performance.

After learning all of this, you should be able to answer the interviewer’s question. Yes, reference counting is the simplest implementation, but remember that a reference count of 0 is a sufficient condition for collection, not a necessary one. As for other possibilities, the circular references we just discussed are one of them.

Debugging Memory Leaks #

Although Python has an automatic garbage collection mechanism, it is not a silver bullet, and memory leaks can still occur. Memory leaks are something we never want to see, and they can seriously impact performance. So is there a good way to debug them?

The answer is yes, and now I will introduce a “handy assistant” for you.

It is objgraph, which is a very useful package for visualizing reference relationships. In this package, I mainly recommend two functions. The first one is show_refs(), which can generate a clear reference relationship graph.

With the following code and the generated reference graph, you can intuitively see that two lists are referencing each other, indicating a potential memory leak. With this information, it becomes easier to investigate at the code level.

import objgraph

a = [1, 2, 3]
b = [4, 5, 6]

a.append(b)
b.append(a)

objgraph.show_refs([a])

Another very useful function is show_backrefs(). Below is the sample code and the generated graph, please read it first:

import objgraph

a = [1, 2, 3]
b = [4, 5, 6]

a.append(b)
b.append(a)

objgraph.show_backrefs([a])

Compared to the previous reference graph, this graph appears slightly more complex. However, I still recommend mastering it because this API has many useful parameters, such as depth limit (max_depth), width limit (too_many), output format control (filename output), node filtering (filter, extra_ignore), etc. Therefore, I suggest that you carefully read the documentation before using it.
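
As a rough sketch of how these parameters are typically used, the call below limits the traversal depth and writes the graph to a PNG file instead of opening a viewer (the parameter values here are arbitrary):

import objgraph

a = [1, 2, 3]
b = [4, 5, 6]

a.append(b)
b.append(a)

objgraph.show_backrefs(
    [a],
    max_depth=2,                # follow at most two levels of back references
    too_many=5,                 # truncate nodes with more than five referrers
    filename='a-backrefs.png',  # write the graph to a file
)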

Summary #

In conclusion, let’s summarize what we’ve learned in this class. Today, we delved into Python’s garbage collection mechanism, and I would like to emphasize the following points:

  1. Garbage collection is a built-in mechanism in Python used to automatically release memory space that will no longer be used.
  2. Reference counting is the simplest implementation, but remember that a reference count of 0 is a sufficient, not a necessary, condition for collection; cycle detection is still needed to reclaim objects caught in circular references.
  3. Python’s automatic collection algorithms include mark and sweep and generational collection, mainly aimed at garbage collection of cyclic references.
  4. For debugging memory leaks, objgraph is a good visualization analysis tool.

Thought-provoking question #

Finally, here’s a thought-provoking question for you. Can you implement a garbage collection detection algorithm on your own? My requirement is simple. Given a directed graph with a starting point, which represents the entry point of a program, and given the directed edges, your task is to output the unreachable nodes.

I hope you can think carefully about this problem and write down your answer in the comments for discussion with me. Feel free to share this article as well, so that we can communicate and improve together.