23 Do You Really Understand Python’s GIL (Global Interpreter Lock) #

Hello, I am Jing Xiao.

In the previous lessons, we have learned about the concurrent programming features of Python and also have some understanding of multi-threading programming. However, there is another important topic in Python multi-threading called GIL (Global Interpreter Lock) that few people know about. Even many Python “veterans” consider GIL to be a mystery. Today, I will unravel the mystery for you and take you to explore GIL.

A Puzzle #

Hearing is deceiving, seeing is believing. Let’s start with an example to understand why the Global Interpreter Lock (GIL) can be so puzzling.

For example, consider this simple CPU-bound code:

def CountDown(n):
    while n > 0:
        n -= 1

Now, let’s assume a large number, n = 100000000, and try executing CountDown(n) in a single thread. On my 8-core MacBook, it takes approximately 5.4 seconds to complete.

Next, let’s try to speed it up using multiple threads with the following code:

from threading import Thread

n = 100000000

t1 = Thread(target=CountDown, args=[n // 2])
t2 = Thread(target=CountDown, args=[n // 2])
t1.start()
t2.start()
t1.join()
t2.join()

I ran this code on the same machine and discovered that instead of improving the speed, it actually slowed down the execution, taking a total of 9.6 seconds.

Still determined, I decided to try using four threads. Surprisingly, the runtime was still 9.8 seconds, almost identical to the result of two threads.
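To reproduce the two measurements above yourself, you can wrap both versions in a simple timing harness. Exact timings vary by machine, and I use a smaller n here (an assumption of this sketch, not the article's original 100000000) so that it runs quickly:

```python
import time
from threading import Thread

def CountDown(n):
    while n > 0:
        n -= 1

n = 10_000_000  # smaller than the article's n, to keep the run short

# Single-threaded version
start = time.perf_counter()
CountDown(n)
single = time.perf_counter() - start

# Two-threaded version: the same total work, split in half
start = time.perf_counter()
t1 = Thread(target=CountDown, args=[n // 2])
t2 = Thread(target=CountDown, args=[n // 2])
t1.start()
t2.start()
t1.join()
t2.join()
double = time.perf_counter() - start

print(f"single thread: {single:.2f}s, two threads: {double:.2f}s")
```

On CPython you should see the two-threaded version take about as long as, or longer than, the single-threaded one.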

What’s going on? Could it be that my MacBook is a fake? Take some time to think about this question or test it on your own computer. I, of course, had some self-reflection and came up with two hypotheses.

First suspicion: Is there something wrong with my machine?

This could be a reasonable guess. So, I tried the experiment on another computer with a single-core CPU. This time, I found that both single-threaded and two-threaded executions took 11 seconds. Although it didn’t exhibit the same phenomenon as the first machine where multithreading was slower than single-threading, the overall results were similar!

It seems unlikely that this is a computer-specific problem; rather, it appears that Python threads are not effectively achieving parallel computation.

Naturally, this led me to the second suspicion: Are Python threads fake threads?

Python threads do indeed encapsulate underlying operating system threads, such as POSIX Threads (Pthreads) in Linux and Windows Threads in Windows. Furthermore, Python threads are completely managed by the operating system, including coordination of execution, memory management, interrupt management, and more.

So, although Python threads and C++ threads are fundamentally different abstractions, their underlying mechanisms are not significantly different.
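You can observe this OS-level backing from Python itself. Here is a small sketch using threading.get_native_id() (available since Python 3.8), which returns the operating system's identifier for the calling thread:

```python
import threading

ids = []

def report():
    # get_native_id() returns the OS-level thread ID, showing that
    # each Python thread is backed by a real operating system thread
    ids.append(threading.get_native_id())

threads = [threading.Thread(target=report) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(ids)  # the OS-level IDs of the two worker threads
```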

Why does the GIL exist? #

It seems that neither of my two hypotheses can explain this unsolved mystery. So who is the real “culprit”? In fact, it is our protagonist today, the GIL, that causes the performance of Python threads to not be as expected.

GIL is a technical term from CPython, the most popular Python interpreter. It stands for Global Interpreter Lock and is essentially a mutex, much like the mutexes provided by operating systems. Before a Python thread executes bytecode in the CPython interpreter, it must first acquire this lock, which prevents other threads from executing at the same time.

Of course, CPython does some tricks by executing Python threads in turns. As a result, what users see is “pseudo-parallelism” - Python threads executing in an interleaved manner to simulate truly parallel threads.

So why does CPython need GIL? This actually has to do with the implementation of CPython. We will talk about Python’s memory management mechanism in the next section, but let’s briefly mention it today.

CPython uses reference counting to manage memory. Every instance created in a Python script has a reference count to record how many pointers are pointing to it. When the reference count becomes 0, the memory will be automatically released.

What does this mean? Let’s take a look at the following example:

>>> import sys
>>> a = []
>>> b = a
>>> sys.getrefcount(a)
3

In this example, the reference count of a is 3, because there are three places - a, b, and the getrefcount function that takes it as a parameter - that reference an empty list.

What does this imply? If two Python threads reference a at the same time and both try to update its reference count, a race condition can occur: the count may end up being incremented only once instead of twice, corrupting the bookkeeping. Later, when the first thread drops its reference, the count can reach 0 and the memory is released, even though the second thread still holds a pointer to a. When that thread accesses a again, it is touching memory that is no longer valid.

Therefore, the primary reasons for introducing GIL in CPython are as follows:

  • First, the designers wanted to avoid complex competition risks, such as memory management race conditions.
  • Second, CPython makes extensive use of C libraries, and most C libraries are not inherently thread-safe (thread safety reduces performance and increases complexity).

How does the Global Interpreter Lock (GIL) work? #

The following image is an example of how the GIL works in a Python program. In this example, Thread 1, 2, and 3 take turns executing. Each thread locks the GIL when it starts to prevent other threads from executing. Similarly, when a thread finishes executing a section, it releases the GIL to allow other threads to utilize resources.

GIL working example

You might have noticed a question: Why do Python threads voluntarily release the GIL? After all, if Python threads only lock the GIL when they start and never release it, other threads would never have a chance to run.

Well, CPython has another mechanism known as the check interval: the interpreter periodically checks on the currently running thread and, after a certain amount of time, forces it to release the GIL so that other threads have a chance to execute.

The implementation of the check interval has varied across Python versions. In early versions, it was approximately 100 ticks, roughly corresponding to 100 bytecode instructions. Starting from Python 3.2, the interval is time-based, with a default of 5 milliseconds (you can inspect it with sys.getswitchinterval()). Of course, we don’t need to delve into the specific time interval for GIL release; it should never be a dependency of our program design. We just need to understand that the CPython interpreter will release the GIL within a “reasonable” time frame.

GIL check_interval example
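You can inspect, and even tune, the modern time-based interval from the sys module, though as noted above your program's correctness should never depend on its exact value:

```python
import sys

# In Python 3.2+, the check interval is time-based; the default is 5 ms
print(sys.getswitchinterval())  # typically 0.005

# It can be adjusted, but programs should not rely on the exact value
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())
```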

In summary, each Python thread is encapsulated in a similar loop. Let’s take a look at the following code:

for (;;) {
    if (--ticker < 0) {
        ticker = check_interval;

        /* Give another thread a chance */
        PyThread_release_lock(interpreter_lock);

        /* Other threads may run now */

        PyThread_acquire_lock(interpreter_lock, 1);
    }

    bytecode = *next_instr++;
    switch (bytecode) {
        /* execute the next instruction ... */ 
    }
}

From this code snippet, we can see that each Python thread decrements a ticker before executing each bytecode instruction. When the ticker drops below zero, the thread releases the GIL, gives other threads a chance to run, and then reacquires the GIL before continuing with its own bytecode.

Python Thread Safety #

However, having the Global Interpreter Lock (GIL) does not mean that Python programmers don’t need to consider thread safety. Even though we know that the GIL only allows one Python thread to execute, as mentioned earlier, Python still has a preemption mechanism called check interval. Let’s consider the following code snippet:

import threading

n = 0

def foo():
    global n
    n += 1

threads = []
for i in range(100):
    t = threading.Thread(target=foo)
    threads.append(t)

for t in threads:
    t.start()

for t in threads:
    t.join()

print(n)

If you run this code, you will notice that although most of the time it prints 100, sometimes it might print 99 or 98.

This is because the line n += 1 is not thread-safe. If you disassemble the bytecode of the foo function, you will find that it actually consists of the following four bytecode instructions:

>>> import dis
>>> dis.dis(foo)
LOAD_GLOBAL              0 (n)
LOAD_CONST               1 (1)
INPLACE_ADD
STORE_GLOBAL             0 (n)

And the thread can be interrupted between any of these four instructions!

So, don’t ever think that having the GIL makes your program completely worry-free; we still need to pay attention to thread safety. As mentioned earlier, the design of the GIL is primarily for the convenience of CPython’s interpreter developers, not for Python application-level programmers. As Python users, we still need tools such as locks to ensure thread safety. For example:

n = 0
lock = threading.Lock()

def foo():
    global n
    with lock:
        n += 1
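Putting the pieces together, here is a complete runnable version of the locked counter; with the lock in place, the result is always 100:

```python
import threading

n = 0
lock = threading.Lock()

def foo():
    global n
    with lock:  # only one thread at a time may execute the increment
        n += 1

threads = [threading.Thread(target=foo) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(n)  # 100, every time
```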

How to bypass the GIL? #

By this point, some Python users might feel as if their martial arts skills have been stripped away, leaving them with only one move of the “Eighteen Dragon-Subduing Palms”. However, there is no need to feel discouraged. The GIL is a restriction imposed by the CPython interpreter. If your code does not need to be executed by the CPython interpreter, it is not subject to the GIL’s restriction.

In fact, many high-performance application scenarios already have Python libraries implemented in C, such as NumPy’s matrix operations, which are not affected by the GIL.

Therefore, in most application scenarios, you don’t need to worry too much about the GIL. Because if multithreaded computation becomes a performance bottleneck, there are often Python libraries available to solve this problem.

In other words, if your application has extremely strict performance requirements, such as a delay of 100us having a significant impact on your application, then I must say that Python may not be your optimal choice.

Of course, it is understandable that sometimes we just want to release ourselves from the constraints of the GIL temporarily, such as in deep learning applications, where most of the code is written in Python. In actual work, if we want to implement a custom differential operator or an accelerator for a specific hardware, then we have to implement the performance-critical code in C++ (not subject to the GIL), and then provide a Python calling interface.

In summary, you just need to remember the general approaches to bypass the GIL:

  1. Bypass CPython and use another implementation, such as Jython (a Python interpreter implemented in Java).
  2. Implement the performance-critical code in another language (usually C++).

Conclusion #

In today’s lesson, we first used a practical example to understand the impact of the Global Interpreter Lock (GIL) on applications. Then, we briefly analyzed the implementation principle of the GIL. You don’t need to delve into the details of the principles, just understand its main mechanism and the potential hazards.

Of course, I also provided you with two approaches to bypass the GIL. However, as I mentioned before, in many cases, we don’t need to worry too much about the impact of the GIL.

Thinking Questions #

Finally, I have two thinking questions for you.

First, when we are dealing with CPU-bound tasks (as described in the first example in the text), why is it sometimes slower to use multiple threads than a single thread?

Second, do you think the GIL (Global Interpreter Lock) is a good design? In fact, there have been many discussions about improving or even removing the GIL since Python 3. What are your thoughts on this? Have you ever encountered situations in your everyday work where the GIL has been a problem?

Feel free to write down your thoughts in the comments section, and also feel free to share today’s content with your colleagues and friends. Let’s communicate and progress together.