13 Constructing a Modular Python Stack

13 Constructing a Modular Python Stack #

Hello, I’m Jingxiao.

This is the last section of the Basics module. So far, you have mastered the basic skills and techniques of Python, left the beginner’s village, and seen a wider world, which has sparked your desire to spar with this world.

Therefore, you may start trying to write some more complex and systematic engineering projects or applications with a large amount of code. At this point, a simple .py file has become too bloated and cannot bear the responsibilities of heavyweight software development.

The main goal of today’s lesson is to simplify complexity by modularizing and organizing functionality into files. This way, you can build different components and functionalities in a large-scale project just like building with blocks.

Simple Modularization #

When it comes to the simplest way of modularization, you can split functions, classes, and constants into different files, put them in the same folder, and then use from your_file import function_name, class_name to call them. Afterwards, these functions and classes can be directly used within the file.

# utils.py

def get_sum(a, b):
    return a + b



# class_utils.py

class Encoder(object):
    def encode(self, s):
        return s[::-1]

class Decoder(object):
    def decode(self, s):
        return ''.join(reversed(list(s)))



# main.py

from utils import get_sum
from class_utils import *

print(get_sum(1, 2))

encoder = Encoder()
decoder = Decoder()

print(encoder.encode('abcde'))
print(decoder.decode('edcba'))

########## Output ##########

3
edcba
abcde

Let’s take a look at the code using this method: the get_sum() function is defined in utils.py, and the Encoder and Decoder classes are in class_utils.py. We directly call from import in the main function to import what we need.

Simple enough.

But is this enough? Of course not. Eventually, you will realize that keeping all files in one folder is not a sustainable solution.

So, let’s try creating some subfolders:

# utils/utils.py

def get_sum(a, b):
    return a + b



# utils/class_utils.py

class Encoder(object):
    def encode(self, s):
        return s[::-1]

class Decoder(object):
    def decode(self, s):
        return ''.join(reversed(list(s)))



# src/sub_main.py

import sys
sys.path.append("..")

from utils.class_utils import *

encoder = Encoder()
decoder = Decoder()

print(encoder.encode('abcde'))
print(decoder.decode('edcba'))

########## Output ##########

edcba
abcde

This time, our file structure looks like this:

.
├── utils
│   ├── utils.py
│   └── class_utils.py
├── src
│   └── sub_main.py
└── main.py

It is easy to see that when main.py calls a module in a subdirectory, all you need to do is use . instead of / to represent the subdirectory. In this case, utils.utils represents the utils.py module in the utils subfolder.

And what if we want to call a module in the parent directory? Note that sys.path.append("..") means that the current location of the program is moved up one level, and then we can call the modules in utils.

One thing to note is that importing the same module will only be executed once, preventing problems caused by duplicate imports. Of course, it is good programming practice to avoid importing code multiple times. In Facebook’s coding style, except for extremely special cases, imports must be at the very beginning of the program.

Finally, I want to mention version differences. You may have seen this requirement in many tutorials: we need to create an __init__.py file in the module’s folder, which can be empty or used to describe the module interfaces exposed by the package. However, in fact, this is the convention for Python 2. In the Python 3 convention, __init__.py is not necessary. This point is not mentioned in many tutorials or not explained clearly, so I hope you will pay attention to this.

Overall, this is the simplest way of module calling. When I first started using Python, this method was enough for me to complete projects during my college years. After all, many school projects have only single-digit numbers of files, and each file has only a few hundred lines of code. This organization method helped me complete tasks smoothly.

But after I joined Facebook, I found that a project workspace of a team may have thousands of files and hundreds of thousands to millions of lines of code. This calling method is no longer sufficient, and it is imperative to learn new ways of organizing modules.

Next, let’s systematically learn the scientifically organized way of modularization.

Project Modularization #

Let’s first review the concepts of relative paths and absolute paths.

In Linux systems, each file has an absolute path that starts with / to represent the path from the root directory to the leaf node. For example, /home/ubuntu/Desktop/my_project/test.py is an absolute path.

In addition, for any two files, there is a path that goes from one file to another. For example, /home/ubuntu/Downloads/example.json. If we want to access example.json from test.py, we would write '../../Downloads/example.json', where .. represents the parent directory. This is called a relative path.

Usually, when a Python file is running, it has a runtime location, which is initially the folder where the file is located. Of course, this runtime location can be changed later. By running sys.path.append(".."), we can change the current location of the Python interpreter. However, in general, I do not recommend it. It is necessary to fix a specific path for large projects.

Now that we have clarified these concepts, it is easy to understand how to set the module path in a project.

Firstly, you will realize that relative positions are not a good choice. Because code may be migrated, relative positions make refactoring both unattractive and error-prone. Therefore, in large projects, it is best to use absolute positions as a top priority. For an independent project, it is preferable to start tracing all module paths from the root directory of the project, which is called a relative absolute path.

In fact, at Facebook and Google, there is only one code repository for the entire company, where all the company’s code is stored. When I first joined Facebook, I found this confusing and fascinating, and naturally there were some concerns:

  • Doesn’t this increase the complexity of project management?
  • Is there a risk of privacy leakage between different code groups?

Later, as I delved into my work, I discovered several advantages unique to this code repository.

The first advantage is simplified dependency management. All code modules of the entire company can be called by any program you write, and the libraries and modules you write will also be called by others. The way to call them is by indexing from the root directory of the code, as mentioned earlier, using relative absolute paths. This greatly enhances code sharing and reusability. You don’t need to reinvent the wheel, you just need to search for existing packages or frameworks before you start coding.

The second advantage is unified versions. There is no situation where using a new module causes a series of functions to crash, and all upgrades need to pass unit tests to continue.

The third advantage is code tracing. You can easily trace where an API is called from and how its versions have iteratively developed and changed.

If you are interested, you can refer to this paper: https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext

When working on a project, although it is not possible to put all the world’s code into one folder, it is still necessary to have a modularization mindset-like approach. That is, use the root directory of the project as the basic directory, and import all modules by indexing down through each layer from the root directory.

Once you understand this point, let’s use PyCharm to create a project this time. The project structure is as follows:

.
├── proto
│   ├── mat.py
├── utils
│   └── mat_mul.py
└── src
    └── main.py
# proto/mat.py

class Matrix(object):
    def __init__(self, data):
        self.data = data
        self.n = len(data)
        self.m = len(data[0])
# utils/mat_mul.py

from proto.mat import Matrix

def mat_mul(matrix_1: Matrix, matrix_2: Matrix):
    assert matrix_1.m == matrix_2.n
    n, m, s = matrix_1.n, matrix_1.m, matrix_2.m
    result = [[0 for _ in range(n)] for _ in range(s)]
    for i in range(n):
        for j in range(s):
for k in range(m):
    result[i][k] += matrix_1.data[i][j] * matrix_2.data[j][k]

return Matrix(result)


# src/main.py

from proto.mat import Matrix
from utils.mat_mul import mat_mul


a = Matrix([[1, 2], [3, 4]])
b = Matrix([[5, 6], [7, 8]])

print(mat_mul(a, b).data)

########## Output ##########

[[19, 22], [43, 50]]

This example is very similar to the previous example, but please note that utils/mat_mul.py, you will see that it imports Matrix as from proto.mat. This approach directly imports from the project root directory and then imports the Matrix module from the mat.py file in the proto module, instead of using .. to import from the parent folder.

Isn’t it simple? You can use PyCharm to build all future projects like this. Place different modules in different subfolders and cross-module calls can be indexed directly from the top level, which is very convenient.

I guess your curiosity has arisen. You try to enter the src folder using the command line and directly enter python main.py, but it gives an error, saying it can’t find proto. You are not satisfied, so you go back to the previous directory and enter python src/main.py, but it continues to give an error, saying it can’t find proto.

What kind of black magic is PyCharm using?

In fact, when the Python interpreter encounters an import, it looks for modules in a specific list. You can obtain this list using the following code:

import sys

print(sys.path)

########## Output ##########

['', '/usr/lib/python36.zip', '/usr/lib/python3.6', '/usr/lib/python3.6/lib-dynload', '/usr/local/lib/python3.6/dist-packages', '/usr/lib/python3/dist-packages']

Please note that the first item is empty. Actually, what PyCharm does is set the first item to the absolute path of the project root directory. This way, every time you run main.py, the import function will look for the corresponding package in the project root directory.

You may ask, can you modify it so that the regular Python runtime environment can do the same? There are two methods to achieve this:

import sys

sys.path[0] = '/home/ubuntu/workspace/your_projects'

The first method, “go big or go home,” is to forcefully modify this position so that your import will be smooth. However, this is obviously not the optimal solution. Writing an absolute path in the code is not recommended (you can write it to a configuration file, but finding the configuration file also requires path searching, leading to an unsolvable loop).

The second method is to modify PYTHONHOME. Here I will briefly mention Python’s Virtual Environment. Python can create a brand new Python runtime environment using the Virtualenv tool, which is very convenient.

In fact, we advocate having a separate runtime environment for each project to maintain the integrity of packages and modules. The deeper details are beyond the scope of today’s discussion; you can research on your own.

Returning to the second method, in a Virtual Environment, you can find a file called activate. At the end of this file, add the following content:

export PYTHONPATH="/home/ubuntu/workspace/your_projects"

This way, every time you activate this runtime environment using activate, it will automatically add the project’s root directory to the search path.

The Magic of if __name__ == '__main__' #

In the last section, let’s talk about if __name__ == '__main__', which is a commonly seen syntax.

Python is a script language, and the biggest difference from C++ or Java is that it doesn’t require an explicit main() function entry. If you have experience with languages like C++ or Java, you should be familiar with the structure like main() {}, right?

Since Python allows writing code directly, besides making Python code look better (more like C++), what other benefits does the if __name__ == '__main__' syntax have?

Here is the project structure:

.
├── utils.py
├── utils_with_main.py
├── main.py
└── main_2.py



# utils.py

def get_sum(a, b):
    return a + b

print('testing')
print('{} + {} = {}'.format(1, 2, get_sum(1, 2)))



# utils_with_main.py

def get_sum(a, b):
    return a + b

if __name__ == '__main__':
    print('testing')
    print('{} + {} = {}'.format(1, 2, get_sum(1, 2)))



# main.py

from utils import get_sum

print('get_sum: ', get_sum(1, 2))

########## Output ##########

testing
1 + 2 = 3
get_sum: 3



# main_2.py

from utils_with_main import get_sum

print('get_sum: ', get_sum(1, 2))

########## Output ##########

get_sum_2: 3

With this project structure, it should be clear to you.

When importing a file, the import statement automatically executes all the code that is exposed. Therefore, if you want to encapsulate something as a module and still want it to be executable, you must put the code to be executed under if __name__ == '__main__'.

Why? In fact, __name__ is a magic built-in parameter in Python and essentially an attribute of the module object. When we use the import statement, __name__ will be assigned the name of that module, and of course, it won’t be equal to __main__. I won’t go into more details of the underlying mechanism, but you just need to understand this concept.

Summary #

In today’s class, I have explained to you how to use Python to build modular and large-scale projects. Here are a few key points to emphasize:

  1. By using absolute and relative paths, we can import modules.
  2. Modularization is crucial in large-scale projects, and the indexing of modules should be done using absolute paths, which start from the root directory of the program.
  3. Remember to use if __name__ == '__main__' to avoid execution during import.

Thought-provoking question #

Lastly, I’d like to leave you with a thought-provoking question. What is the difference between from module_name import * and import module_name? Feel free to leave a comment and share your thoughts with me. You are also welcome to share this article with your colleagues and friends.