16 In-depth Analysis of Linux Container Technology #

This section is divided into three parts: the first two cover resource isolation and limitation and the composition of container images; in the third, we will use containerd, a well-established container engine in the industry, as an example to explain how container engines are composed.

Containers #

Containers are a lightweight virtualization technology that eliminates the need for a hypervisor layer when compared to virtual machines. Let’s take a look at the following diagram depicting the startup process of a container.

[Image: the startup process of a container]

At the bottom, there is a disk where the container image is stored. On top of that sits a container engine, which can be Docker or any other container engine. When the engine receives a request, such as one to create a container, it takes the image from disk and runs the container as a process on the host machine.

For containers, the most important aspect is how to ensure that the resources used by this process are isolated and limited. This is achieved in the Linux kernel through two technologies: cgroups and namespaces. Next, using Docker as an example, we will delve into the details of resource isolation and container images.

1. Resource Isolation and Limitation #

Namespace #

Namespaces are used for resource isolation. The Linux kernel provides seven types of namespaces, of which Docker uses the first six. The seventh type, the cgroup namespace, is not used by Docker itself, but support for it is implemented in runC.

[Image: the seven namespace types in the Linux kernel]

Let’s look at them one by one:

  • The first one is the mount namespace. The mount namespace ensures that the container sees the file system view provided by its container image; it cannot see other files on the host, apart from those explicitly mounted in, for example through the -v parameter.
  • The second one is the UTS namespace, which isolates the hostname and domain name.
  • The third one is the PID namespace, which ensures that the container’s init process starts with process ID 1.
  • The fourth one is the network namespace. Except when the host network mode is used, all other network modes give the container its own network namespace.
  • The fifth one is the user namespace, which controls the mapping of UIDs and GIDs between the container and the host. This namespace is less commonly used.
  • The sixth one is the IPC namespace, which isolates inter-process communication resources, such as semaphores.
  • The seventh one is the cgroup namespace. The two images on the right side of the diagram show the cgroup view with the namespace enabled and disabled. With the cgroup namespace enabled, the container sees its cgroup hierarchy presented from its own root, the same way a process on the host sees its view. Another benefit is that using cgroups inside the container becomes more secure.
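
You can see which namespaces a process belongs to by listing /proc/<pid>/ns. A quick, hedged illustration (the inode numbers are illustrative and will differ per system):

```bash
# Each entry is a symlink to a namespace object; two processes in the
# same namespace see the same inode number.
$ ls -l /proc/self/ns
# cgroup -> cgroup:[4026531835]
# ipc    -> ipc:[4026531839]
# mnt    -> mnt:[4026531841]
# net    -> net:[4026531992]
# pid    -> pid:[4026531836]
# user   -> user:[4026531837]
# uts    -> uts:[4026531838]
```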

Now let’s briefly demonstrate creating a namespace with the unshare command. In containers, namespaces are in fact created through the unshare system call; the unshare command-line tool is a wrapper around it.

[Image: creating a PID namespace with the unshare command]

The top half of the image shows the usage of unshare, and the bottom half shows an actual example of creating a PID namespace with the unshare command. After the command runs, ps shows that the bash process has PID 1, confirming that bash is now in a newly created PID namespace.
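
As a minimal sketch reproducing that demonstration (output abbreviated; --mount-proc is needed so that ps reads the new namespace’s /proc):

```bash
# Fork a bash into new PID and mount namespaces.
$ sudo unshare --pid --fork --mount-proc bash
# Inside, bash is PID 1 of the new namespace:
# $ ps aux
# USER   PID  ... COMMAND
# root     1  ... bash
```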

cgroup #

Two cgroup Drivers #

cgroups are mainly used for resource limitation. Docker supports two cgroup drivers: systemd and cgroupfs.

[Image: the two cgroup drivers, cgroupfs and systemd]

  • cgroupfs is the easier one to understand. For example, to limit memory usage or set CPU shares, you write the process PID into the corresponding cgroup’s procs file, and write the resource limits into the matching memory and CPU cgroup files (see the sketch after this list).
  • The other is the systemd cgroup driver. It exists because systemd provides its own cgroup management mechanism: when systemd is the cgroup driver, all cgroup writes must go through the systemd interface, and modifying the cgroup files by hand is not allowed.
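
Here is a minimal cgroupfs sketch, assuming cgroup v1 mounted at /sys/fs/cgroup; the “demo” group name is arbitrary:

```bash
# Create a memory cgroup, cap it at 512 MB, and move the current
# shell into it by writing its PID into cgroup.procs.
$ sudo mkdir /sys/fs/cgroup/memory/demo
$ echo 536870912 | sudo tee /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
$ echo $$ | sudo tee /sys/fs/cgroup/memory/demo/cgroup.procs
# With the systemd driver, the same limit must instead be requested
# through systemd; for Docker, the driver is selected in daemon.json:
#   { "exec-opts": ["native.cgroupdriver=systemd"] }
```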

Commonly Used cgroups in Containers #

Let’s take a look at the commonly used cgroups in containers. Although the Linux kernel provides many types of cgroups, Docker containers typically use only the following six:

[Image: cgroups commonly used by Docker containers]

  • The first one is CPU, which typically sets cpu shares and cpuset to control CPU usage.
  • The second one is memory, which controls the amount of memory a process can use.
  • The third one is device, which controls the devices visible inside the container.
  • The fourth one is freezer. Both freezer and the third cgroup (device) serve security purposes. When you stop a container, all the processes in the current group are written into the freezer cgroup and frozen. This prevents any process from forking during the stop, which could otherwise allow processes to escape to the host.
  • The fifth one is blkio, which mainly limits the IOPS (Input/Output Operations Per Second) and bps (Bytes Per Second) of the disks the container uses. However, because cgroup v1 does not have a unified hierarchy, blkio can only throttle synchronous (direct) I/O; buffered I/O cannot be limited.
  • The sixth one is the PID cgroup, which limits the maximum number of processes allowed inside the container. A short demonstration of several of these limits follows below.
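
As a hedged demonstration of several of these cgroups (the flags are standard docker run options; the cgroup paths assume cgroup v1 with the cgroupfs driver):

```bash
# Start a container with memory, CPU, and PID limits.
$ docker run -d --name demo --memory 512m --cpus 0.5 --pids-limit 100 \
    busybox sleep 1d
# Check what Docker wrote into the container's cgroup files.
$ CID=$(docker inspect --format '{{.Id}}' demo)
$ cat /sys/fs/cgroup/memory/docker/$CID/memory.limit_in_bytes   # 536870912
$ cat /sys/fs/cgroup/pids/docker/$CID/pids.max                  # 100
```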

Less Commonly Used cgroups #

There is also a set of cgroups that Docker containers do not use. The distinction between commonly and less commonly used cgroups is specific to Docker: runC supports all of them except rdma, at the bottom of the image, yet Docker does not enable the ones shown below.

[Image: cgroups not enabled by Docker]

2. Container Image #

docker images #

Next, let’s talk about container images using Docker images as an example.

Docker images are based on the concept of a union file system. To put it simply, a union file system allows files to be stored across different layers while being viewed as one unified whole.

[Image: Docker layered storage, from the Docker official website]

As shown in the above image, on the right is a structure diagram of container storage taken from the Docker official website. It illustrates Docker’s storage mechanism, which is based on a union file system and organized in layers. Each layer is called a “Layer”; layers are made up of different files and can be reused by other images. When an image is run as a container, a read-write layer is created on top of it as the container’s topmost layer. This read-write layer can in turn be committed, with the “commit” command, to become the newest top layer of an image.

Docker image storage sits on top of different underlying file systems, so its storage drivers are tailored to them: AUFS, btrfs, devicemapper, overlay, and so on. Docker implements a corresponding graph driver for each of these file systems and uses it to lay the image out on disk.
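
To see which graph driver your daemon uses and where an image’s layers live, a hedged sketch (output varies by host):

```bash
$ docker info --format '{{.Driver}}'
# overlay2
# Show the on-disk layer directories of an image:
$ docker image inspect busybox --format '{{json .GraphDriver.Data}}'
# {"LowerDir":"...","MergedDir":"...","UpperDir":"...","WorkDir":"..."}
```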

Using “overlay” as an example #

Storage Process #

Next, let’s take the overlay file system as an example to see how Docker images are stored on disk. First, take a look at the following diagram, which briefly describes how the overlay file system works.

[Image: how the overlay file system works]

The bottom layer is a “lower” layer, which is the image layer and is read-only. The top right layer is an “upper” layer, which is the container’s read-write layer. The upper layer adopts a copy-on-write mechanism, which means that a file is only copied from the lower layer to the upper layer when it needs to be modified. After that, all subsequent modifications will be made to the copy in the upper layer.

Beside the upper layer, there is a “workdir” which acts as an intermediate layer. When modifying a copy in the upper layer, it is first placed in the workdir, and then moved from the workdir to the upper layer. This is the working mechanism of overlay.

The topmost layer is the “mergedir”, the unified view layer, which integrates all the data from the upper and lower layers. When we enter a container with “docker exec”, the file system we see is actually this unified view.
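
To make the mechanism concrete, here is a minimal sketch that mounts an overlay by hand (directory names are arbitrary):

```bash
# overlay needs a read-only lower dir, a writable upper dir,
# a work dir for its internal atomic operations, and a mount point.
$ mkdir -p lower upper work merged
$ echo "from the image layer" > lower/a.txt
$ sudo mount -t overlay overlay \
    -o lowerdir=lower,upperdir=upper,workdir=work merged
$ cat merged/a.txt            # read is served from the lower layer
$ echo "changed" > merged/a.txt
$ ls upper                    # a.txt was copied up and modified here
```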

File Operations #

Next, let’s talk about how to manipulate the files inside a container using the overlay storage mechanism.

[Image: file operations under overlay]

Let’s start with read operations. When a container is newly created, the upper layer is empty. So, if we read the files at this point, all the data will be read from the lower layer.

As for write operations, as mentioned earlier, the overlay upper layer has a copy-on-write mechanism. When a file needs to be modified, overlay will perform a “copy up” action and copy the file from the lower layer to the upper layer. All subsequent write operations will be performed on this copied file.

Now, let’s move on to deletion operations. In overlay, there is no real deletion; a “deleted” file is simply marked. When looking at the files from the topmost unified view layer, a file that carries such a mark is treated as deleted and is not displayed. There are two ways to mark a file:

  • One way is through a “whiteout” file.
  • The second, used for directories, is to set an extended attribute on the directory to mark its deletion. A sketch of the whiteout mechanism follows below.
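
Continuing the manual mount above, a hedged sketch of what a whiteout looks like (the 0, 0 character device is how overlay marks a deleted file):

```bash
$ rm merged/a.txt
# The lower layer still contains a.txt; the upper layer now holds
# a whiteout for it: a character device with device number 0, 0.
$ ls -l upper/a.txt
# c--------- 1 root root 0, 0 ... a.txt
```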

Steps to Perform #

Next, let’s see how docker run actually starts a busybox container, and what its overlay mount point looks like.

[Image: docker run busybox and its overlay mount point]

The “mount” output in the image shows how the container’s rootfs is mounted: the mount type is overlay, and its options include the three directories: upper, lower, and workdir.

Next, let’s look at writing a new file inside the container. After using “docker exec” to create a new file, we can see (as shown in the image) that the file lands in the upperdir, which for overlay2 is the layer’s “diff” directory. Looking inside the upperdir, we find the file and see that its content is exactly what the “docker exec” write produced.

Finally, at the bottom, we have the mergedir which integrates the contents of the upper and lower directories. Here, we can also see the data we wrote.
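
A hedged reproduction of those steps (container name, mount options, and paths will differ on your machine):

```bash
$ docker run -d --name demo busybox top
$ mount | grep overlay
# overlay on /var/lib/docker/overlay2/<id>/merged type overlay
#   (rw,relatime,lowerdir=...,upperdir=...,workdir=...)
$ docker exec demo sh -c 'echo hello > /new-file'
# The new file lands in the container's read-write (upper) layer:
$ UPPER=$(docker inspect --format '{{.GraphDriver.Data.UpperDir}}' demo)
$ cat "$UPPER/new-file"
# hello
```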

3. Container Engine #

Explaining the Containerd Container Architecture #

Next, let’s talk about container engines. We will discuss the architecture of containerd, a container engine project hosted by the CNCF. The following diagram is taken from the containerd official website and provides an overview of the engine’s architecture.

[Image: containerd architecture, from the containerd official website]

If we divide the diagram into two parts, we can say that containerd provides two major functionalities.

The first is the runtime, which manages the container lifecycle. The second, shown on the left side of the diagram, is the storage component, which manages image storage; containerd is responsible for both pulling and storing images.

In terms of horizontal layers:

  • The first layer is gRPC. Containerd serves the upper layers through its gRPC server, and the Metrics section mainly exposes cgroup metrics.
  • In the layer below, the left side holds the storage of container images. The middle contains images and containers, whose metadata is stored on disk in boltdb. On the right, Tasks manage the container’s runtime state, and Events are emitted for container operations; the upper layers can subscribe to these events to follow changes in container status.
  • The bottom layer is the Runtimes layer, where runtimes are distinguished by type, such as runC or a secure container.

What are Shim V1/V2? #

Now let’s discuss the runtime architecture of containerd. The following diagram is taken from the Kata Containers official website. The top half shows the original diagram, while the bottom half adds examples of concrete shims. Together they illustrate containerd’s runtime architecture in more detail.

[Image: containerd runtime (shim) architecture, based on a Kata Containers diagram]

The diagram reads from left to right: from the upper layers down to the runtime that finally executes the container.

Let’s start with the leftmost part, which is the CRI Client. Usually, kubelet sends requests to containerd through CRI. After receiving a container request, containerd passes it through a containerd shim. The containerd shim manages the container lifecycle and is responsible for two things:

  • The first is forwarding I/O.
  • The second is passing signals.

The top half of the diagram shows the flow for secure containers, specifically kata. We won’t go into details here. In the bottom half, you can see various types of shims. Now let’s explore the architecture of the containerd shim.

Initially, containerd had only one shim, shown as the blue containerd-shim box. Whether the container was a kata, runc, or gvisor container, this same containerd-shim process was used.

Later, containerd was extended for different types of runtimes. This extension was done using the shim-v2 interface. Thus, by implementing this shim-v2 interface, different runtimes can customize their own shim. For example, runC can create a shim called shim-runc, gvisor can create a shim called shim-gvisor, and kata can create a shim called shim-kata. These custom shims can replace the blue box containerd-shim mentioned earlier.

There are several benefits to this approach. The kata diagram illustrates one of them: with shim-v1, three separate components are needed, owing to a limitation of kata itself. With the shim-v2 architecture, however, these three components can be bundled into a single binary, one shim-kata component, which demonstrates one of the advantages of shim-v2.

Explaining the Containerd Container Architecture - Container Workflow Example #

Next, let’s explain how container workflows work using two examples. The following two diagrams depict the workflow of a container based on the containerd architecture.

Start Workflow #

Let’s start with the workflow for starting a container:

[Image: the container start workflow]

This diagram consists of three parts:

  • The first part represents the container engine, which can be Docker or others.
  • The dashed boxes represent containerd and containerd-shim, which are part of the containerd architecture.
  • At the bottom is the container part; the container is launched by a runtime, which you can picture as the shim invoking the runC command to create it.

First, let’s understand how this workflow operates. The numbers 1, 2, 3, and 4 in the diagram represent the steps taken by containerd to create a container.

First, containerd creates the metadata, then sends a request to the task service to create the container. Through a series of components, the request finally reaches a shim. Containerd and the shim interact over gRPC; once the shim receives the creation request, it calls the runtime to create the container. These are the steps involved in starting a container.
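
As a hedged sketch of driving this flow directly with containerd’s ctr client (image and container names are illustrative):

```bash
# Pull an image into containerd's content store.
$ ctr images pull docker.io/library/busybox:latest
# Create and start a container: containerd records the metadata,
# the task service reaches a shim over gRPC, and the shim invokes
# the runtime (e.g. runC) to start the process.
$ ctr run -d docker.io/library/busybox:latest demo
$ ctr tasks ls
```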

Exec Workflow #

Now let’s look at how to execute a command within a container. The workflow is very similar to the start workflow, and the structure is also similar. The only difference is how containerd handles this workflow. Similar to the previous diagram, I have labeled the steps 1, 2, 3, and 4 in this diagram, representing the sequence of steps taken by containerd for the exec workflow.

[Image: the exec workflow]

As shown in the diagram, the execution operation is still sent to the containerd-shim. From the perspective of a container, there is no significant difference between starting a container and executing a command within a container.

The main difference lies in whether a namespace needs to be created for the process running inside the container:

  • When executing a command, the process needs to be added to an existing namespace.
  • When starting a container, the namespace for the container’s process needs to be specially created.
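
The command-line counterpart, again a hedged sketch with ctr (the exec-id is an arbitrary required identifier):

```bash
# Run a shell inside the existing container: the shim creates the
# process and joins it to the container's existing namespaces.
$ ctr tasks exec --exec-id demo-exec -t demo sh
```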

Summary #

To conclude, it is hoped that after reading this section, students will have a deeper understanding of Linux containers. Here is a brief summary:

  1. How containers use namespaces for resource isolation and cgroups for resource limitation.
  2. A simple introduction to container image storage based on the overlay file system.
  3. An example using docker+containerd to illustrate how container engines work.