04 Understanding Pod and Container Design Patterns #

In Kubernetes, the Pod is the smallest deployable object. A Pod encapsulates one or more closely related containers that share the same network namespace and volumes, and it provides a way to group containers so they can work together.

Pod design usually follows one of these patterns:

  1. Single-container pattern: the Pod contains only one container. This is the simplest pattern and suits scenarios where a single application is all that needs to run.

  2. Sidecar pattern: the Pod contains one main container and one or more sidecar containers. The main container is typically the application container, while the sidecars provide auxiliary services such as log collection or monitoring. The sidecars share the same network and storage resources as the main container.

  3. Shared-namespace pattern: the containers in the Pod share the same Linux namespaces and coordinate through signal-based communication. This pattern is typically used for containers that must talk to each other, for example one main container and several worker containers.

  4. Shared-data pattern: the containers in the Pod share one or more volumes so that data can flow between them easily. This pattern is typically used to share configuration files, databases, and similar data.

When choosing a Pod and container design pattern, consider the following factors:

  • Fit: does the pattern suit the existing application and business requirements?
  • Scalability: can the number of containers be increased or decreased easily?
  • Flexibility: can containers be modified or replaced conveniently?
  • Efficiency: does the pattern make the most effective use of resources?

Understanding Pod and container design patterns is essential for developing and deploying applications on Kubernetes. Choosing an appropriate pattern lets you make better use of containerization and deliver maintainable, scalable applications.

Why do we need Pods #

Basic concept of containers #

Let’s start with the first question: why do we need Pods? We know that Pods are a very important concept in the Kubernetes project and serve as atomic scheduling units. But why do we need such a concept when using Docker containers? The answer lies in understanding the nature of containers. So let’s review the concept of containers:

The essence of a container is actually a process that is isolated and resource-constrained.

The process with PID=1 inside a container is the application itself. This means that managing a virtual machine is managing infrastructure, because we are managing the machine, whereas managing a container is managing the application itself directly. This is the best embodiment of immutable infrastructure: your application is your infrastructure, and it must be immutable.

With that in mind, what is Kubernetes? Many people say that Kubernetes is the operating system of the cloud era, which is an interesting analogy: by extension, container images are the software packages of that operating system.


Real-world examples in operating systems #

If Kubernetes is considered an operating system, let’s take a look at real-world examples from operating systems.

Consider a program called Helloworld that is actually composed of a group of processes. Note that "processes" here are equivalent to threads in Linux.

Since threads in Linux are lightweight processes, if you look at the pstree of Helloworld on a Linux system, you will see that Helloworld is actually composed of four threads: {api, main, log, compute}. These four threads collaborate and share the resources of the Helloworld program, and together they constitute its actual working state.

This is a very real example of a process group or thread group in an operating system. The concept of a process group is illustrated above.


Now think about it: in a real operating system, a program is often managed as a process group. Kubernetes likens itself to an operating system such as Linux; containers, as discussed above, can be seen as processes, the equivalent of those Linux threads. So what is a Pod? A Pod is precisely the process group, the counterpart of a thread group in Linux.

Concept of process groups #

When it comes to process groups, it is recommended to have a conceptual understanding before diving into the details.

Let’s use the previous example again: The Helloworld program consists of four processes that share some resources and files. Now, here’s a question: If we were to run the Helloworld program in a container, how would you do it?

The most natural solution would be to start a Docker container and run four processes inside it. However, this raises a problem: In this case, which process should be the one with PID=1 inside the container? For example, it should be the main process, but then, who is responsible for managing the remaining three processes?

The core problem here is that containers are designed around a "single-process" model. This doesn't mean a container can only hold one process; rather, because the container application is equivalent to a process, the container can only manage the process with PID=1, and any other processes are merely hosted. So trouble follows unless the service application process itself has process-management capabilities.

For example, the Helloworld program would need to have systemd-like process-management capabilities built in, or we would have to change the PID=1 process inside the container to systemd outright; otherwise the application, or the container, has no way to manage multiple processes. After all, the PID=1 process is the application itself: if we kill it, or it dies at runtime, the resources of the remaining three processes are never reclaimed, which is a very serious problem.

Conversely, if we actually replace the application with systemd, or run a systemd process inside the container, we hit another problem: managing the container no longer means managing the application itself, but managing systemd. In that case, is the application exiting? Is it failing? Is it hitting an exception? You cannot tell directly, because what the container manages is systemd. This is why running a complex program inside a single container is so difficult.

To summarize: Due to the “single-process” model of Linux containers, if you start multiple processes inside a container, only one of them can be considered as the process with PID=1. In this case, if this PID=1 process fails or exits, the other three processes will become orphaned naturally, with no one to manage or reclaim their resources. This is not an ideal situation.

Note: The “single-process” model of Linux containers refers to the fact that the lifecycle of the container is equivalent to the lifecycle of the process with PID=1 (the container application process), and it doesn’t mean that containers cannot have multiple processes inside them. However, in general, the container application process doesn’t have process management capabilities. So, if you create other processes inside the container using exec or ssh, they can easily become orphaned processes if they exit abnormally (e.g., ssh termination).

On the flip side, it is actually possible to run systemd inside a container to manage all the other processes. However, this leads to a second problem: it becomes difficult to directly manage the application itself because it is taken over by systemd. In this case, the lifecycle of the application state no longer matches the container lifecycle. This management model can become very complicated.


Pod = “Process Group” #

In Kubernetes, a Pod is actually an abstraction of a concept similar to a process group.

As mentioned earlier, the Helloworld application, which consists of four processes, would be defined as a Pod with four containers in Kubernetes. It is crucial to understand this concept thoroughly.

That is, we have four processes with different responsibilities that cooperate with one another and need to run in containers. In Kubernetes they would not be placed inside a single container, because that would run into the two problems described above. So what does Kubernetes do instead? It starts four independent containers for the four processes and then defines them inside one Pod.

Therefore, when Kubernetes brings up the Helloworld application, you actually see four containers. They share certain resources, and those resources belong to the Pod. Hence we say that in Kubernetes a Pod is only a logical unit with no physical counterpart; the real physical entities are the four containers, and the combination of these containers is what we call a Pod. One further concept must be clear: because the containers inside it need to share certain resources, the Pod is the unit of resource allocation in Kubernetes, and it is also Kubernetes's atomic scheduling unit.
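
To make this concrete, here is a minimal sketch of what such a Pod definition might look like, assuming four hypothetical images, one per cooperating process:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: helloworld
spec:
  containers:                                   # four cooperating containers, one per process
  - name: api
    image: example.com/helloworld-api:v1        # hypothetical image
  - name: main
    image: example.com/helloworld-main:v1       # hypothetical image
  - name: log
    image: example.com/helloworld-log:v1        # hypothetical image
  - name: compute
    image: example.com/helloworld-compute:v1    # hypothetical image
```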

The Pod design described above is not something Kubernetes invented on its own; it addresses a problem that Google discovered during the development of Borg, and the Borg paper describes it very clearly. In short, Google engineers found that applications deployed on Borg often had "process-group-like" relationships: they collaborate closely and therefore must be deployed on the same machine and share certain information.

This is the concept of a process group, and it is also the usage of Pods.

Why must a Pod be an atomic scheduling unit? #

You may have some questions at this point: Although we understand that a Pod is a process group, why do we need to abstract the concept of a Pod? Can’t we solve this problem through scheduling? Why must a Pod be the atomic scheduling unit in Kubernetes?

Let’s explain this through an example.

Suppose we have two containers that are tightly coupled and should be deployed in a Pod together. Specifically, the first container is called App, which is the business container and writes log files. The second container is called LogCollector, which forwards the log files written by the App container to the backend ElasticSearch.

The resource requirements of the two containers are as follows: the App container requires 1GB of memory, and the LogCollector requires 0.5GB of memory. And the available memory in the current cluster environment is as follows: NodeA: 1.25GB of memory, NodeB: 2GB of memory.

If there were no Pod concept and we only had two containers that must run in close cooperation on a single machine, what happens if the scheduler places the App on NodeA first? You will find that the LogCollector can no longer be scheduled onto NodeA because there is not enough memory left. At this point the application as a whole is already broken: scheduling has failed and must be redone.


This is a very typical example of a co-scheduling failure. This problem is called the task co-scheduling problem. It does have solutions in many projects.

For example, in Mesos, it does something called resource hoarding: that is, it starts the unified scheduling only when all tasks with affinity constraints are ready. This is a very typical solution to the co-scheduling problem.

So in Mesos, the “App” and “LogCollector” containers would not be immediately scheduled, but would wait until both containers are submitted before starting the unified scheduling. This approach also brings new problems. First, scheduling efficiency is reduced because of the waiting time. Additionally, waiting may lead to deadlocks, resulting in a situation where tasks are mutually waiting. These mechanisms need to be addressed in Mesos and add additional complexity.

Another approach is Google's. In the Omega system (the next generation of Borg), a very complex and powerful solution known as optimistic scheduling is implemented: scheduling proceeds regardless of such conflicting situations, and a sophisticated rollback mechanism resolves conflicts afterwards by rolling back. This method is more elegant and efficient, but its implementation is very complex; as most people can appreciate, a pessimistic-locking design is far simpler to build than an optimistic one.

A task co-scheduling problem like this is directly solved in Kubernetes through the concept of a Pod. In Kubernetes, the App container and LogCollector container are definitely part of the same Pod. They are scheduled as a unit based on a Pod, so this problem does not exist at all.
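
A minimal sketch of that Pod, with hypothetical image names; the memory requests mirror the numbers above, so the scheduler looks for a node with at least 1.5GB of free memory and places the whole Pod on NodeB:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-collector
spec:
  containers:
  - name: app                                 # business container, writes log files
    image: example.com/app:v1                 # hypothetical image
    resources:
      requests:
        memory: "1Gi"
  - name: log-collector                       # forwards the App's logs to ElasticSearch
    image: example.com/log-collector:v1       # hypothetical image
    resources:
      requests:
        memory: "512Mi"
```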

Understanding Pods Again #

After discussing these points, let’s understand Pods again. First of all, the containers in a Pod have an “ultra-intimate relationship”.

The word "ultra" here deserves attention. An ordinary intimate relationship can always be satisfied through scheduling.


For example, if two Pods need to run on the same host, that is an intimate relationship the scheduler can handle. For an ultra-intimate relationship, however, scheduling is not enough: it must be expressed through a Pod, because if the ultra-intimate relationship is not captured this way, the whole application may not even be able to start.

What does an ultra-intimate relationship mean? It can be divided into the following categories:

  • For example, two processes will exchange files. The previous example is like this, one writes logs and the other reads logs.
  • Two processes need to communicate through localhost or local sockets. This kind of local communication is also an ultra-intimate relationship.
  • Two containers or microservices need to make very frequent RPC calls. For performance reasons, they also need to have an ultra-intimate relationship.
  • Two containers or applications need to share certain Linux namespaces. The simplest and most common example is that one container needs to join another container’s network namespace. This way, I can see the network devices of the other container and its network information.

Relationships like these mentioned above belong to ultra-intimate relationships, and they are addressed through the concept of a Pod in Kubernetes.

Now that we understand the concept and design of Pods and why Pods are needed, they solve two problems:

  1. How to describe ultra-intimate relationships.
  2. How to perform unified scheduling for containers or businesses with ultra-intimate relationships. This is the primary requirement of Pods.

Mechanism of Pod implementation #

Problems to be solved by Pod #

Pod is a logical concept. How it is implemented on a machine is the second question we need to explain.

Since the Pod is meant to solve the problems above, the key question is how to efficiently share certain resources and data among the multiple containers within a Pod.

Because containers are originally separated by Linux Namespace and cgroups, the actual challenge now is how to break this isolation and share certain things and information. This is the core problem that the design of Pod needs to address.

Therefore, the specific solutions are divided into two parts: networking and storage.

Networking sharing #

The first question is how the multiple containers in a Pod can share networking. Here is an example:

For example, let’s say there is a Pod that includes two containers, container A and container B. They both need to share the Network Namespace. In Kubernetes, the solution is as follows: an additional Infra container is started in each Pod to share the entire Pod’s Network Namespace.

The Infra container is a very small image, about 100–200 KB; it runs a tiny program that does nothing but pause, so it is permanently in a "paused" state. With this Infra container in place, all the other containers join its Network Namespace through the Join Namespace mechanism.

Therefore, all containers within a Pod have the same network view. That is, the network devices, IP addresses, MAC addresses, and other network-related information they see are all the same, and they all come from the Infra container created when the Pod was first created. This is one of the ways Pod solves network sharing.

In a Pod, there must be an IP address, which corresponds to the Network Namespace of the Pod and is also the IP address of the Infra container. So what everyone sees is the same, and all other network resources are shared by all containers within the Pod. This is how Pod implements networking.

Because there needs to be an intermediate container, the Infra container must be started first in the entire Pod. The lifecycle of the entire Pod is the same as the Infra container and is independent of containers A and B. This is why Kubernetes allows you to update a specific image inside a Pod separately. This operation will not rebuild or restart the entire Pod, which is a very important design.
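
A minimal sketch of a two-container Pod that illustrates this; the images are ordinary public ones, and container B reaches container A over localhost because both joined the Infra container's Network Namespace:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-network-demo
spec:
  containers:
  - name: container-a
    image: nginx:1.25                 # listens on port 80 inside the Pod's network namespace
  - name: container-b
    image: busybox:1.36
    # container-b reaches container-a via localhost: both containers share
    # the network namespace created by the Infra (pause) container.
    command: ["sh", "-c", "while true; do wget -q -O- http://localhost:80 >/dev/null; sleep 5; done"]
```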


Storage sharing #

The second question is how Pod shares storage, which is relatively simple.

For example, suppose there are two containers: one runs Nginx, and the other is an ordinary container that produces files I want to serve through Nginx, so the two need to share a directory. Sharing files or directories inside a Pod is very simple: volumes are declared at the Pod level, and every container belonging to that Pod can then mount and share them.


In the example above, the volume is called "shared-data" and is declared at the Pod level. Each container can simply declare that it mounts the "shared-data" volume; once it does, the directory it sees is exactly the same directory every other container sees. This is how Kubernetes lets containers share storage through Pods.

So in the earlier example, when the application container App writes its logs into a volume, another container such as the LogCollector can see them immediately, as long as both declare that they mount the same volume. This is how a Pod implements storage sharing.
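
A minimal sketch of the shared-data example; the writer container is a placeholder that simply drops a file into the shared volume, which Nginx then serves:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-data-demo
spec:
  volumes:
  - name: shared-data                 # Pod-level volume, visible to every container that mounts it
    emptyDir: {}
  containers:
  - name: nginx
    image: nginx:1.25
    volumeMounts:
    - name: shared-data
      mountPath: /usr/share/nginx/html
  - name: content-writer
    image: busybox:1.36
    volumeMounts:
    - name: shared-data
      mountPath: /data
    command: ["sh", "-c", "echo 'written by the other container' > /data/index.html; sleep 3600"]
```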

Explanation of Container Design Patterns #

Now that we understand why Pods are necessary and how Pods are implemented, let’s dive into a concept that Kubernetes highly advocates for: container design patterns.

Example #

Next, we will use an example to illustrate this concept.

Let's say I have a common requirement: I need to deploy an application written in Java. The application is a WAR package that must be placed in Tomcat's webapps directory in order to start. How should the WAR package and Tomcat be packaged into containers? There are a few ways to do it.


  • The first approach is to package the WAR package and Tomcat into one image. The problem is that the image then combines two different components, so whether we want to update the WAR package or Tomcat, we have to build a new image each time, which is quite cumbersome.
  • The second approach is to package only Tomcat in the image and mount the WAR package into the Tomcat container from the host using a volume, such as a hostPath, under the webapps directory. When the container starts, it can use the WAR package mounted there; a minimal sketch follows this list.
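
A minimal sketch of the second approach, assuming the WAR package has already been placed in a hypothetical directory on the host:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tomcat-hostpath
spec:
  volumes:
  - name: app-volume
    hostPath:
      path: /opt/apps/sample          # hypothetical host directory that contains sample.war
  containers:
  - name: tomcat
    image: tomcat:9.0
    ports:
    - containerPort: 8080
    volumeMounts:
    - name: app-volume
      mountPath: /usr/local/tomcat/webapps
```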

However, we encounter a problem with this approach: it requires maintaining a distributed storage system. Since containers are portable and their state is not persistent (meaning they might run on host machine A the first time and then move to host machine B when restarted), we must maintain a distributed storage system so that the container can find the WAR package and its data regardless of whether it is on host machine A or host machine B.

Note that even with a distributed storage system for volumes, you still need to maintain the WAR package inside the volume. For example, you would need to write a dedicated Kubernetes Volume plugin that downloads the WAR package required for application startup into the volume before each Pod starts, so that it can be mounted and used by the application.

This approach adds a level of complexity, and the container itself relies on a set of persistent storage plugins (used to manage the content of the volume that contains the WAR package).

InitContainer #

Therefore, have you ever considered whether there is a more general approach to achieve this kind of combination, even without a distributed storage system when using Kubernetes locally?

Actually, there is. In Kubernetes, this kind of combination is achieved with an Init Container.


Let's use the same example: in the YAML file shown in the diagram above, we first define an Init Container whose sole task is to copy the WAR package from its image into a volume; once this is done, it exits. Init Containers start before the user containers and run to completion strictly in the order in which they are defined.

The key lies in the destination of that copy: the app directory is actually a volume, and as mentioned earlier, multiple containers within a Pod can share volumes. The Tomcat image now contains only Tomcat itself, but at startup the Tomcat container declares the same app volume and mounts it under its webapps directory.

By the time the second step starts the Tomcat container, the volume is guaranteed to contain the sample.war copied earlier, so any Tomcat container that mounts this volume will find it.

We can describe it as follows: this Pod is self-contained and can be successfully deployed on any Kubernetes instance in the world without requiring a distributed storage system or persistent volumes.
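
A minimal sketch of such a self-contained Pod; the WAR image name is a placeholder for an image that contains nothing but sample.war:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: javaweb
spec:
  volumes:
  - name: app-volume
    emptyDir: {}                              # lives and dies with the Pod, no external storage needed
  initContainers:
  - name: war
    image: example.com/sample-war:v1          # hypothetical image containing only sample.war
    command: ["cp", "/sample.war", "/app/"]   # copy the WAR into the shared volume, then exit
    volumeMounts:
    - name: app-volume
      mountPath: /app
  containers:
  - name: tomcat
    image: tomcat:9.0
    ports:
    - containerPort: 8080
    volumeMounts:
    - name: app-volume
      mountPath: /usr/local/tomcat/webapps    # Tomcat finds sample.war here when it starts
```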

This is an example of combining two containers with different roles and using an orchestration method like Init Container to package an application using Pods. This concept is a very typical container design pattern in Kubernetes called “Sidecar”.

Container Design Pattern: Sidecar #

What is Sidecar? It means that within a Pod, we can define specialized containers to perform auxiliary tasks required by the main business container. As we mentioned in the previous example, we essentially did one thing: the Init Container, which is a Sidecar responsible for copying the WAR package from the image to the shared directory so that Tomcat can use it. What other operations are there? For example:

  • Tasks that previously required SSHing into the container, such as running setup scripts or checking preconditions, can instead be handled through an Init Container or a Sidecar;
  • Another typical example is log collection. Log collection is itself a process, a small container, so it can be packaged into the Pod to do the collection work;
  • Another very important use is debugging. An entire application can now be debugged by defining an extra small container in the application Pod, which executes inside the application Pod's namespaces;
  • It can also check the working status of other containers. There is no longer any need to SSH into the container to check: just build the monitoring component into an extra small container, start it as a Sidecar, and let it cooperate with the main business container. Business monitoring can therefore also be done through the Sidecar approach.

One very obvious advantage of this approach is that it decouples the auxiliary functions from the business container, so the Sidecar container can be released independently. More importantly, this capability can be reused: the same monitoring Sidecar or logging Sidecar can be shared by the whole company. This is the power of design patterns.


Sidecar: Application and Log Collection #

Next, let’s further elaborate on the Sidecar pattern. It has some other use cases.

For example, as mentioned earlier, application log collection: the business container writes its logs to a volume, and since the volume is shared within the Pod, the log container, i.e. the Sidecar container, can access that volume, read the log files directly, and then store them in remote storage or forward them elsewhere. This is essentially how widely used log components such as Fluentd work.
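
A minimal sketch of this arrangement; the business image is a placeholder, and the Fluentd container is shown only to illustrate the shape of the pattern (a real deployment would also need a Fluentd configuration telling it where to ship the logs):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  volumes:
  - name: app-logs
    emptyDir: {}
  containers:
  - name: app
    image: example.com/app:v1         # hypothetical business container, writes logs under /var/log/app
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
  - name: log-collector               # Sidecar: reads the shared volume and forwards the logs
    image: fluent/fluentd:v1.16-1     # image tag for illustration
    volumeMounts:
    - name: app-logs
      mountPath: /var/log/app
      readOnly: true
```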


Sidecar: Proxy Container #

The second use of Sidecar can be called the Proxy container. What is a Proxy container?

Suppose a Pod needs to access an external system or some external services, but those external services form a cluster. How can they be accessed in a unified, simple way, through a single address? One option is to modify the code, since the code would otherwise have to record the addresses of all the cluster members; a more decoupled option is to use a Sidecar proxy container.

In simple terms, write a small Proxy separately, which is used to handle the connection to the external service cluster. It only needs to expose one IP address externally. So next, the business container mainly accesses the Proxy, and then the Proxy connects to these service clusters. The key here is that multiple containers inside the Pod communicate directly through localhost because they belong to the same network namespace and have the same network view. So they communicate with each other using localhost without any performance loss.

Therefore, besides decoupling, the Proxy container does not reduce performance. What’s more, the code of such a Proxy container can be reused by the whole company.
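
A minimal sketch of the shape of this pattern; both images are placeholders, and the proxy is assumed to listen on localhost:9000 and to know the external cluster members through a hypothetical environment variable:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-proxy
spec:
  containers:
  - name: app
    image: example.com/app:v1              # hypothetical business container
    env:
    - name: BACKEND_URL
      value: "http://localhost:9000"       # the app only ever talks to the local proxy
  - name: proxy                            # Sidecar: owns the connection logic for the external cluster
    image: example.com/cluster-proxy:v1    # hypothetical proxy image
    env:
    - name: UPSTREAM_ENDPOINTS             # hypothetical config listing the external cluster members
      value: "10.0.0.11:6379,10.0.0.12:6379,10.0.0.13:6379"
    ports:
    - containerPort: 9000
```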


Sidecar: Adapter Container #

The third design pattern of Sidecar is the Adapter container. What is an Adapter container?

Suppose the API exposed by the business uses a format called A, but an external system that wants to access my business container only understands a format called B. One solution would be to change how the business container works, which means changing the business code; in reality, you can instead use an Adapter to do this conversion for you.


Here's an example: the monitoring interface exposed by the business container is /metrics, and accessing this URL returns the metrics. However, the monitoring system has been upgraded and now scrapes /health: it only recognizes a URL exposing a health check and does not understand /metrics. What can be done? Changing the code is one option, but you can instead write an additional Adapter that forwards every request for /health to /metrics. The Adapter exposes the /health URL that monitoring expects, and the business works again without modification.
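
A minimal sketch of this arrangement; the images are placeholders, and the adapter is assumed to listen on port 9090 and forward every /health request to http://localhost:8080/metrics:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-adapter
spec:
  containers:
  - name: app
    image: example.com/app:v1               # hypothetical business container, exposes /metrics on 8080
    ports:
    - containerPort: 8080
  - name: adapter                           # Sidecar: translates /health into localhost:8080/metrics
    image: example.com/metrics-adapter:v1   # hypothetical adapter image
    ports:
    - containerPort: 9090                   # the monitoring system scrapes /health on this port
```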

The key here is that the containers in the Pod communicate directly through localhost, so there is no performance loss, and such an Adapter container can be reused by the whole company. These are the benefits brought by design patterns.

Summary of this section #

  • Pod is the core mechanism for implementing “container design patterns” in the Kubernetes project.
  • “Container design patterns” are one of the best practices for managing large-scale container clusters like Google Borg, and they are also a fundamental dependency for orchestrating complex applications in Kubernetes.
  • The essence of all “design patterns” is decoupling and reusability.