24 PersistentVolume: How to Solve the Problem of Data Persistence #

Hello, I am Chrono.

After studying the “Beginner’s Guide” and the “Intermediate Guide,” I believe you have a comprehensive understanding of Kubernetes. Now, in the “Advanced Guide,” we will delve deeper into Kubernetes and explore more advanced knowledge and application techniques.

Let’s start with PersistentVolume today.

As early as [Lesson 14], when we introduced ConfigMap/Secret, we came across Kubernetes’ concept of Volume storage. It uses the volumes and volumeMounts fields to mount a “virtual disk” into the Pod, injecting configuration information into the Pod as files for its processes to use.

However, at that time, Volumes could only store a small amount of data and were far from being a true “virtual disk.”

Today, let’s explore advanced usage of Volumes together and learn about the API objects that Kubernetes uses to manage storage resources - PersistentVolume, PersistentVolumeClaim, and StorageClass. We will also learn how to create actual usable storage volumes using local disks.

What is PersistentVolume #

In the just-completed “Intermediate Guide” ([Lesson 22]), we built a WordPress website in a Kubernetes cluster. However, there was a serious problem: the Pods had no persistent storage, so MariaDB could not store its data permanently.

Because the containers in a Pod are created from images, and image files are read-only, processes can only read and write in temporary storage space. Once the Pod is destroyed, that temporary storage is immediately reclaimed and released, and the data is lost with it.

In order to ensure that the data still exists even after the Pod is destroyed and rebuilt, we need to find a solution to allow the Pod to use a real “virtual disk”. What should we do?

In fact, Kubernetes’ Volume already provides a good abstraction for data storage. It simply defines a “storage volume,” and we are free to decide the volume’s type, capacity, and how it is stored. The Pod doesn’t need to care about these specialized, complex details: as long as volumeMounts is set properly, the Volume can be mounted into the container for use.

Therefore, Kubernetes extends the concept of Volume and introduces the PersistentVolume object, which is specifically used to represent persistent storage devices. However, it hides the underlying storage implementation, and all we need to know is that it can safely and reliably store data. (Because the term PersistentVolume is long, it is generally abbreviated as PV.)

So, where do the PVs in the cluster come from?

Although a PV is an abstraction of storage, behind it are real storage devices and file systems, such as Ceph, GlusterFS, NFS, or even local disks. Managing these is beyond Kubernetes’ abilities, so they are usually maintained by system administrators, who then create the corresponding PVs in Kubernetes.

It is important to note that a PV is a system resource of the cluster, an object at the same level as a Node. Pods have no control over PVs; they only have the right to use them.

What is PersistentVolumeClaim/StorageClass #

Now that we have PV, can we mount and use it directly in a Pod?

Not yet, because the differences between storage devices are too great: some are fast, some are slow; some can be shared for reading and writing, while others must be mounted exclusively; some hold only a few hundred MB, while others scale to TB or even PB…

With so many types of storage devices, it is a bit too limiting to manage them with just one PV object, and it does not adhere to the principle of “single responsibility”. Allowing Pods to directly select PV is not flexible either. So Kubernetes introduces two new objects, PersistentVolumeClaim and StorageClass, using the “intermediary layer” concept to further refine the process of allocating and managing storage volumes.

Let’s take a look at these two new objects.

PersistentVolumeClaim, abbreviated as PVC, is easy to understand from its name. It is used to request storage resources from Kubernetes. PVC is an object used by Pods, and it acts as a proxy for the Pods to request PV from the system. Once the resource request is successful, Kubernetes will associate the PV and PVC together. This action is called “binding”.

However, there are a lot of storage resources in the system, and it would be very cumbersome for PVC to traverse and find the appropriate PV directly. Therefore, we need to use StorageClass.

StorageClass functions similarly to the IngressClass mentioned in [Lesson 21]. It abstracts specific types of storage systems (such as Ceph or NFS) and serves as a “coordinator” between PVC and PV, helping a PVC find the right PV. In other words, it simplifies the process of Pods mounting “virtual disks” and hides the implementation details of PVs from Pods.
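In this lesson we will manage PVs by hand and use storageClassName only as a matching label, but a StorageClass is itself an API object. For reference, a minimal sketch for statically provisioned volumes might look like this (kubernetes.io/no-provisioner is the standard marker for "no dynamic provisioning"):

```yaml
# A minimal StorageClass sketch for statically provisioned volumes.
# "no-provisioner" tells Kubernetes not to create PVs dynamically;
# an administrator must create matching PVs by hand.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: host-test
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
```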

If this still feels a bit hard to grasp, don’t worry; let’s use an analogy from daily life to clarify. After all, compared to CPU and memory, which we know well, our understanding of storage systems is relatively limited, so the three new Kubernetes concepts, PV, PVC, and StorageClass, may not come easily.

To illustrate with an example, let’s say you want to print 10 pages of documents in your office, so you call the reception desk to explain your needs.

  • The action of “making a phone call” is like PVC, requesting storage resources from Kubernetes.
  • The reception desk has various brands of office paper, with different sizes and specifications, which is like StorageClass.
  • The reception desk selects a brand according to your needs and takes out a package of A4 paper from the inventory. It may contain more than 10 pages, but it can meet your requirements. They add a new record to the register, indicating that you requested office supplies on a certain day. This process represents the binding between PVC and PV.
  • And the package of A4 paper that you receive in the end is like the PV storage object.

Alright, now that we have a general understanding of these API objects, we can combine them with YAML descriptions and actual operations to gradually gain a deeper understanding.

How to Use YAML to Describe PersistentVolume #

There are many types of PersistentVolumes (PVs) in Kubernetes. Let’s start with the simplest type of local storage, “HostPath”. It is similar to the -v parameter used to mount local directories in Docker and can be used to gain a basic understanding of PV usage.

Because Pods can run on any node in the cluster, as a system administrator, we need to create a directory on each node to be mounted as a local storage volume in the Pod.

For convenience, let’s create a directory named “host-10m-pv” in /tmp, representing a storage device with a capacity of only 10MB.
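For example, on each node the directory can be created like this (assuming shell access to the node):

```shell
# Create the directory that will back the 10MB PV.
# -p makes the command idempotent: no error if it already exists.
mkdir -p /tmp/host-10m-pv
```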

Once we have the storage, we can use YAML to describe the PV object.

Unfortunately, kubectl has no imperative subcommand for creating a PV (the way kubectl create deployment does for Deployments). You can only use kubectl api-resources and kubectl explain to view the field descriptions of PV and write the PV’s YAML description file by hand.

Here is a YAML example that you can use as a template to edit your own PV:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: host-10m-pv

spec:
  storageClassName: host-test
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Mi
  hostPath:
    path: /tmp/host-10m-pv/

The header of the PV object is simple, following the standard format for API objects, so I won’t go into further detail. Let’s focus on the spec section: every field there is important, describing the details of the storage.

The “storageClassName” is an abstraction of the storage type, as mentioned earlier. This PV is manually managed, and the name can be freely chosen. I have named it “host-test” in this example, but you can change it to words like “manual” or “hand-work”.

The “accessModes” define the access mode of the storage device, which is similar to the file access mode in Linux. Currently, Kubernetes has three modes:

  • ReadWriteOnce: The storage volume can be read and written, but can only be mounted by a Pod on one node.
  • ReadOnlyMany: The storage volume is read-only and can be mounted by multiple Pods on any node.
  • ReadWriteMany: The storage volume can be read and written, and can be mounted by multiple Pods on any node.

Please note that these access modes are applied to nodes, not Pods, as storage is a system-level concept and not part of the Pod process.

Clearly, a local directory can only be used by the local machine, so this PV uses “ReadWriteOnce”.

The third field, “capacity”, is self-explanatory, representing the capacity of the storage device. Here, I set it to 10MB.

I would like to remind you again to be careful here. Kubernetes defines storage capacity using international standard units, in which K/M/G are powers of 1000. The base-1024 units we use in daily life must therefore be written as Ki/Mi/Gi. Don’t mix up the units, or the actual capacity will not be what you expect.
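The difference is easy to see with a quick shell calculation: 10Mi is 10 × 1024², while a plain 10M would mean 10 × 1000².

```shell
# Binary (Ki/Mi/Gi) vs. decimal (K/M/G) storage units:
echo $((10 * 1024 * 1024))   # 10Mi = 10485760 bytes
echo $((10 * 1000 * 1000))   # 10M  = 10000000 bytes
```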

The last field, “hostPath”, is the simplest one. It specifies the local path of the storage volume, which is the directory we created on the node.

With these fields, we have described the type, access mode, capacity, and storage location of the PV, and the storage device is now created.

How to Use YAML to Describe a PersistentVolumeClaim #

Once you have a PV, it means that the cluster has a persistent storage available for Pods to use. Now, we need to define a PVC object to request storage from Kubernetes.

The following YAML is an example of a PVC that requests a 5MB storage device with an access mode of ReadWriteOnce:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: host-5m-pvc

spec:
  storageClassName: host-test
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Mi

The content of a PVC is similar to a PV, but it does not represent actual storage. Instead, it is a “request” or “declaration” of storage. The fields in the spec describe the “desired state” of the storage.

So the storageClassName and accessModes fields in the PVC mean the same as in the PV. However, instead of a capacity field, you use resources.requests to state the desired capacity.

In this way, Kubernetes will search for a PV that matches the StorageClass and capacity specified in the PVC, then bind the PV and PVC together to allocate the storage, much like phoning the reception desk to request A4 paper in the earlier analogy.

How to Use PersistentVolume in Kubernetes #

Now that we have prepared the PV and PVC, we can enable persistent storage for the Pod.

First, we need to create the PV object using kubectl apply:

kubectl apply -f host-path-pv.yml

Then, use kubectl get to check its status:

kubectl get pv

Image

From the screenshot, we can see that this PV has a capacity of 10MB, an access mode of RWO (ReadWriteOnce), and a StorageClass named host-test. The status is shown as Available, which means it is ready to be assigned to a Pod.

Next, we create the PVC to request storage resources:

kubectl apply -f host-path-pvc.yml
kubectl get pvc

Image

Once the PVC object is created successfully, Kubernetes immediately searches for a suitable PV in the cluster based on the StorageClass, resources, and other conditions. If a matching storage object is found, it will bind the PVC and PV together.

The PVC requests 5MB, but there is only one PV with a capacity of 10MB in the system. Since there is no better match, Kubernetes allocates this PV. The excess capacity can be considered as a “bonus”.

You will see that the status of both objects is Bound, indicating a successful storage allocation. The actual capacity of the PVC is 10MB, the same as the capacity of the PV, not the originally requested 5MB.

So, what happens if we increase the requested capacity of the PVC? For example, if we change it to 100MB:

Image

You will see that the PVC remains in the Pending state, which means Kubernetes cannot find a suitable storage in the system to allocate resources. It can only wait for a PV that meets the requirements to complete the binding.
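For reference, the over-sized request is just the original PVC file with the storage value raised; only the last line changes:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: host-5m-pvc

spec:
  storageClassName: host-test
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Mi   # no PV in the cluster is this large, so the PVC stays Pending
```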

How to Mount PersistentVolume for a Pod #

After binding a PV and PVC, we have persistent storage, and we can now mount the storage volume into a Pod. The usage is similar to [Lesson 14]: first define the storage volume in spec.volumes, then mount it into the container with containers.volumeMounts.

However, since we are using PVC, we need to specify the name of the PVC in the persistentVolumeClaim field within volumes.

The following is the YAML description file for the Pod, where the storage volume is mounted to the /tmp directory of the Nginx container:

apiVersion: v1
kind: Pod
metadata:
  name: host-pvc-pod

spec:
  volumes:
  - name: host-pvc-vol
    persistentVolumeClaim:
      claimName: host-5m-pvc

  containers:
  - name: ngx-pvc-pod
    image: nginx:alpine
    ports:
    - containerPort: 80
    volumeMounts:
    - name: host-pvc-vol
      mountPath: /tmp

I have created a diagram to illustrate the relationship between the Pod and the PVC/PV (accessModes field is omitted). From the diagram, you can see how they are connected:

image

Now let’s create this Pod and check its status:

kubectl apply -f host-path-pod.yml
kubectl get pod -o wide

image

The Pod has been scheduled to a worker node by Kubernetes. Did the PV mount successfully? Let’s enter the container using kubectl exec and execute some commands:

image

A file named a.txt is created in the /tmp directory of the container. According to the definition of the PV, it should be stored on the disk of the worker node. Let’s log in to the worker node and check:

image

You will see that there is indeed a file named a.txt in the local directory of the worker node. By comparing the timestamps, we can confirm that it is the file generated in the Pod just now.

Since the data generated by the Pod is stored on the disk through the PV, if the Pod is deleted and recreated, the storage volume will still use the same directory, and the data will remain unchanged, achieving persistent storage.

However, there is a catch. Because this PV is of the HostPath type, the data lives only on the current node. If the Pod is rescheduled to a different node when it is rebuilt, it will mount that node’s local directory, which is not the same storage as before, and persistence breaks.

Therefore, HostPath type PVs are generally used for testing or for applications closely related to nodes, such as DaemonSets. We will discuss how to achieve true arbitrary data persistence in the next lecture.

Summary #

Alright, today we learned about the solutions for persistent storage in Kubernetes. There are three API objects: PersistentVolume, PersistentVolumeClaim, and StorageClass. They manage storage resources in the cluster, which can be thought of as disks. Pods need to use these objects to achieve data persistence.

Let’s summarize today’s main content:

  1. PersistentVolume, also known as PV, is an abstraction of storage devices in Kubernetes. It is maintained by system administrators and needs to describe important information such as the type, access mode, and capacity of the storage device.
  2. PersistentVolumeClaim, also known as PVC, represents the request from a Pod to the system for storage resources. It declares the requirements for storage, and Kubernetes will find the most suitable PV and bind them together.
  3. StorageClass abstracts specific types of storage systems and groups PV objects. It simplifies the binding process between PVs and PVCs.
  4. HostPath is the simplest type of PV, where data is stored locally on the node. It provides fast access but cannot be migrated with Pods.

Homework #

Finally, it’s homework time. I have two questions for you to think about:

  1. HostPath type of PV requires the corresponding directory to exist on the node. What will happen if this directory does not exist (e.g., if it was forgotten to be created)?
  2. What are your thoughts on the process of using PV/PVC/StorageClass objects for storage allocation? Do you think their abstraction is good or bad?

To master this field, you need to be self-driven. In this advanced part, I am really looking forward to hearing your thoughts. See you in the next lesson.
