
10 Application Storage and Persistent Data Volumes: Storage Snapshots and Topology Scheduling #

This article will mainly cover the following two aspects:

  1. The concept, usage, and working principle of storage snapshots;
  2. The background, concept, usage, and working principle of storage topology scheduling.

Basic Knowledge #

Background of Storage Snapshot #

When using storage, to improve the fault tolerance of data operations we usually need to snapshot online data and be able to restore it quickly. Storage snapshots also let us quickly copy and migrate online data, for example when duplicating an environment or doing data development. In Kubernetes, storage snapshot functionality is implemented by the CSI Snapshotter controller.

Storage Snapshot User Interface - Snapshot #

As we know, the PVC (PersistentVolumeClaim) and PV (PersistentVolume) design in Kubernetes simplifies the use of storage for users. Storage snapshots follow the same design principles. When users need a storage snapshot, they declare a VolumeSnapshot object and specify the corresponding VolumeSnapshotClass object. The relevant components in the cluster then dynamically create the storage snapshot and the corresponding VolumeSnapshotContent object. As shown in the comparison diagram below, the process of dynamically generating VolumeSnapshotContent is very similar to the process of dynamically generating a PV.

[Figure: dynamic creation of VolumeSnapshotContent compared with dynamic provisioning of a PV]

Storage Snapshot User Interface - Restore #

After obtaining storage snapshots, how do we quickly restore the snapshot data? As shown in the following diagram:

[Figure: restoring snapshot data by pointing a PVC's dataSource at a VolumeSnapshot]

As shown in the process above, we can set the dataSource field of a PVC object to a VolumeSnapshot object. When the PVC is submitted, the relevant components in the cluster find the snapshot data pointed to by dataSource, create the corresponding storage and PV object, and restore the snapshot data into the new PV. That is how a storage snapshot restore is used.

Topology - Meaning #

First, let’s understand what topology means here: a “location” relationship defined for the nodes managed by a Kubernetes cluster. In practice, a node’s labels indicate which topology domain the node belongs to.

Three types of topology are commonly encountered in practice:

  • The first is often encountered when using cloud storage services: the region. In Kubernetes it is usually marked with the label failure-domain.beta.kubernetes.io/region, which indicates which region a node belongs to when a single cluster spans multiple regions.
  • The second is the commonly used availability zone. In Kubernetes it is usually marked with the label failure-domain.beta.kubernetes.io/zone, which indicates which availability zone a node belongs to when a single cluster spans multiple zones.
  • The third is hostname, a single-machine granularity where the topology domain is the node itself. In Kubernetes it is marked with the label kubernetes.io/hostname. This will come up again later when we discuss Local PV.

The three topologies above are the most common, but topology can also be user-defined. You define a label key to represent a topology domain, and the different values of that key represent different topology positions within the domain.

For example, we can use rack, the rack dimension in a data center, as a topology domain. Machines on different racks are then marked with different topology positions: machines on rack1 get the node label rack=rack1, and machines on another rack get rack=rack2. In this way the location of nodes in K8s can be distinguished by rack.
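As a minimal sketch of what this looks like on a Node object (the node name and rack value are hypothetical), the topology domain is just an ordinary label:

```yaml
# Hypothetical Node excerpt: "rack" is a user-defined topology key and
# "rack1" is this node's position within that topology domain.
apiVersion: v1
kind: Node
metadata:
  name: node1
  labels:
    kubernetes.io/hostname: node1
    rack: rack1
```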

Next, let’s take a look at the use of topology in K8s storage.

Background of Storage Topology Scheduling #

In the previous lesson, we discussed how the PV and PVC system in K8s separates storage resources from compute resources. If a created PV restricts its “access location,” that is, it uses nodeAffinity to specify which nodes can access it, why does such a restriction exist?

This limitation exists because the creation process of pods and PVs in K8s can be considered parallel. Therefore, there is no guarantee that the node where the pod eventually runs can access the storage restricted by the PV, resulting in the pod not running properly. Let’s look at two classic examples:

First, let’s look at Local PV. Local PV encapsulates the local storage on a node as a PV and accesses that local storage through the PV. Why is Local PV necessary? When the PV and PVC system was first used, it mainly targeted distributed storage, which depends on the network. If a service needs higher I/O performance than network-attached distributed storage can provide, local storage removes the network overhead and delivers better performance. However, local storage also has drawbacks: distributed storage achieves high availability through multiple replicas, while with local storage the application itself has to provide high availability, for example with a protocol such as Raft.

Now, let’s consider the problem that may arise if we don’t impose “access location” restrictions on PV in the Local PV scenario:

When a user submits a PVC, the K8s PV controller may bind it to the PV on node2. However, the pod that uses this PVC may be scheduled to node1, and the pod then cannot run properly because the storage it needs can only be accessed from node2.

The second scenario where not imposing “access location” restrictions on PV could cause problems:

If a K8s cluster spans multiple availability zones in a single region, a dynamically provisioned volume may be created in availability zone 2, while the pod that uses it may be scheduled to availability zone 1, so the pod cannot use the storage. Block storage services such as Alibaba Cloud’s cannot be attached across availability zones, which makes this another common problem scenario.

Now let’s see how K8s solves these problems through storage topology scheduling.

Storage Topology Scheduling #

To summarize the two problems above: when binding a PV to a PVC or dynamically creating a PV, the system does not yet know which node the pod that uses the PV will run on, yet the PV itself places topological restrictions on that node. For Local PV, the pod must be scheduled to the specified node to use the PV; in the second scenario, the pod must be scheduled to a node in the same availability zone as the PV to use the Alibaba Cloud block storage. So how does K8s solve this?

Simply put, in K8s, the binding operation between PV and PVC and the dynamic creation of PVs are delayed until after the pod scheduling results are obtained. What are the benefits of doing this?

  • Firstly, for pre-provisioned PVs such as Local PV, the PVC is not bound to a PV immediately. During scheduling, the scheduler can therefore choose a node that both meets the pod’s compute resource requirements (such as CPU and memory) and satisfies the nodeAffinity of a PV that the pod’s PVC can bind to.
  • Secondly, for dynamically provisioned PVs, the PV is created based on the pod’s node topology once the scheduling result is known, which guarantees that the new PV’s topology is consistent with the node where the pod will run. In the Alibaba Cloud block storage example above, if we know the pod will run in availability zone 1, we can create the disk in availability zone 1.

To implement the delayed binding and creation of PVs as mentioned above, there are three components in K8s that require modifications:

  • The PV controller, which needs to support the delayed binding operation.
  • The component responsible for dynamically creating PVs (for example, the csi-provisioner), which needs to create the PV based on the node’s topology after the pod’s scheduling result is known.
  • The third and most important component is the kube-scheduler. When selecting a node for a pod, it must consider not only the pod’s compute resource requirements but also its storage requirements: for a PVC that matches an existing PV, it checks whether the candidate node satisfies that PV’s nodeAffinity; for a PVC that will be dynamically provisioned, it checks whether the node satisfies the topology restrictions declared in the StorageClass. This ensures that the node finally chosen by the scheduler satisfies the storage’s own topological restrictions.

This is the relevant knowledge of storage topology scheduling in K8s.

Use Case Interpretation #

Next, let’s interpret the basic knowledge in the first section using YAML examples.

Volume Snapshot/Restore Example #

[Figure: VolumeSnapshotClass, VolumeSnapshot, and snapshot-restore PVC YAML examples]

Let’s take a look at how volume snapshots are used. First, the cluster administrator needs to create a VolumeSnapshotClass object in the cluster. An important field in VolumeSnapshotClass is the snapshotter (called driver in newer snapshot API versions), which specifies the volume plugin used to create the actual storage snapshot. This volume plugin needs to be deployed in advance; we will come back to it later.

To create a storage snapshot, the user needs to declare a VolumeSnapshot object. The VolumeSnapshotClass needs to specify the VolumeSnapshotClassName. Another important field it needs to specify is “source,” which determines the data source for the snapshot. In this case, the name “disk-pvc” is specified, indicating that the storage snapshot is created through this PVC object. After submitting this VolumeSnapshot object, the related components in the cluster will find the PV storage corresponding to this PVC and take a snapshot of it.
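A minimal sketch of the two objects described above. The API version shown is the newer snapshot.storage.k8s.io/v1 (older clusters use v1alpha1/v1beta1, where the driver field is named snapshotter); the class name and CSI driver name are placeholders, while disk-pvc is the PVC from the example:

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: disk-snapshotclass            # hypothetical class name
driver: diskplugin.csi.example.com    # placeholder for the CSI plugin that takes the snapshot
deletionPolicy: Delete
---
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: disk-snapshot
spec:
  volumeSnapshotClassName: disk-snapshotclass
  source:
    persistentVolumeClaimName: disk-pvc   # data source: snapshot the volume behind this PVC
```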

Once you have a storage snapshot, how do you restore data from it? It’s quite simple: declare a new PVC object and set its spec.dataSource field to the VolumeSnapshot, here the object named disk-snapshot. When this PVC is submitted, the relevant components in the cluster dynamically create a new PV whose data comes from the storage snapshot taken earlier.
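A sketch of the restore PVC, assuming the csi-disk StorageClass used later in the demo; the PVC name and size are placeholders, and the requested size must be at least that of the snapshotted volume:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc                  # hypothetical name for the new PVC
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi                  # at least as large as the original volume
  storageClassName: csi-disk         # assumed dynamic-provisioning StorageClass
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: disk-snapshot              # the VolumeSnapshot created above
```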

Local PV Example #

Let’s take a look at an example of Local PV in the following YAML:

[Figure: Local PV and no-provisioner StorageClass YAML example]

When using Local PV, most of the time it is created statically: you declare a PV object first. Since a Local PV can only be accessed locally, you must restrict it to a single node by specifying nodeAffinity in the PV object. The topology key shown in the diagram above is kubernetes.io/hostname, meaning the PV can only be accessed on node1, so a pod that wants to use this PV must be scheduled to node1.

Since we are creating the PV statically, why do we still need a storageClassName? As mentioned earlier, Local PV needs delayed binding to work properly. With delayed binding, even if a matching PV already exists in the cluster when the PVC is submitted, the PV controller must not bind it immediately, so we need a way to tell the PV controller when it must not bind right away. That is what the storageClass is for here. In the storageClass, the provisioner is set to kubernetes.io/no-provisioner, which tells Kubernetes that no PV will be created dynamically. The key field is volumeBindingMode, which is set to WaitForFirstConsumer and can simply be understood as delayed binding.
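A sketch of the StorageClass and Local PV just described; the capacity and local path are placeholders, and node1 is the node from the example:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # no dynamic provisioning
volumeBindingMode: WaitForFirstConsumer     # delay binding until a pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
spec:
  capacity:
    storage: 10Gi                           # placeholder size
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1                   # placeholder local path on node1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node1                     # this PV can only be accessed from node1
```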

When a user submits a PVC, the PV controller sees it and finds the corresponding storageClass. If the volumeBindingMode is delayed binding, it does nothing for now.

Later, when a pod that uses this PVC is scheduled to a node that satisfies the PV’s nodeAffinity, the pod’s PVC is bound to the PV. This guarantees that the PVC is bound to the PV only once the pod has landed on that node, so the created pod can ultimately access the Local PV. This is how the topological constraints of a PV are met in the static provisioning scenario.
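For completeness, a sketch of a PVC and pod that consume the Local PV above (names and image are placeholders). The PVC stays Pending until the pod is scheduled, and the scheduler’s node choice then drives the binding:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: local-storage    # the delayed-binding StorageClass above
---
apiVersion: v1
kind: Pod
metadata:
  name: local-pv-pod
spec:
  containers:
    - name: app
      image: nginx                   # placeholder workload image
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: local-pvc         # once bound, the pod runs on node1 and mounts the local disk
```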

Limiting Topology for Dynamic Provisioning PV Example #

Let’s take a look at how we limit the topology for dynamic provisioning PV:

[Figure: StorageClass with allowedTopologies for dynamically provisioned PVs]

In dynamic provisioning, we have topology constraints when creating PVs. How do we specify them?

First, in the storageClass we set volumeBindingMode to WaitForFirstConsumer, meaning delayed binding.

Secondly, a very important field is allowedTopologies, which is where the limitation is set. In the diagram above, the topology constraint is at the availability-zone level, and it actually carries two meanings:

  1. First, a dynamically created PV must be accessible within this availability zone;
  2. Second, because binding is delayed, when the scheduler sees that a pod’s PVC uses this storageClass, it must select a node located in this availability zone when scheduling the pod.

In summary, we must ensure both that the dynamically created storage is accessible in this availability zone and that the scheduler selects a node within it, so that the storage and the node running the pod that uses it end up in the same topology domain. Users write the PVC exactly as before; the main difference is that the topological constraints are set in the storageClass, as in the sketch below.
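A sketch of such a StorageClass, using the zone label from earlier and the cn-hangzhou-d zone from the demo later in this article; the class name and CSI provisioner are placeholders:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-disk-topo                         # hypothetical name
provisioner: diskplugin.csi.example.com       # placeholder cloud block-storage CSI driver
volumeBindingMode: WaitForFirstConsumer       # delay provisioning until the pod is scheduled
allowedTopologies:
  - matchLabelExpressions:
      - key: failure-domain.beta.kubernetes.io/zone
        values:
          - cn-hangzhou-d                     # PVs may only be created and attached in this zone
```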

Operation Demo #

In this section, we will demonstrate the content explained earlier in an online environment.

First, let’s take a look at the K8s cluster I set up on my Alibaba Cloud servers. It has 3 nodes in total: one master and two workers, and the master does not schedule pods.

[Screenshot: the cluster’s node list]

Now let’s see the plugins I have already deployed: one is the snapshot plugin (csi-external-snapshot), and the other is the dynamic cloud-disk plugin (csi-disk).

[Screenshot: the deployed CSI plugin pods]

Now let’s start the snapshot demonstration. Before taking a snapshot, we first need a dynamic cloud disk: create a storage class, have a PVC dynamically provision a PV from it, and then create a pod that uses the PVC. The objects involved are sketched below.
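The demo’s exact manifests are not reproduced here; this is a minimal sketch of the three objects involved, assuming a StorageClass named csi-disk backed by a placeholder CSI driver, and reusing the disk-pvc name from the earlier snapshot example:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-disk
provisioner: diskplugin.csi.example.com   # placeholder cloud-disk CSI driver
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: disk-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: csi-disk
---
apiVersion: v1
kind: Pod
metadata:
  name: disk-pod
spec:
  containers:
    - name: app
      image: nginx                        # placeholder workload image
      volumeMounts:
        - name: disk
          mountPath: /data
  volumes:
    - name: disk
      persistentVolumeClaim:
        claimName: disk-pvc
```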

[Screenshot: the StorageClass, PVC, PV, and pod for the dynamic cloud disk]

With the above objects in place, we can now take a snapshot. First, let’s look at the first configuration file needed for the snapshot, snapshotclass.yaml.

[Screenshot: snapshotclass.yaml]

It basically specifies the plugin used when creating storage snapshots, which is the csi-external-snapshot-0 plugin deployed just now.

[Screenshot: the contents of snapshotclass.yaml]

Next, create the VolumeSnapshotClass from this file; once it exists, we can start taking snapshots.

[Screenshot: creating the VolumeSnapshotClass]

Then let’s look at snapshot.yaml. The VolumeSnapshot declaration creates a storage snapshot, and the PVC specified earlier is its data source. Let’s create it.

[Screenshot: snapshot.yaml and creating the VolumeSnapshot]

Let’s see if the snapshot has been created. As shown in the figure below, the content was created 11 seconds ago.

[Screenshot: the newly created VolumeSnapshot and VolumeSnapshotContent]

We can look at the contents of the snapshot. The important information is recorded in the VolumeSnapshotContent: after the snapshot is taken, it records the snapshot ID returned by the cloud storage provider, as well as the snapshot’s data source, which is the PVC specified just now and through which the corresponding PV can be found. A sketch of what such an object looks like follows.
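Roughly what such a VolumeSnapshotContent looks like; the field values are placeholders and the field names follow the newer snapshot.storage.k8s.io/v1 API (older API versions differ slightly):

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: snapcontent-example            # placeholder name
spec:
  driver: diskplugin.csi.example.com   # placeholder CSI driver
  deletionPolicy: Delete
  source:
    volumeHandle: d-example            # placeholder ID of the cloud disk behind the snapshotted PV
  volumeSnapshotRef:
    name: disk-snapshot                # the VolumeSnapshot this content is bound to
    namespace: default
status:
  snapshotHandle: s-example            # placeholder snapshot ID returned by the cloud provider
  readyToUse: true
```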

[Screenshot: the VolumeSnapshotContent details]

That’s about it for the snapshot demonstration. To delete the snapshot we just created, we delete the VolumeSnapshot object, and we can see that the dynamically created VolumeSnapshotContent is deleted along with it.

[Screenshot: deleting the VolumeSnapshot and its VolumeSnapshotContent]

Next, let’s look at dynamic PV provisioning with topological constraints. First, create the storage class and look at the constraints it declares: its volumeBindingMode is WaitForFirstConsumer, meaning delayed binding, and in the allowedTopologies field I specified a zone-level constraint.

[Screenshot: the StorageClass with delayed binding and a zone constraint]

Let’s try creating a PVC. After the PVC is created, it should in theory be in the Pending state. And indeed it is: because binding is delayed and no pod is using it yet, it can neither be bound nor trigger dynamic creation of a new PV.

[Screenshot: the PVC in Pending state]

Next, let’s create a pod that uses this PVC and see what happens. Looking at the pod, it is also in the Pending state.

[Screenshot: the pod in Pending state]

Why is the pod Pending? Let’s take a look: it failed to schedule. The reason is that one node, the master, is unschedulable, and the other two nodes have no PVs available to bind.

[Screenshot: the pod’s scheduling failure events]

Why do two nodes report “no PVs available to bind” when the PV is supposed to be created dynamically?

Let’s take a closer look at the topological constraints in the storage class. From the explanation above, it requires that PVs created with this storage class be accessible in the cn-hangzhou-d zone, and that pods using this storage be scheduled to nodes in cn-hangzhou-d.

[Screenshot: the allowedTopologies constraint in the StorageClass]

Now let’s see whether the nodes have this topological information; if they don’t, it naturally won’t work.

Take a look at the full information of the first node, mainly its labels. There is a key in the labels that matches the topology key required by the storage class, so the topology exists, but its value is cn-hangzhou-b, while the storage class requires cn-hangzhou-d. Roughly, the relevant labels look like the sketch below.
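Sketched from the description above, the relevant part of the node’s labels (the node name is a placeholder):

```yaml
apiVersion: v1
kind: Node
metadata:
  name: worker-node-1                                       # placeholder node name
  labels:
    failure-domain.beta.kubernetes.io/region: cn-hangzhou
    failure-domain.beta.kubernetes.io/zone: cn-hangzhou-b   # the node's zone, while the StorageClass requires cn-hangzhou-d
```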

[Screenshot: the first node’s labels]

Now let’s look at the other node. Its topology label is also cn-hangzhou-b, while the storage class requires cn-hangzhou-d.

[Screenshot: the second node’s labels]

This makes it impossible to schedule the pod on these two nodes. Now let’s modify the topological constraints in the storage class and change cn-hangzhou-d to cn-hangzhou-b.

[Screenshot: editing the StorageClass topology constraint]

After the modification, dynamically created PVs must be accessible in the cn-hangzhou-b availability zone, and pods using this storage must be scheduled to nodes in that zone. Delete the previous pod so that it can be rescheduled, and see what happens. Good, it has now been scheduled successfully and is at the stage of starting its container.

[Screenshot: the rescheduled pod starting its container]

This means that after changing the topological constraint in the storage class from cn-hangzhou-d to cn-hangzhou-b, there are now two nodes in the cluster whose topology matches what the storage class requires, so the pod can be scheduled to a node. In the last line of the figure above, the pod is already Running, which shows that the modified topological constraint works.

Processing Flow #

Handling of Volume Snapshot/Restore in Kubernetes #

Next, let’s take a look at the specific processing flow of storage snapshot and topology scheduling in K8s. As shown in the following diagram:

[Figure: CSI-based volume snapshot and restore processing flow]

First, let’s take a look at the processing flow of storage snapshots, starting with the CSI part: the recommended way to extend storage functionality in K8s is the out-of-tree approach based on CSI (Container Storage Interface).

The implementation of CSI storage extension mainly consists of two parts:

  • The first part is the common controller part implemented by the K8s community, which includes the csi-snapshotter controller and the csi-provisioner controller;
  • The other part is the vendor-specific CSI plugin, or storage driver, implemented by each cloud storage vendor on top of its own OpenAPI.

The two parts are connected through Unix domain sockets, and together they form the complete storage extension functionality.

As shown in the diagram above, when a user submits a VolumeSnapshot object, it is watched by the csi-snapshotter controller, which then makes a gRPC call to the csi-plugin; the csi-plugin in turn calls the cloud vendor’s OpenAPI to actually take the storage snapshot. After the snapshot has been created, the result is returned to the csi-snapshotter controller, which stores the snapshot’s information in a VolumeSnapshotContent object and binds it to the submitted VolumeSnapshot, much like the binding between a PV and a PVC.

Once the storage snapshot exists, how do we restore data from it? As mentioned earlier, we declare a new PVC object and set its dataSource to the Snapshot object. When the PVC is submitted, it is watched by the csi-provisioner, which then creates the storage via gRPC. Compared with the ordinary dynamic provisioning flow described before, there is one difference: a snapshot ID is specified, so when the cloud vendor creates the storage it performs an extra step of restoring the earlier snapshot data into the newly created storage. The flow then returns to the csi-provisioner, which writes the new storage’s information into a new PV object. The new PV object is watched by the PV controller, which binds the submitted PVC to it; after that, pods can use the restored data through the PVC. This is the processing flow of storage snapshots in K8s.

Handling of Volume Topology-aware Scheduling in Kubernetes #

Next, let’s take a look at the processing flow of storage topology-aware scheduling:

avatar

The first step is to declare delayed binding, which is done through the StorageClass. It has been explained above, so I won’t go into details here.

Next, let’s talk about the scheduler. In the above diagram, the red part represents the new storage topology-aware scheduling logic added to the scheduler. First, let’s take a look at the general flow of selecting a node for a pod without considering the storage (without the red part):

  • After a user submits a pod, it is watched by the scheduler, which first performs filtering (predicates), matching the resources the pod needs against all nodes in the cluster;
  • Every node that matches is usable, so this step usually produces a batch of candidate nodes;
  • In the second phase the scheduler performs scoring (priorities), ranking these candidates to find the most suitable node;
  • The scheduler then writes the scheduling result into the pod’s spec.nodeName field, which is watched by the kubelet on that node, and the process of creating the pod begins.

Now let’s take a look at how node filtering (the second step in the diagram) is done when volumes are taken into account:

  • First, find all the PVCs used by the pod, both the already-bound ones and those waiting for delayed binding;
  • For bound PVCs, check whether the nodeAffinity of the corresponding PV matches the topology of the current node. If it doesn’t, the node cannot be scheduled; if it does, continue with the PVCs that are waiting for delayed binding;
  • For PVCs waiting for delayed binding, first list the existing PVs in the cluster that could satisfy them and check each against the current node’s topology labels. If none match, the existing PVs cannot be used, so check whether the current node satisfies the topology restrictions for dynamically creating a PV, i.e. whether the topology restrictions declared in the StorageClass match the labels already present on the node. If they match, the node can be used; if not, it cannot be scheduled.

After going through these steps, we have found all the nodes that satisfy the resource requirements of the pod as well as the storage requirements of the pod.

Once the node is selected, the third step is an internal optimization in the scheduler. Briefly, after filtering and scoring it updates the chosen node information for the pod as well as the scheduler’s cached information for the PVs and PVCs involved.

The fourth step is also an important step. Once the node has been selected for the pod, regardless of whether the PVC it uses needs to bind to an existing PV or needs to dynamically create a new PV, the scheduler can start the process. It triggers the update of the PVC and PV objects’ information, and then triggers the PV controller to perform the binding operation, or the csi-provisioner to perform the dynamic creation process.

Summary #

  1. Using an analogy with the PVC and PV system, introduced the Kubernetes resource objects for storage snapshots and how to use them.
  2. Illustrated, with two real-world scenarios, why storage topology scheduling is necessary and how Kubernetes solves these problems through topology-aware scheduling.
  3. Analyzed how storage snapshots and storage topology scheduling work inside Kubernetes to gain a deeper understanding of these features.