20 Analysis of Stateful Applications’ Default Characteristics Implementation #

Traditionally, applications running on Kubernetes are stateless, meaning none of their data is persisted: when an instance dies, its state is lost with it. Nginx used as a reverse proxy is a typical example; if one instance dies, a new instance can be launched and immediately continue serving incoming requests. However, if your application carries business logic, it usually needs to store data locally, and that data must survive instance restarts. So, what are the key features to remember when running stateful applications on Kubernetes? Let’s explore them step by step.

StatefulSet Object #

When you deploy application container instances with a Deployment object, you will notice that the Pod names carry a random string suffix. This is fine for stateless applications, where instances are interchangeable. Real-world distributed systems, however, need a stable, unique identity for each member, which random suffixes cannot provide. This is where the StatefulSet object comes into play: it gives each Pod a stable hostname with a sequential ordinal index, as the following example shows:

kubectl get pods -l app=nginx
NAME      READY     STATUS    RESTARTS   AGE
web-0     1/1       Running   0          1m
web-1     1/1       Running   0          1m
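
For reference, this naming pattern comes from a StatefulSet paired with a headless Service. The following is a minimal sketch modeled on the standard Kubernetes tutorial; the names (nginx, web) and the image are illustrative:

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  clusterIP: None   # headless Service, required for stable per-Pod DNS names
  ports:
  - port: 80
    name: web
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"   # must match the headless Service above
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: registry.k8s.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web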

If you delete all Pods in the terminal, the StatefulSet will automatically restart them:

kubectl delete pod -l app=nginx
pod "web-0" deleted
pod "web-1" deleted

kubectl get pod -w -l app=nginx
NAME      READY     STATUS              RESTARTS   AGE
web-0     0/1       ContainerCreating   0          0s
NAME      READY     STATUS    RESTARTS   AGE
web-0     1/1       Running   0          2s
web-1     0/1       Pending   0         0s
web-1     0/1       Pending   0         0s
web-1     0/1       ContainerCreating   0         0s
web-1     1/1       Running   0         34s

Using kubectl exec and kubectl run, you can view the Pod’s hostname and DNS entries within the cluster:

for i in 0 1; do kubectl exec web-$i -- sh -c 'hostname'; done
web-0
web-1

kubectl run -i --tty --image busybox:1.28 dns-test --restart=Never --rm /bin/sh
nslookup web-0.nginx
Server:    10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local

Name:      web-0.nginx
Address 1: 10.244.1.7

nslookup web-1.nginx
Server:    10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local

Name:      web-1.nginx
Address 1: 10.244.2.8

The Pod’s ordinal index, hostname, SRV records, and A record names remain unchanged, but the associated IP addresses have changed. This shows that even for stateful container instances, Pod IP addresses are variable. Although certain scenarios may appear to call for fixed Pod IPs, relying on them is not the recommended design approach for cloud-native architectures, which is why Kubernetes does not offer this as a default feature. Supporting it requires extending the CNI (Container Network Interface) plugin. Calico, an open-source networking solution, provides this capability. Please refer to the following:

# Inspect the CNI configuration and confirm Calico IPAM is enabled
cat /etc/cni/net.d/10-calico.conflist

# With this IPAM type, the plugin will parse the Pod annotation for IP configuration
    "ipam": {
        "type": "calico-ipam"
    },

# Add the annotation to the Pod object to request a fixed IP
annotations:
  "cni.projectcalico.org/ipAddrs": "[\"192.168.0.1\"]"

Here is an annotation example provided by Tencent Cloud to support the fixed IP feature:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    tke.cloud.tencent.com/enable-static-ip: "true"
  labels:
    k8s-app: busybox
  name: busybox
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      k8s-app: busybox
      qcloud-app: busybox
  serviceName: ""
  template:
    metadata:
      annotations:
        tke.cloud.tencent.com/vpc-ip-claim-delete-policy: Never
      creationTimestamp: null
      labels:
        k8s-app: busybox
        qcloud-app: busybox
    spec:
      containers:
      - args:
        - "10000000000"
        command:
        - sleep
        image: busybox
        imagePullPolicy: Always
        name: busybox
        resources:
          limits:
            tke.cloud.tencent.com/eni-ip: "1"
          requests:
            tke.cloud.tencent.com/eni-ip: "1"

Stateful Storage #

In most cases, StatefulSets also need to be started with persistent storage. Since version 1.13, Kubernetes has fully embraced the CSI (Container Storage Interface) standard. The default workflow is to create a StorageClass first and then request storage dynamically with PersistentVolumeClaim objects; the StorageClass’s provisioner calls the specified storage driver to create the underlying storage device and a matching PersistentVolume object. Since storage drivers vary considerably in design complexity, readers are advised to start accumulating experience with NFS storage.
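
As a concrete sketch of this workflow, the following assumes the open-source NFS CSI driver (provisioner nfs.csi.k8s.io) is already installed in the cluster; the server address and export path are placeholders. Each Pod of the StatefulSet receives its own PersistentVolumeClaim through volumeClaimTemplates:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
provisioner: nfs.csi.k8s.io        # requires the csi-driver-nfs add-on
parameters:
  server: nfs-server.example.com   # placeholder NFS server address
  share: /exports                  # placeholder export path
reclaimPolicy: Retain
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:alpine
        volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:            # one PVC per Pod: data-web-0, data-web-1, ...
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: nfs-csi
      resources:
        requests:
          storage: 1Gi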

Many readers mistakenly believe that StatefulSet plus PersistentVolume is enough to handle the deployment of all stateful applications. In my practical experience, you often need to configure the appropriate features for each application’s deployment model before a stateful application truly runs reliably. Because of this complexity, the industry introduced the Operator pattern to provide one-click deployment controllers for complex applications. If you decompose these controllers carefully, you will find they are essentially assemblies of Pod-level features, so do not be intimidated by the examples. To deploy a stateful application, you need a detailed understanding of its architecture, and then you combine that with the features Kubernetes provides.

By default, Kubernetes may schedule the Pods of a StatefulSet onto the same node. If two instances coexist on one node and that node fails, your service will be affected. So when you want to minimize downtime, you should configure podAntiAffinity.

For example, check which node each Pod of the zk StatefulSet is running on:

for i in 0 1 2; do kubectl get pod zk-$i --template {{.spec.nodeName}}; echo ""; done

kubernetes-minion-group-cxpk
kubernetes-minion-group-a5aq
kubernetes-minion-group-2g2d

All the Pods in the zk StatefulSet are deployed on different nodes.

This is because the Pods in the zk StatefulSet specify PodAntiAffinity:

          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: "app"
                        operator: In
                        values:
                        - zk
                  topologyKey: "kubernetes.io/hostname"

Use this technique flexibly whenever an application demands high fault tolerance.

Stateful Update Strategy #

The update strategy of a StatefulSet is configured through the spec.updateStrategy field, whose type accepts either OnDelete or RollingUpdate. OnDelete prevents the controller from updating Pods automatically: you must delete Pods manually before the controller creates new ones that reflect your changes. RollingUpdate, the default, performs automatic rolling updates: the controller deletes and recreates each Pod one at a time, in reverse ordinal order, and moves on to the next Pod only after the updated one is Running and Ready, while adhering to the StatefulSet’s ordering guarantees.
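
A rolling update can also be staged with a partition: only Pods whose ordinal is greater than or equal to the partition value are updated, while the rest stay on the old revision. A minimal sketch, with an illustrative partition value:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  # ...
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 1   # web-1 and above are updated; web-0 keeps the old revision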

Obviously, the default RollingUpdate strategy can take a long time to complete. If you need more flexibility, you can use open-source extensions that enhance StatefulSet, such as OpenKruise’s Advanced StatefulSet. One feature worth introducing is its strategy for in-place Pod updates: the In-Place Pod Update Strategy.

apiVersion: apps.kruise.io/v1alpha1
kind: StatefulSet
spec:
  # ...
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      podUpdatePolicy: InPlaceIfPossible
      inPlaceUpdateStrategy:
        gracePeriodSeconds: 10

Advanced StatefulSet adds a podUpdatePolicy field that lets users choose between a rebuild upgrade and an in-place upgrade.

  • ReCreate: The controller will delete the old Pod and its PVC, and then recreate them with the new version.
  • InPlaceIfPossible: The controller will first attempt an in-place upgrade of the Pod, and if that is not possible, it will resort to a rebuild upgrade. Currently, only modifications to the spec.template.metadata.* and spec.template.spec.containers[x].image fields allow for in-place upgrade.
  • InPlaceOnly: The controller only allows for in-place upgrade. Therefore, users can only modify the restricted fields mentioned in the previous point, and any attempt to modify other fields will be rejected by Kruise.

In our production environment, the most frequent change is to the image version, so this capability is particularly well suited to day-to-day application operations and management in the cloud-native ecosystem.

More importantly, when using the InPlaceIfPossible or InPlaceOnly strategies, it is necessary to add an InPlaceUpdateReady readinessGate to set the Pod to NotReady during the in-place upgrade. Here is a complete example:

apiVersion: apps.kruise.io/v1alpha1
kind: StatefulSet
metadata:
  name: sample
spec:
  replicas: 3
  serviceName: fake-service
  selector:
    matchLabels:
      app: sample
  template:
    metadata:
      labels:
        app: sample
    spec:
      readinessGates:
        - conditionType: InPlaceUpdateReady
      containers:
        - name: main
          image: nginx:alpine
  podManagementPolicy: Parallel
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      podUpdatePolicy: InPlaceIfPossible
      maxUnavailable: 2
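
With this in place, changing the image field is enough to trigger an in-place upgrade rather than a Pod rebuild. A sketch of such an update, assuming the Kruise CRDs are installed and with an illustrative target tag:

# Patch only the image; Kruise upgrades the container in place,
# keeping the Pod object, its IP, and its PVCs intact
kubectl patch statefulsets.apps.kruise.io sample --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/image", "value": "nginx:1.25-alpine"}]'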

OpenKruise also provides extensions for many other workload objects. If you are interested, you can explore them on your own; they will not be elaborated on here.

Summary #

Stateful applications are generally composed of multiple different types of images, so they cannot be scaled horizontally simply by building one image and deploying it with a Deployment object, as you would with Nginx. In the early days of deploying stateful applications, people saw only the convenience of deploying containers with YAML and did not fully recognize the gaps in Kubernetes. Helm was developed to help deploy applications, but it is best suited to individual applications; examples that deploy multiple interdependent applications are mostly toy demonstrations and cannot be considered production-ready. From the perspective of real-world operations, the most suitable production setups still require Operators that encapsulate your own deployment framework. Fortunately, there are now many open-source Operators that can serve as references.

Starting from the characteristics of stateful applications, the first concern is identity, which Kubernetes ensures through StatefulSet. For application robustness, PodAntiAffinity must be used to spread replicas across nodes. The OnDelete strategy requires manual deletion, and the default rolling update proceeds one Pod at a time, which can take a long time; to improve efficiency, open-source extensions such as OpenKruise can enhance business operability. I believe the in-place update strategy is currently the most practical one.
