02 In-depth Understanding of Kubernetes Scheduling Objects #

Kubernetes is a distributed container application orchestration system. When we use it to host business workloads, the main objects we describe are Deployment, ReplicaSet, StatefulSet, and DaemonSet. Readers may wonder why the objects we manage are not Pods, and why we do not simply discuss how to flexibly configure Pod parameters. In fact, these objects are extensions and encapsulations of the Pod object, solidified in the Kubernetes system as the core workload APIs. It is therefore worth reviewing these objects, configuring them sensibly for production scenarios, and letting Kubernetes better support our business needs. This article starts from practical deployment scenarios, analyzes the factors that need to be considered in each, and distills a set of flexible practices for using these objects.

Conventional Business Container Deployment Strategies #

Strategy 1: Force at Least 2 Running Container Instances #

For conventional business containers, Kubernetes provides the standard Deployment object. As the name suggests, its purpose is to deploy container applications, and a Deployment manages business container Pods. Because container technology has most of the characteristics of virtual machines, users often mistake containers for the next generation of virtual machines. To ordinary users, virtual machines give an impression of stability and reliability, but classifying business container Pods as equally stable, long-lived instances is a complete misunderstanding. Pods are designed as short-lived instances and cannot persist process state the way virtual machines can. Because of this fragility, every release should run multiple replicas, with a sensible minimum of 2.

Example of deploying multiple replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80
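
To apply and verify (assuming the manifest is saved as nginx-deployment.yaml):

kubectl apply -f nginx-deployment.yaml
kubectl get deployment nginx-deployment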

Strategy 2: Using Node Affinity and Pod Affinity/Anti-Affinity to Ensure High Availability of Pods #

When deploying multiple replicas of business containers, pay close attention to one fact: the default Kubernetes scheduling policy selects the host with the most idle resources to place container applications, without considering the actual high-availability needs of the business. The more business containers you deploy in the cluster, the greater the business risk: once a host carrying your business containers crashes, it triggers a container restart storm. To achieve fault tolerance and high availability, we need to plan deployments with Node affinity and Pod affinity/anti-affinity. Note that the Kubernetes scheduling interface is open; you can implement your own scheduling policy to replace the default one. Here, however, we try to achieve our goal with native Kubernetes capabilities as far as possible.

In order to better understand the importance of high availability, let’s delve into some actual business scenarios.

First, Kubernetes is not the Borg system Google uses internally. Most small and medium-sized enterprises deploy Kubernetes on manually scaled private resource pools: when you deploy containers into the cluster, it will not automatically add hosts to satisfy resource requirements. Even on public cloud Kubernetes services, elastic scaling of the underlying resources is only available when you choose Serverless Kubernetes. Many traditional enterprises care most about elastic scaling when adopting Kubernetes, yet within a limited static resource pool they can only meet elasticity needs by dynamically starting and stopping Pods. A rough analogy: converting an apartment from single occupancy to shared occupancy does not actually add floor space. Given these limited resources, Kubernetes provides labels: you can attach arbitrary labels to hosts and Pods, and flexible use of these labels helps you quickly achieve high availability for business operations.
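
For example, a node can be labeled so that workloads target it through nodeSelector or affinity rules (the node name node-1 and the label disktype=ssd below are placeholders):

kubectl label nodes node-1 disktype=ssd
kubectl get nodes --show-labels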

Second, in practice you will find that to control business containers efficiently, you must manage host resources rather than let the Kubernetes scheduler place containers freely. This was not a problem in the early stages when resources were abundant. As the business grows more complex and more containers land in the resource pool, hidden risks surface: because scheduling resources were never managed, many critical services end up running on the same server, and when that host crashes the disaster becomes hard to handle. In practical scenarios, therefore, the relationships between businesses should be clarified and host resources modularized into units. For example, treat two hosts as one unit, deploy a fixed set of business Pods to it, and ensure those Pods are spread across both hosts, so that the failure of either host does not affect the main business, achieving true high availability.

Example of Node Affinity:

pods/pod-with-node-affinity.yaml 

apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/e2e-az-name
            operator: In
            values:
            - e2e-az1
            - e2e-az2
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: another-node-label-key
            operator: In
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: gcr.azk8s.cn/google-samples/node-hello:1.0

Currently, there are two types of node affinity: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. You can think of them as “hard” and “soft” types. The former specifies rules that must be met to schedule a pod to a node, while the latter specifies preferences that the scheduler will attempt to fulfill but cannot guarantee.

Examples of Pod Affinity and Anti-Affinity #

For example, the use of anti-affinity can help avoid single points of failure:

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: "app"
                  operator: In
                  values:
                    - zk
            topologyKey: "kubernetes.io/hostname"

This means that Pods carrying the label app=zk must not be co-located: with the node hostname as the topology key, the scheduler places each such Pod on a different node.
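
The hard rule above leaves Pods unschedulable when there are fewer eligible nodes than replicas. If that is too strict for your cluster, the same intent can be expressed as a soft preference instead; a sketch:

    affinity:
      podAntiAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: "app"
                  operator: In
                  values:
                    - zk
            topologyKey: "kubernetes.io/hostname"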

Strategy 3: Use preStop Hook and readinessProbe to Ensure Smooth Service Updates without Interruption #

After deploying an application, the next step is to perform service updates. Ensuring that the business is not disrupted during container updates is the most important concern. Here is an example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      component: nginx
  template:
    metadata:
      labels:
        component: nginx
    spec:
      containers:
      - name: nginx
        image: xds2000/nginx-hostname
        ports:
          - name: http
            hostPort: 80
            containerPort: 80
            protocol: TCP
        readinessProbe:
          httpGet:
            path: /
            port: 80
            httpHeaders:
              - name: X-Custom-Header
                value: Awesome
          initialDelaySeconds: 15
          timeoutSeconds: 1
        lifecycle:
          preStop:
            exec:
              command: ["/bin/bash", "-c", "sleep 30"]

Add a readinessProbe to the container inside the Pod. Typically, the container listens on an HTTP port after it has fully started. The kubelet sends readiness probes to check whether the container is ready; if it responds correctly, the container is marked Ready. Only when all containers in a Pod are ready does the Endpoint Controller add the Pod's IP:Port to the Endpoints list, after which kube-proxy updates the forwarding rules on each node. Requests forwarded to the new Pod immediately afterward can then be processed normally, avoiding connection errors.

Add a preStop hook to the container inside the Pod so that the Pod sleeps for a period of time before it is actually terminated. This gives the Endpoint Controller and kube-proxy time to remove the Pod from the Endpoints list and update the forwarding rules. During this window the Pod is in the Terminating state, and even if a request is forwarded to it before the rules are fully updated, it can still be processed normally, because the process is only sleeping and has not yet been killed.
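
One caveat: the preStop hook runs inside the Pod's termination grace period, which defaults to 30 seconds. A 30-second sleep leaves no time for the process itself to shut down gracefully, so with a hook this long it is safer to extend the grace period in the Pod spec, for example:

    spec:
      terminationGracePeriodSeconds: 45  # must cover the preStop sleep plus shutdown time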

Strategy 4: North-South Traffic Forwarding Using Wildcard Domain Names #

In a typical cluster, a public IP is exposed as the traffic entry point (through either an Ingress or a Service), and a wildcard domain is pointed at that IP (e.g. *.test.foo.io). We want to forward requests to different backend Services based on the Host in each request: requests to a.test.foo.io should go to my-svc-a, and requests to b.test.foo.io to my-svc-b. Native Kubernetes Ingress rules do not support this kind of wildcard-to-Service mapping, so we use Nginx's Lua scripting capability (OpenResty) to implement the forwarding.

Nginx proxy example (proxy.yaml):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    component: nginx
  name: proxy
spec:
  replicas: 1
  selector:
    matchLabels:
      component: nginx
  template:
    metadata:
      labels:
        component: nginx
    spec:
      containers:
      - name: nginx
        image: "openresty/openresty:centos"
        ports:
        - name: http
          containerPort: 80
          protocol: TCP
        volumeMounts:
        - mountPath: /usr/local/openresty/nginx/conf/nginx.conf
          name: config
          subPath: nginx.conf
      - name: dnsmasq
        image: "janeczku/go-dnsmasq:release-1.0.7"
        args:
          - --listen
          - "127.0.0.1:53"
          - --default-resolver
          - --append-search-domains
          - --hostsfile=/etc/hosts
          - --verbose
      volumes:
      - name: config
        configMap:
          name: configmap-nginx
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    component: nginx
  name: configmap-nginx
data:
  nginx.conf: |-
    worker_processes  1;

    error_log  /error.log;

    events {
        accept_mutex on;
        multi_accept on;
        use epoll;
        worker_connections  1024;
    }

    http {
        include       mime.types;
        default_type  application/octet-stream;
        log_format  main  '$time_local $remote_user $remote_addr $host $request_uri $request_method $http_cookie '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for" '
                          '$request_time $upstream_response_time "$upstream_cache_status"';

        log_format  browser '$time_iso8601 $cookie_km_uid $remote_addr $host $request_uri $request_method '
                          '$status $body_bytes_sent "$http_referer" '
                          '"$http_user_agent" "$http_x_forwarded_for" '
                          '$request_time $upstream_response_time "$upstream_cache_status" $http_x_requested_with $http_x_real_ip $upstream_addr $request_body';

        log_format client '{"@timestamp":"$time_iso8601",'
                          '"time_local":"$time_local",'
                          '"remote_user":"$remote_user",'
                          '"http_x_forwarded_for":"$http_x_forwarded_for",'
                          '"host":"$server_addr",'
                          '"remote_addr":"$remote_addr",'
                          '"http_x_real_ip":"$http_x_real_ip",'
                          '"body_bytes_sent":$body_bytes_sent,'
                          '"request_time":$request_time,'
                          '"status":$status,'
                          '"upstream_response_time":"$upstream_response_time",'
                          '"upstream_response_status":"$upstream_status",'
                          '"request":"$request",'
                          '"http_referer":"$http_referer",'
                          '"http_user_agent":"$http_user_agent"}';

        access_log  /access.log  main;

        sendfile        on;

        keepalive_timeout 120s 100s;
        keepalive_requests 500;
        send_timeout 60000s;
        client_header_buffer_size 4k;
        proxy_ignore_client_abort on;
        proxy_buffers 16 32k;
        proxy_buffer_size 64k;

        proxy_busy_buffers_size 64k;

        proxy_send_timeout 60000;
        proxy_read_timeout 60000;
        proxy_connect_timeout 60000;
        proxy_cache_valid 200 304 2h;
        proxy_cache_valid 500 404 2s;
        proxy_cache_key $host$request_uri$cookie_user;
        proxy_cache_methods GET HEAD POST;

        proxy_redirect off;
        proxy_http_version 1.1;
        proxy_set_header Host                $http_host;
        proxy_set_header X-Real-IP           $remote_addr;
        proxy_set_header X-Forwarded-For     $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto   $scheme;
        proxy_set_header X-Frame-Options     SAMEORIGIN;
        server_tokens off;
        client_max_body_size 50G;
        add_header X-Cache $upstream_cache_status;
        autoindex off;

        resolver      127.0.0.1:53 ipv6=off;

        server {
            listen 80;

            location / {
                set $service  '';
                rewrite_by_lua '
                    local host = ngx.var.host
                    local m = ngx.re.match(host, "(.+).test.foo.io")
                    if m then
                        ngx.var.service = "my-svc-" .. m[1]
                    end
                ';
                proxy_pass http://$service;
            }
        }
    }
```

Example using Service (service.yaml):

apiVersion: v1
kind: Service
metadata:
  labels:
    component: nginx
  name: service-nginx
spec:
  type: LoadBalancer
  ports:
  - name: http
    port: 80
    targetPort: http
  selector:
    component: nginx

Example using Ingress (ingress.yaml):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: ingress-nginx
spec:
  rules:
  - host: "*.test.foo.io"
    http:
      paths:
      - backend:
          serviceName: service-nginx
          servicePort: 80
        path: /
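
Once the wildcard DNS record points at the exposed IP, the host-based routing can be verified with curl (the address 1.2.3.4 below is a placeholder for your LoadBalancer IP):

curl -H "Host: a.test.foo.io" http://1.2.3.4/
curl -H "Host: b.test.foo.io" http://1.2.3.4/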

StatefulSet Deployment Strategy for Stateful Applications #

StatefulSets are designed to be used with stateful applications and distributed systems. To understand the basic features of StatefulSets, we will use a StatefulSet to deploy a simple web application.

Create a StatefulSet example (web.yaml):

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: gcr.azk8s.cn/google_containers/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

Note that one of the characteristics of a StatefulSet object is that the Pods in the StatefulSet have a unique ordinal index and stable network identity. The output will ultimately look like this:

kubectl get pods -l app=nginx
NAME      READY     STATUS    RESTARTS   AGE
web-0     1/1       Running   0          1m
web-1     1/1       Running   0          1m

Many documents are misleading when discussing the StatefulSet object, often equating container instances with stateful instances; this is inaccurate. In the world of Kubernetes, it is containers with stable network identities that are considered stateful applications. For example:

for i in 0 1; do kubectl exec web-$i -- sh -c 'hostname'; done
web-0
web-1

In addition, we can use kubectl run to start a container that provides the nslookup command. By running nslookup against the Pods' hostnames, we can check their DNS records within the cluster. Here is an example:

kubectl run -i --tty --image busybox:1.28 dns-test --restart=Never --rm   
nslookup web-0.nginx
Server:    10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local

Name:      web-0.nginx
Address 1: 10.244.1.6

nslookup web-1.nginx
Server:    10.0.0.10
Address 1: 10.0.0.10 kube-dns.kube-system.svc.cluster.local

Name:      web-1.nginx
Address 1: 10.244.2.6
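
These records are served through the headless Service (clusterIP: None): each Pod gets a stable DNS entry of the form <pod-name>.<service-name>.<namespace>.svc.cluster.local, so in the default namespace the fully qualified name of the first Pod is:

web-0.nginx.default.svc.cluster.local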

Strategy 5: Flexible Use of StatefulSet Pod Management Strategy #

For most distributed microservice systems, the strict ordering guarantee of StatefulSet is unnecessary; these systems only require uniqueness and stable identity. To speed up deployment, we can set .spec.podManagementPolicy accordingly.

The Parallel pod management policy tells the StatefulSet controller to launch or terminate all Pods in parallel, rather than waiting for each Pod to become Running and Ready, or to be fully terminated, before starting or terminating the next. Here is an example:

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: "nginx"
  podManagementPolicy: "Parallel"
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: gcr.azk8s.cn/google_containers/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi
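
Watching the Pods after applying this manifest should show web-0 and web-1 being created simultaneously rather than one after the other:

kubectl get pods -w -l app=nginx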

Deployment Strategy for Business Operations Containers #

When deploying Kubernetes extensions such as DNS, Ingress, and Calico, a daemon program must run on every worker node, which is exactly what the DaemonSet object is for. By default, a DaemonSet adopts a rolling update strategy to update its containers, which can be confirmed by executing the following command:

kubectl get ds/<daemonset-name> -o go-template='{{.spec.updateStrategy.type}}{{"\n"}}'

RollingUpdate
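
For reference, a minimal DaemonSet manifest looks like the following sketch (the node-agent name and the busybox stand-in are placeholders for a real per-node daemon such as a log collector):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      name: node-agent
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: node-agent
    spec:
      containers:
      - name: node-agent
        image: busybox
        command: ["sh", "-c", "while true; do sleep 3600; done"]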

In daily work, we only need to replace the image for the daemon process:

kubectl set image ds/<daemonset-name> <container-name>=<container-new-image>

Check the rolling update status to confirm the current progress:

kubectl rollout status ds/<daemonset-name>

When the rolling update is complete, the output will be:

daemonset "<daemonset-name>" successfully rolled out

In addition, we often need to run scripts on a schedule. Such requirements can be handled with the CronJob object provided by Kubernetes. Here is an example:

apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 0
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
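
After applying the manifest, the schedule and the Jobs it spawns can be checked with:

kubectl get cronjob hello
kubectl get jobs --watch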

Summary #

This article starts from actual business needs to help readers understand the stable workload orchestration objects of Kubernetes: Deployment, StatefulSet, DaemonSet, and CronJob. The content distills practical experience shared by the industry, strips away the verbose or incorrect introductions found in many documents, and helps readers establish sensible usage strategies for these objects. Beyond the core orchestration objects, Kubernetes also provides extension interfaces: with the Operator programming framework you can define the custom orchestration objects you need, codify your operational experience, and make your continuous deployment process more convenient and efficient.