18 Practice Chapter: Technology Testing for Graceful Traffic Switching in Applications #

After five consecutive technical discussions on application traffic drainage, I believe everyone has gained a deeper understanding of Kubernetes' service drainage architecture. As the saying goes, "the palest ink is better than the best memory." In my own practice it took me a long time to apply these drainage parameters correctly against a Kubernetes cluster, so the following exercise cases are necessary practice to reinforce your knowledge. Please follow my notes and practice along with me.

Exercise 1: Lossless Traffic Application Update under Deployment #

When updating an application, we often find that even though Kubernetes performs a rolling update, the application traffic may still be interrupted for a moment. The confusion stems from the official documentation, which emphasizes that updates are smooth. Note that this means a smooth update, not a lossless traffic update. So where exactly is the problem? After digging into the material, I found that the core problem lies in how a version update plays out across the Pod's lifecycle, as shown in the following figure. The updates to related resources such as the Pod, Endpoints, IPVS rules, and Ingress/SLB are all executed asynchronously, so the Pod's container may already be shutting down while traffic is still being processed:

3a-sync-flow

As the Pod container process lifecycle flowchart shows, the container process's state changes happen asynchronously. If the Deployment object that deploys the application does not configure a "preStop" lifecycle hook, then even after north-south traffic has been cut off, the process still needs a few more seconds to finish the sessions it is handling before it can exit gracefully. The following is the declarative configuration of the Deployment object:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      component: nginx
  progressDeadlineSeconds: 120
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
  template:
    metadata:
      labels:
        component: nginx
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: nginx
        image: xds2000/nginx-hostname
        ports:
        - name: http
          containerPort: 80
          protocol: TCP
        readinessProbe:
          httpGet:
            path: /
            port: 80
            httpHeaders:
            - name: X-Custom-Header
              value: Awesome
          initialDelaySeconds: 15
          periodSeconds: 3
          timeoutSeconds: 1
        lifecycle:
          preStop:
            exec:
              command: ["/bin/bash", "-c", "sleep 10"]

The readiness probe determines when a container is ready to start accepting request traffic. A Pod is considered ready only when all of its containers are ready. Among other things, this signal is used to decide which Pods serve as backends for a Service: while a Pod is not ready, it is removed from the Service's load balancer. The "periodSeconds" field tells the kubelet to perform the readiness probe every 3 seconds, and the "initialDelaySeconds" field tells the kubelet to wait 15 seconds before performing the first probe.

When manually deleting a specific Pod using the kubectl tool, the default value of the graceful termination period for the Pod is 30 seconds. If the time required for the preStop callback is longer than the default graceful termination period, you must modify the value of the “terminationGracePeriodSeconds” attribute to make it work properly.

If one of the Pod's containers defines a preStop callback, the kubelet runs that callback logic inside the container. If the preStop callback is still running when the graceful termination period expires, the kubelet requests a one-time, 2-second extension of the Pod's grace period.
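For instance, if the preStop hook had to sleep for 40 seconds (hypothetical values, not taken from the Deployment above), the grace period would need to be raised well beyond the 30-second default. A minimal sketch of the relevant fragment of the Pod template:

# Hypothetical values: the preStop hook sleeps for 40s, so
# terminationGracePeriodSeconds must exceed the preStop duration
# plus the time the process needs to drain in-flight requests.
spec:
  terminationGracePeriodSeconds: 70
  containers:
  - name: nginx
    lifecycle:
      preStop:
        exec:
          command: ["/bin/bash", "-c", "sleep 40"]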

Once you have mastered these lifecycle configuration attributes, a single Pod's traffic can be drained gracefully, which in turn lets the higher-level objects built on top of it support lossless traffic switching natively. A quick way to verify this is shown below.
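A rough way to check this in practice is to keep requests flowing while a rollout happens and watch for any non-200 responses. The sketch below assumes the nginx Deployment above and a Service created for it with kubectl expose:

# Expose the Deployment through a Service (assumed name: nginx)
kubectl expose deployment nginx --port=80

# In one terminal, send continuous requests through the Service
kubectl run curl-client --image=curlimages/curl --restart=Never --command -- \
  sh -c 'while true; do curl -s -o /dev/null -w "%{http_code}\n" http://nginx; sleep 0.2; done'

# In another terminal, trigger a rolling update and watch its progress;
# the request log should show 200 throughout if draining works
kubectl rollout restart deployment/nginx
kubectl rollout status deployment/nginx
kubectl logs -f curl-client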

Exercise 2: Ingress-nginx Lossless Traffic Switching for Application Update #

The Ingress object is the traffic drainage object designed by Kubernetes. The Ingress controller watches changes to a Service's Endpoints list and updates its load-balancing backend list directly. ingress-nginx's Lua-based balancer supports the Exponential Weighted Moving Average (EWMA) algorithm to smooth traffic across these changes. The following example uses the NGINX OSS version of the Ingress controller to help you understand.

Install Ingress:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-0.32.0/deploy/static/provider/cloud/deploy.yaml

Verify:

❯ kubectl get pods --all-namespaces -l app.kubernetes.io/name=ingress-nginx
NAMESPACE       NAME                                        READY   STATUS      RESTARTS   AGE
ingress-nginx   ingress-nginx-admission-create-j5f8z        0/1     Completed   0          11m
ingress-nginx   ingress-nginx-admission-patch-btfd4         0/1     Completed   1          11m
ingress-nginx   ingress-nginx-controller-866488c6d4-snp4s   1/1     Running     0          11m
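
If you want to opt into the EWMA balancer mentioned earlier, one way is to set the load-balance key in the controller's ConfigMap. This is only a sketch; the ConfigMap name and namespace below follow the deploy.yaml used above, so adjust them if your installation differs:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  load-balance: "ewma"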

Load the application:

kubectl create -f sample/apple.yaml

sample/apple.yaml:

kind: Pod
apiVersion: v1
metadata:
  name: apple-app
  labels:
    app: apple
    version: apple-v1
spec:
  containers:
  - name: apple-app
    image: hashicorp/http-echo
    args:
    - "-text=apple"
---
kind: Service
apiVersion: v1
metadata:
  name: apple-service
spec:
  selector:
    version: apple-v1
  ports:
  - port: 5678 # Default port for image

Load the application:

kubectl create -f sample/banana.yaml

sample/banana.yaml:

kind: Pod
apiVersion: v1
metadata:
  name: banana-app
  labels:
    app: banana
    version: banana-v1
spec:
  containers:
    - name: banana-app
      image: hashicorp/http-echo
      args:
        - "-text=banana"
---

kind: Service
apiVersion: v1
metadata:
  name: banana-service
spec:
  selector:
    version: banana-v1
  ports:
    - port: 5678 # Default port for image

Load Ingress rules:

# sample/ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: example-ingress
  annotations:
    ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - http:
      paths:
        - path: /apple
          backend:
            serviceName: apple-service
            servicePort: 5678
        - path: /banana
          backend:
            serviceName: banana-service
            servicePort: 5678
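
To apply the rules and confirm both paths respond, something along these lines should work (the controller Service name is assumed from the deploy.yaml above, and <EXTERNAL-IP> stands for whatever address it reports):

kubectl create -f sample/ingress.yaml

# Find the controller's external address, then test both paths
kubectl get svc -n ingress-nginx ingress-nginx-controller
curl http://<EXTERNAL-IP>/apple    # expected output: apple
curl http://<EXTERNAL-IP>/banana   # expected output: banana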

When you change the routing rules in ingress.yaml, the reverse proxy reloads its configuration, which can interrupt existing connections. To avoid this, we leave the Ingress unchanged and instead update the Service object's selector so that its Endpoints collection changes. Because the Ingress controller watches the Endpoints and updates its backend list in Lua on the fly, without reloading NGINX, traffic switches over seamlessly. An example of the update is as follows:

# Updating the Service selector changes its Endpoints; IPVS and the
# Ingress controller pick up the new Pod addresses automatically
export RELEASE_VERSION=banana-v2
kubectl patch svc banana-service -p '{"spec":{"selector":{"version":"'${RELEASE_VERSION}'"}}}'
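
For the patch above to actually shift traffic, a Pod carrying the version: banana-v2 label has to be running first. A minimal sketch of such a Pod (hypothetical, not part of the original samples):

kind: Pod
apiVersion: v1
metadata:
  name: banana-app-v2
  labels:
    app: banana
    version: banana-v2
spec:
  containers:
  - name: banana-app
    image: hashicorp/http-echo
    args:
    - "-text=banana-v2"

Once the patch lands, requests to /banana return "banana-v2": the Endpoints of banana-service changed, but the NGINX configuration was never reloaded.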

The Lua function with which ingress-nginx diffs the old and new Endpoints lists is as follows:

-- diff_endpoints compares old and new
-- and as a first argument returns what endpoints are in new
-- but are not in old, and as a second argument it returns
-- what endpoints are in old but are not in new.
-- Both return values are normalized (ip:port).
function _M.diff_endpoints(old, new)
  local endpoints_added, endpoints_removed = {}, {}
  local normalized_old = normalize_endpoints(old)
  local normalized_new = normalize_endpoints(new)

  for endpoint_string, _ in pairs(normalized_old) do
    if not normalized_new[endpoint_string] then
      table.insert(endpoints_removed, endpoint_string)
    end
  end

  for endpoint_string, _ in pairs(normalized_new) do
    if not normalized_old[endpoint_string] then
      table.insert(endpoints_added, endpoint_string)
    end
  end

  return endpoints_added, endpoints_removed
end

Exercise 3: Zero-downtime deployment with Traefik #

Because Traefik talks directly to the Kubernetes API server, switching traffic during a deployment with Traefik is even more convenient than with ingress-nginx. Traefik also acts as an Ingress controller in Kubernetes. In the second exercise we covered lossless traffic switching via the Service selector; in this third exercise we introduce three other popular methods for zero-downtime deployment: blue-green deployment, canary releases, and A/B testing. These three methods are related but distinct.

With Kubernetes' support for immutable infrastructure, multiple versions of the same software can serve requests within the same cluster. This allows interesting experiments that mix old and new versions behind routing rules and test the latest version in the production environment itself. More importantly, the new version can be rolled out gradually and, if problems appear, rolled back, all with almost no downtime.

In a blue-green deployment, "green" refers to the current stable version of the application and "blue" to the upcoming version carrying new features and fixes. Both versions run in the same production environment, while the proxy server (such as Traefik) ensures that only requests sent to a private address reach the blue instances. An example is shown in the following figure:

3a-blue-green-deploy
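
As a rough illustration (assuming Traefik 2.x with its Kubernetes CRD provider installed, plus hypothetical my-app-green and my-app-blue Services), the split could be expressed with two IngressRoutes:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: my-app-public
spec:
  entryPoints:
  - web
  routes:
  - match: Host(`app.example.com`)   # public traffic stays on the stable (green) version
    kind: Rule
    services:
    - name: my-app-green
      port: 80
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: my-app-preview
spec:
  entryPoints:
  - web
  routes:
  - match: Host(`preview.app.example.com`)   # the private address reaches only the blue version
    kind: Rule
    services:
    - name: my-app-blue
      port: 80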

Canary releases take blue-green testing a step further by cautiously rolling new features and patches into the live production environment. The routing configuration lets the current stable version handle most requests, while a limited share of requests is routed to instances of the new "canary" version. An example is shown in the following figure:

3a-canary-releases.png
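
A sketch of such a split with Traefik 2.x CRDs (hypothetical my-app-stable and my-app-canary-v2 Services; the weighted TraefikService sends roughly one request in ten to the canary):

apiVersion: traefik.containo.us/v1alpha1
kind: TraefikService
metadata:
  name: my-app-weighted
spec:
  weighted:
    services:
    - name: my-app-stable      # the stable version keeps most of the traffic
      port: 80
      weight: 9
    - name: my-app-canary-v2   # the canary version receives a small share
      port: 80
      weight: 1
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: my-app
spec:
  entryPoints:
  - web
  routes:
  - match: Host(`app.example.com`)
    kind: Rule
    services:
    - name: my-app-weighted
      kind: TraefikService

Adjusting the weights gradually shifts more traffic to the canary, and setting the canary's weight back to zero takes it out of rotation without touching the stable Service.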

A/B testing is sometimes confused with the previous two techniques, but it has its own purpose: evaluating two different versions of an upcoming release to see which is more successful. This strategy is common in UI development. For example, suppose a new feature is about to be introduced into the application, but it is unclear how best to expose it to users. To find out, two versions of the UI containing the feature run simultaneously, and the proxy routes a limited number of requests to each version. An example is shown in the following figure:

A/B Testing
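
One way to express this with Traefik 2.x (hypothetical my-ui-variant-a and my-ui-variant-b Services; users tagged with an X-Variant: b header are steered to variant B) might be:

apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: my-ui
spec:
  entryPoints:
  - web
  routes:
  - match: Host(`app.example.com`) && Headers(`X-Variant`, `b`)   # tagged users get variant B
    kind: Rule
    services:
    - name: my-ui-variant-b
      port: 80
  - match: Host(`app.example.com`)   # everyone else gets variant A
    kind: Rule
    services:
    - name: my-ui-variant-a
      port: 80

By default Traefik gives longer rules higher priority, so the header-matching route wins for tagged requests and all other traffic falls through to variant A.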

These techniques are valuable for testing modern cloud-native software architectures, especially compared to traditional waterfall deployment models. If used properly, they can help discover unforeseen regressions, integration failures, performance bottlenecks, and availability issues in production environments before new code enters stable production versions.

What these three methods have in common is that they rely on the container and Deployment conveniences Kubernetes provides, using cloud-native networking to route requests to test deployments while minimizing disruption to production code. This is a powerful combination, and it is precisely where Traefik shines: used wisely, it can effectively reduce the application's overall downtime to zero.

Summary #

The complexity of updating application traffic comes from changes in application state. The examples above only verify the idea of lossless traffic switching in a limited environment. In a real scenario we also have to consider the impact of associated systems such as databases and business applications, which makes it hard to switch as freely as with the stateless applications in these exercises. Even so, these factors do not change one fact: Kubernetes really can achieve lossless traffic switching through its parameters. It is a workable piece of infrastructure, and you need to understand and master the implementation details of these basic objects so that, with reasonable configuration, you can build the immutable infrastructure you need.
