27 SAE Application Batch Release and Best Practices for Zero-Downtime Offline #

Application release and service upgrades have always been an exciting yet worrisome task for developers and operations staff.

The excitement comes from the ability to introduce new features that provide users with more capabilities and value. The worry stems from the possibility of unexpected issues during the release process that could impact the stability of the business. Indeed, the likelihood of encountering problems is higher during application release and service upgrades. In this article, we will discuss how to ensure the graceful shutdown of services during the release process by utilizing the Serverless Application Engine (SAE) in a Serverless architecture.

During a routine release, have you ever run into any of the following problems?

  • Requests interrupted while the release is in progress?
  • Downstream service nodes going offline while upstream services keep calling them, resulting in request errors and business anomalies?
  • Data inconsistency caused by the release, requiring dirty data to be cleaned up afterwards?

Sometimes, we arrange releases in the early hours of the morning when business traffic is low. This often leads to anxiety, sleep deprivation, and great hardship. So, how can we solve these problems and ensure stable and efficient application releases without causing any business losses? First, let’s analyze the root causes of these issues.

Scenario Analysis #

[Figure: service invocation relationships in a microservices architecture]

The above diagram describes a common scenario of developing applications using a microservices architecture. Let’s take a look at the service invocation relationships in this scenario:

  • Services B and C register with the service registry, and services A and B discover the services they need to call from the service registry.
  • Business traffic is routed from the load balancer (SLB) to service A. Service A’s instances are configured with health checks on the SLB. When an instance of service A stops, the corresponding instance is removed from the SLB. Service A calls service B, which in turn calls service C.

The figure shows two types of traffic: North-South traffic, which is business traffic forwarded to backend servers through the SLB (for example, the path business traffic -> SLB -> A), and East-West traffic, which is traffic between services that find each other through the service registry (for example, the path A -> B). Let’s analyze these two types of traffic separately.

North-South Traffic #

Problems with North-South Traffic #

During the release of service A, instance A1 is stopped first. The SLB only notices this through its health checks and then removes the instance, which usually takes from a few seconds to a dozen or so seconds. If traffic keeps arriving at the SLB during this window, some requests are still routed to instance A1 and fail.
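To get a feel for the size of this window, here is a back-of-the-envelope sketch in Python. It assumes a simple model in which the SLB marks a backend unhealthy after a fixed number of consecutive failed probes. The check interval, threshold, and request rate are illustrative values chosen for this example, not SAE or SLB defaults.

```python
def detection_window_s(check_interval_s: float, unhealthy_threshold: int) -> float:
    """Approximate seconds before a stopped backend is removed (assumed model)."""
    return check_interval_s * unhealthy_threshold


def requests_at_risk(total_rps: float, instance_count: int, window_s: float) -> float:
    """Requests routed to the stopped instance, assuming traffic is spread evenly."""
    return total_rps / instance_count * window_s


# Illustrative numbers only: probe every 2 s, 3 failed probes, 300 requests/s, 3 instances.
window = detection_window_s(check_interval_s=2.0, unhealthy_threshold=3)
at_risk = requests_at_risk(total_rps=300.0, instance_count=3, window_s=window)
print(f"window={window:.0f}s, requests at risk={at_risk:.0f}")
```

Even with these modest assumptions, a few seconds of detection delay puts hundreds of requests at risk, which is why the order of "stop" and "remove" matters.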

How can we ensure that traffic passing through the SLB does not report errors during the release of service A? Let’s take a look at how SAE addresses this issue.

Graceful Upgrade Solution for North-South Traffic #

[Figure: graceful upgrade solution for North-South traffic]

As mentioned earlier, the reason for request failures is that the backend service instance is stopped first and then removed from the SLB. Can we remove the instance from the SLB first and then perform the upgrade?

Based on this idea, SAE provides a solution built on Kubernetes Service capabilities. When a user binds an SLB to an application in SAE, SAE creates a Service resource in the cluster and associates the application’s instances with it. The CCM (Cloud Controller Manager) component is responsible for purchasing the SLB, creating the SLB virtual server group, and adding the elastic network interface (ENI) associated with each application instance to the virtual server group. Users can then access the application instances through the SLB. When the application is released, CCM first removes the ENI of the instance from the virtual server group and only then upgrades the instance, so that no traffic is lost.
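The effect of the ordering can be illustrated with a small simulation. This is only a sketch under stated assumptions: two instances, one request per tick routed round robin, and a hypothetical `detect_delay` standing in for the health-check delay. It does not use any SAE, CCM, or SLB API.

```python
def failed_requests(order: str, ticks: int = 10, detect_delay: int = 4) -> int:
    """Count failed requests while instance 'A1' is upgraded.

    order="stop_first":   A1 stops at tick 0 and leaves the backend set only
                          after `detect_delay` ticks (health-check removal).
    order="remove_first": A1 leaves the backend set at tick 0 and is stopped
                          one tick later.
    """
    if order not in ("stop_first", "remove_first"):
        raise ValueError(f"unknown order: {order}")
    backends = ["A1", "A2"]
    running = {"A1": True, "A2": True}
    failures = 0
    for tick in range(ticks):
        if order == "stop_first":
            if tick == 0:
                running["A1"] = False
            if tick == detect_delay and "A1" in backends:
                backends.remove("A1")
        else:
            if tick == 0:
                backends.remove("A1")
            if tick == 1:
                running["A1"] = False
        # One request per tick, routed round robin over the current backend set.
        target = backends[tick % len(backends)]
        if not running[target]:
            failures += 1
    return failures


print("stop first:  ", failed_requests("stop_first"))
print("remove first:", failed_requests("remove_first"))
```

In this model, requests fail only when a stopped instance is still in the backend set, so removing the instance first leaves no window in which that can happen.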

This is SAE’s guarantee for North-South traffic during the application upgrade process.

East-West Traffic #

Issues with East-West Traffic #

After discussing the solution for North-South traffic, let’s take a look at East-West traffic. In a traditional release, when a service provider is stopped and restarted, service consumers become aware of the stopped provider node as follows:

[Figure: how consumers perceive a provider node going offline in a traditional release]

  1. Before the release, the consumer invokes service providers according to its load balancing rules, and the business operates normally.
  2. Service provider B needs to release a new version, so it operates on one of its nodes first, starting by stopping the Java process.
  3. Stopping the service involves either active or passive deregistration. Active deregistration is quasi-real-time, while the time taken by passive deregistration is determined by the registry and can be up to 1 minute in the worst case.
  4. If the application is stopped normally, the shutdown hooks of the Spring Cloud and Dubbo frameworks run as expected, and the time spent in this step is negligible.
  5. If the application is stopped abnormally, for example directly with kill -9, or because the Java application is not process 1 (PID 1) in the container and the image was built in a way that does not forward the kill signal to it, the service provider does not actively deregister the node. The registry removes it passively only after the heartbeat times out.
  6. The service registry notifies consumers that one of the provider nodes has gone offline, either by push or by polling. Push can be considered quasi-real-time, while with polling the delay is determined by the consumer’s polling interval and can be up to 1 minute in the worst case.
  7. The service consumer refreshes its service list and becomes aware that one provider node has gone offline. This step does not exist in the Dubbo framework, but Ribbon, Spring Cloud’s load balancing component, refreshes every 30 seconds by default, so this step can take up to 30 seconds in the worst case.
  8. The service consumer no longer invokes the offline node.
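The difference between active and passive deregistration in the steps above comes down to whether the stopping process gets a chance to run its cleanup code. The article's frameworks do this with JVM shutdown hooks; the Python sketch below only illustrates the same signal behavior. `Registry` and `Provider` are hypothetical in-memory stand-ins, not a real registry client.

```python
import signal


class Registry:
    """In-memory stand-in for a service registry (hypothetical)."""

    def __init__(self) -> None:
        self.nodes = set()

    def register(self, node: str) -> None:
        self.nodes.add(node)

    def deregister(self, node: str) -> None:
        self.nodes.discard(node)


class Provider:
    """A provider node that deregisters itself when it is asked to stop."""

    def __init__(self, node: str, registry: Registry) -> None:
        self.node = node
        self.registry = registry
        self.stopping = False
        registry.register(node)

    def handle_sigterm(self, signum, frame) -> None:
        # Active deregistration: runs only if the signal reaches this process.
        self.registry.deregister(self.node)
        self.stopping = True

    def install(self) -> None:
        # SIGTERM can be handled. SIGKILL (kill -9) cannot be caught, so after
        # a SIGKILL the registry has to fall back to heartbeat timeout.
        signal.signal(signal.SIGTERM, self.handle_sigterm)


registry = Registry()
provider = Provider("B1", registry)
# Call the handler directly to show its effect without sending a real signal.
provider.handle_sigterm(signal.SIGTERM, None)
print(sorted(registry.nodes), provider.stopping)
```

In a real process, `provider.install()` would be called once at startup from the main thread. If the application is not the process that receives the signal, the handler never runs, which is the abnormal case described in step 5.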

From step 2 to step 6, the worst case is up to 2 minutes with Eureka and up to 50 seconds with Nacos. Requests sent during this window may fail, so various errors can occur during the release and affect the user experience. After the release, dirty data left by half-executed operations has to be fixed. In the end, every release gets scheduled for two or three in the morning, with all the anxiety and lost sleep that entails.
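The worst-case figures quoted in the steps above can be added up to see where the delay comes from. The sketch below is simple arithmetic over those figures and treats the quasi-real-time paths as zero seconds, which is a simplification. The function and constant names are made up for this example.

```python
# Worst-case figures quoted in the steps above, in seconds.
PASSIVE_DEREGISTRATION_S = 60  # step 3: registry removes the node after heartbeat timeout
POLLING_NOTIFICATION_S = 60    # step 6: consumer learns of the change by polling
RIBBON_REFRESH_S = 30          # step 7: Ribbon refreshes its server list (Spring Cloud only)


def worst_case_delay_s(active_deregistration: bool, push: bool, ribbon: bool) -> int:
    """Upper bound on how long consumers may keep calling the stopped node."""
    delay = 0
    if not active_deregistration:
        delay += PASSIVE_DEREGISTRATION_S
    if not push:
        delay += POLLING_NOTIFICATION_S
    if ribbon:
        delay += RIBBON_REFRESH_S
    return delay


print(worst_case_delay_s(active_deregistration=False, push=False, ribbon=False))
print(worst_case_delay_s(active_deregistration=False, push=False, ribbon=True))
print(worst_case_delay_s(active_deregistration=True, push=True, ribbon=False))
```

With passive deregistration and polling, the bound is 120 seconds, which is consistent with the 2-minute worst case given for Eureka. The Nacos figure depends on settings the article does not list, so it is not modeled here.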

Graceful Upgrade Solution for East-West Traffic #

[Figure: graceful upgrade solution for East-West traffic]

Based on the analysis above, in the traditional release process the client experiences a period of failed service calls because it does not learn in time that server instances have gone offline. The traditional process relies mainly on the registry to notify consumers to update their list of service providers. So, can we bypass the registry and let the service provider notify the service consumer directly? The answer is yes, and we mainly did two things:

  1. Around the time of a release, the service provider proactively deregisters itself from the registry and marks itself as offline. Deregistration, which originally happened while the process was stopping, now happens in the pre-stop stage.
  2. When a request from a service consumer arrives at a node that has been marked offline, the node processes it normally first and then notifies the consumer that the node is offline. The consumer immediately removes the node from its call list and no longer invokes it. In this way the service provider notifies consumers directly, instead of relying on push notifications from the registry as before.
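The second point can be sketched as follows. This is a minimal in-memory model, not SAE's actual mechanism: the `provider_offline` flag carried on the response is a hypothetical way of delivering the notification, and all class names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Response:
    body: str
    provider_offline: bool  # hypothetical flag piggybacked on the response


class ProviderNode:
    def __init__(self, name: str) -> None:
        self.name = name
        self.offline = False  # set to True during the pre-stop stage

    def handle(self, request: str) -> Response:
        # The request is still processed normally; the flag tells the caller
        # not to come back to this node.
        return Response(body=f"{self.name} handled {request}",
                        provider_offline=self.offline)


class Consumer:
    def __init__(self, nodes: list) -> None:
        self.call_list = list(nodes)
        self._next = 0

    def call(self, request: str) -> str:
        if not self.call_list:
            raise RuntimeError("no providers available")
        node = self.call_list[self._next % len(self.call_list)]
        self._next += 1
        response = node.handle(request)
        if response.provider_offline:
            # Drop the node immediately, without waiting for the registry.
            self.call_list.remove(node)
        return response.body


b1, b2 = ProviderNode("B1"), ProviderNode("B2")
consumer = Consumer([b1, b2])
b1.offline = True                       # B1 enters pre-stop
print(consumer.call("req-1"))           # served by B1, which then gets removed
print([n.name for n in consumer.call_list])
```

The in-flight request still succeeds, and every later call goes to the remaining nodes, so the consumer never has to wait for a registry push or polling cycle.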

With this solution, the time consumers need to perceive an offline node drops from minutes to quasi-real time, so the application can go offline gracefully without losing business requests.

Staged Release and Gray Release #

The above section describes some of the capabilities of SAE in handling graceful offline scenarios. In the process of application upgrade, it is not enough to just gracefully take instances offline. It also requires a set of accompanying release strategies to ensure that our new business is available. SAE provides the capabilities of staged release and gray release, making the application release process more worry-free and effortless.

Let’s first introduce gray release. Suppose an application consists of 10 application instances, with each instance deployed using version Ver.1. Now, we need to upgrade each application instance to version Ver.2.

[Figure: gray release of 10 application instances from Ver.1 to Ver.2]

As can be seen from the figure, during the release process, 2 instances are first put into the gray release phase. After confirming that the business is operating normally, the remaining instances are gradually released in batches. Throughout the upgrade process, there are always instances in the running state. Each instance goes through the process of graceful offline according to the above solution, ensuring business continuity.

Now let’s take a look at staged release. Staged release supports both manual and automatic phased release. Let’s consider the same 10 application instances. Suppose we deploy all the application instances in 3 batches. According to the staged release strategy, the release process is shown in the figure below, and will not be further explained.

[Figure: staged release of 10 application instances in 3 batches]
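As a rough illustration of how such release plans can be laid out, the sketch below splits a list of instances into an optional gray batch followed by a number of regular batches. It assumes batches are made as even as possible with the larger ones first. The article does not state how SAE sizes its batches, so the split rule, the function name, and the instance names are all assumptions.

```python
def plan_batches(instances: list, batch_count: int, gray_count: int = 0) -> list:
    """Split instances into an optional gray batch plus `batch_count` batches."""
    if batch_count < 1 or not 0 <= gray_count <= len(instances):
        raise ValueError("invalid batch_count or gray_count")
    plan = []
    if gray_count:
        plan.append(instances[:gray_count])  # gray batch is released first
    rest = instances[gray_count:]
    base, extra = divmod(len(rest), batch_count)
    start = 0
    for i in range(batch_count):
        size = base + (1 if i < extra else 0)  # spread the remainder over early batches
        if size:
            plan.append(rest[start:start + size])
        start += size
    return plan


instances = [f"instance-{i}" for i in range(1, 11)]
staged = plan_batches(instances, batch_count=3)
gray = plan_batches(instances, batch_count=2, gray_count=2)
print([len(batch) for batch in staged])
print([len(batch) for batch in gray])
```

For the 10 instances in the example, the staged plan has three batches, and the gray plan starts with 2 gray instances followed by the remaining 8 split into two batches (the count of two is an arbitrary choice here). Each batch would be released only after the previous one is confirmed healthy.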

Finally, a demonstration of the gray release process for applications on SAE is available. Click the link to watch it: https://developer.aliyun.com/lesson202619009