
08 System Isolation How to Deal with High Concurrency Traffic Surges #

Hello, I am Xu Changlong, and today I want to talk to you about how to achieve effective system isolation.

I used to work as an architect in an education and training company. During a renewal campaign, our system experienced a massive crash. Around fifty thousand students were operating simultaneously, and a large number of requests instantly overwhelmed our servers. This led to a backlog of requests on the server side, ultimately resulting in the depletion of system resources and a halt in response. We had to restart the services and implement rate limiting on the interfaces before the services could recover.

Upon reflection, we realized that we had a habit of turning common functions and data into intranet services. While this approach enhances service reusability, it also makes our services heavily reliant on intranet services. When the external network is hit by a traffic surge, the intranet is also impacted by the amplified traffic. Excessive traffic easily causes the intranet services to crash, leading to the entire website becoming unresponsive.

After analyzing the incident in detail, we unanimously concluded that the core issue behind this massive system crash was the lack of proper system isolation. The business operations were highly susceptible to mutual impact.

Image

If system isolation is implemented effectively, when faced with high traffic impact, it will only affect the targeted application service. Even if a certain business operation crashes as a result, it will not affect the normal operation of other businesses. This requires our architecture to have the ability to isolate multiple applications and regulate inbound and outbound network traffic. Only then can we ensure system stability.

Split Deployment and Physical Isolation #

In order to improve the stability of the system, we have decided to perform isolation transformation on the system, as shown in the following diagram:

Each internal and external network service is deployed in a separate cluster, and each project has its own gateway and database. External network services must go through the gateway to reach the internal network, and data is synchronized from the external network to the internal network through Kafka.

Gateway Isolation and Resilient Circuit Breaker #

In this transformation plan, there are two types of gateways: the external gateway and the internal gateway. Each business has its own independent external gateway (which can be adjusted as needed) to limit the external traffic. When the instantaneous traffic exceeds the system’s capacity, the gateway will queue and block the requests that exceed the limit for a while until the server’s peak QPS passes. Compared to directly rejecting client requests, this approach provides a better user experience.

To access the internal interface from the external network, it is necessary to go through the internal gateway. When the external network requests the internal interface, the internal gateway will authenticate the source system and the target interface. Only registered and authorized external services can access the authorized internal interfaces. This strict management ensures the control of interface invocations between systems.

Image

At the same time, during development we must keep monitoring the internal gateway and trip its circuit breaker when traffic rises. This keeps external services from depending too heavily on internal interfaces, preserves the independence of the external services, and shields the internal network from external traffic. Additionally, the external services should be able to operate on their own for at least one hour even if the connection to the internal gateway is lost.

However, as you may have noticed, this kind of isolation blocks real-time invocation of internal network interfaces, which makes development considerably harder. Keep in mind that external services often depend on internal services for the basic data they need in order to function. Moreover, if the internal and external networks both make decisions based on the same set of data, it is easy to end up in conflict.

Reducing Interaction with the Internal API #

To prevent shared data from being modified by multiple systems at the same time, before the event we push the data and inventory of the participating products to the external network service and then lock them automatically. This prevents other businesses and the backend from modifying that data. If sales must be halted, the backend can call the frontend business interface directly. During the event, new products can still be added to the external business, but products can only be added, never removed.

This implementation ensures the uniqueness of data decisions for a period of time and also ensures the isolation between the internal and external networks.
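The push-and-lock rule can be sketched as follows. This is a minimal illustration, not the real system: the `EventCatalog` type and its methods are hypothetical, showing only the invariant that locked products reject backend edits while new products can still be added.

```go
package main

import (
	"errors"
	"sync"
)

// ErrLocked means the product is frozen for the event; backend edits fail.
var ErrLocked = errors.New("product is locked for the event")

// EventCatalog holds stock pushed to the external service before the
// event. Locked products reject backend updates; products can be added
// during the event but never removed.
type EventCatalog struct {
	mu     sync.Mutex
	stock  map[string]int
	locked map[string]bool
}

func NewEventCatalog() *EventCatalog {
	return &EventCatalog{stock: map[string]int{}, locked: map[string]bool{}}
}

// Push loads a product's stock and locks it for the event.
func (c *EventCatalog) Push(productID string, qty int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.stock[productID] = qty
	c.locked[productID] = true
}

// BackendUpdate is what the internal admin system would call;
// it is rejected while the product is locked.
func (c *EventCatalog) BackendUpdate(productID string, qty int) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.locked[productID] {
		return ErrLocked
	}
	c.stock[productID] = qty
	return nil
}

// Unlock is called after the event peak so normal operations resume.
func (c *EventCatalog) Unlock(productID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.locked[productID] = false
}
```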

However, note that the lock here exists only to keep the data synchronized; it should be released once the event's peak period has passed. Otherwise it makes our business less flexible.

Because the transaction results of the event must be synchronized back to the internal network while the external network continues trading, the lock has to be maintained during synchronization. Otherwise the data flow can accidentally become bidirectional, which easily leads to confusion, and the system would be very difficult to repair once it runs into problems like those shown in the following figure:

From the figure, we can see that the two systems have completely independent data because there is no real-time interactive interface. However, when transmitting external network data back to the internal network, if the inventory is transferred between the two systems, it is easy to encounter synchronization conflicts, resulting in confusion. So how can we avoid similar problems?

In fact, only by ensuring that data synchronization is one-way, can we cancel the mutual lock operation. We can stipulate that all inventory decisions are made by the external network business service. The backend must go through the external network business decision before performing any inventory operations. This approach is more flexible than locking data. When the external network completes a transaction, it can only push the transaction result to the internal network through a queue.
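The "single decision owner" rule above can be sketched like this. This is an assumption-laden illustration, not the real service: the `ExternalStock` type is hypothetical, showing only that every inventory decision, including backend-initiated ones, goes through one authoritative check-and-decrement.

```go
package main

import (
	"errors"
	"sync"
)

// ErrSoldOut means the deduction was refused by the decision owner.
var ErrSoldOut = errors.New("insufficient stock")

// ExternalStock is the single owner of inventory decisions. The backend
// never writes stock directly; it calls Deduct here, and results flow
// one way (external -> internal) afterwards.
type ExternalStock struct {
	mu    sync.Mutex
	stock map[string]int
}

func NewExternalStock(initial map[string]int) *ExternalStock {
	return &ExternalStock{stock: initial}
}

// Deduct atomically checks and decrements stock; all callers must treat
// its answer as authoritative, which keeps the data flow unidirectional.
func (s *ExternalStock) Deduct(productID string, qty int) (remaining int, err error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.stock[productID] < qty {
		return s.stock[productID], ErrSoldOut
	}
	s.stock[productID] -= qty
	return s.stock[productID], nil
}
```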

In fact, synchronizing data through queues is not easy either. There are many processes and details we need to refine to reduce cases of data falling out of sync. Fortunately, the queue we use is very mature and provides many convenient features that help us reduce this risk.

Now let’s take a look at the overall data flow, as shown in the following figure:

Image

The backend system pushes data to Redis or the database, and the external network service synchronizes results to the internal network through Kafka. Inventory deductions must go through the external network service, and only after a deduction succeeds can the synchronization operation be performed.

Distributed Queue Flow Control and Offline Synchronization #

As we mentioned earlier, we use Kafka distributed queue for data synchronization between the external network and the internal network mainly because it has the following advantages:

  • The queue has good throughput and can dynamically scale to handle various traffic scenarios.
  • The number of internal network consumer threads can be dynamically controlled to achieve controllable internal network traffic.
  • During peak periods, the internal network consumer service can go offline temporarily for maintenance and upgrade operations.
  • If the internal network service encounters a bug that leads to data loss, we can replay the messages in the queue to achieve re-consumption.
  • Kafka partitions messages; within a partition, messages are ordered and rarely arrive out of sequence, which helps us achieve sequential synchronization.
  • The content of messages can be saved for a long time, making it convenient and transparent to search for them by TraceID, which is beneficial for troubleshooting various issues.

Data synchronization between two systems is a complex and cumbersome task, but using Kafka can turn this real-time process into an asynchronous one. Combined with message replay and traffic control, the whole process becomes much easier.

The most difficult step in “data synchronization” is ensuring order. Next, let me explain how we achieved this.

When a user places an order for a product in the external network business system, the external network service will deduct the inventory from the local cache. After the inventory deduction is successful, the external network will create an order and send a “create order” message to the message queue. When the user pays the order in the external network business, the order status will be updated to “paid”, and a “payment success” message will be sent to the internal network via the message queue. The code for sending messages is as follows:

    type ShopOrder struct {
       TraceId    string `json:"trace_id"`      // trace id for troubleshooting
       OrderNo    string `json:"order_no"`      // order number
       ProductId  string `json:"product_id"`    // course id
       Sku        string `json:"sku"`           // course specification sku
       ClassId    int32  `json:"class_id"`      // class id
       Amount     int32  `json:"amount,string"` // amount, in cents
       Uid        int64  `json:"uid,string"`    // user uid
       Action     string `json:"action"`        // current action: create: create order, pay: pay order, refund: refund, close: close order
       Status     int16  `json:"status"`        // current order status: 0 created, 1 paid, 2 refunded, 3 closed
       Version    int32  `json:"version"`       // version, a time version generated from the current time in milliseconds; if a received message's version is older than the last applied operation, the event is ignored
       UpdateTime int32  `json:"update_time"`   // last update time
       CreateTime int32  `json:"create_time"`   // order creation date
    }

    // send the message to the internal network order system
    resp, err := sendQueueEvent("order_event", ShopOrder{ /* ...omitted */ }, partition /* partition chosen for this message */)
    if err != nil {
      return nil, err
    }

    return resp, nil

As you can see, when sending a message, we have already determined which partition the message should go to based on certain criteria (such as the order number and uid). Messages within the same partition in Kafka are ordered.
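One common way to pick that partition is to hash a stable key such as the order number, so every message for the same order lands in the same partition. This sketch is an assumption about the selection logic, not the system's actual code; the function name is hypothetical.

```go
package main

import "hash/fnv"

// pickPartition maps an order number to a Kafka partition index so that
// all messages for the same order go to one partition and stay ordered.
func pickPartition(orderNo string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(orderNo)) // hash.Hash32 Write never returns an error
	return int(h.Sum32() % uint32(numPartitions))
}
```

Because the mapping is deterministic, "create order", "pay order", and any later "refund" for one order are consumed in the order they were produced.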

Why do we need to ensure message consumption order? The core reason is that our data operations must be executed in order. If they are not, many strange scenarios may occur.

For example, in the sequence of “user creates order, pays order, refunds order”, the consumer process may receive the refund message first. However, since the create order and pay order messages have not been received yet, the refund operation cannot be performed at this time.

Of course, this is just a simple example. If there are more steps out of order, the data will be even more chaotic. Therefore, if we want to achieve data synchronization, we must try our best to ensure that the data is in order.

However, as we mentioned when discussing the advantages of Kafka, although the queue can ensure orderliness most of the time, there can still be out-of-order situations in extreme scenarios. Therefore, we need to make our business logic compatible. Even if the problem cannot be automatically resolved, relevant logs should be recorded for subsequent troubleshooting.
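The `Version` field in the message struct is what makes this compatibility possible: a consumer can record the latest version applied per order and skip anything older. The following guard is a minimal sketch of that idea; the `VersionGuard` type is hypothetical.

```go
package main

import "sync"

// VersionGuard drops stale messages: the consumer records the newest
// version applied per order and ignores any message whose version is
// not newer, matching the Version field carried by ShopOrder messages.
type VersionGuard struct {
	mu     sync.Mutex
	latest map[string]int32 // orderNo -> last applied version
}

func NewVersionGuard() *VersionGuard {
	return &VersionGuard{latest: map[string]int32{}}
}

// ShouldApply reports whether the message is newer than what has been
// applied; stale messages should be skipped and logged for later review.
func (g *VersionGuard) ShouldApply(orderNo string, version int32) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if version <= g.latest[orderNo] {
		return false // out-of-order or duplicate: ignore, but log it
	}
	g.latest[orderNo] = version
	return true
}
```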

It is not difficult to see that due to the requirement of “orderliness”, our data synchronization faces great challenges. Fortunately, Kafka can save messages for a long time. If a problem occurs during synchronization, we can not only use logs to fix the failure but also replay the traffic during the failure period (replay must ensure synchronization idempotence).
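Replay idempotence can be achieved by deduplicating on the message's TraceID: a message seen before is acknowledged but not re-applied. This in-memory sketch is a hypothetical simplification (a real system would persist the seen set, e.g. in Redis or the database, so it survives restarts).

```go
package main

import "sync"

// Deduper makes replayed consumption idempotent: each message carries a
// TraceId, and a message seen before is acknowledged but not re-applied.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func NewDeduper() *Deduper {
	return &Deduper{seen: map[string]struct{}{}}
}

// FirstTime reports whether traceID has not been processed yet and
// records it; replays of the same message return false.
func (d *Deduper) FirstTime(traceID string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if _, ok := d.seen[traceID]; ok {
		return false // already applied once: safe to skip on replay
	}
	d.seen[traceID] = struct{}{}
	return true
}
```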

This feature allows us to perform flexible operations, such as temporarily suspending the internal network consumer service during peak traffic periods and resuming it once the system is stable, ensuring the completion of user transactions.

In addition to data synchronization, we also need to control the traffic in the internal network. We can control the speed of internal network traffic by dynamically controlling the number of threads.
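A runtime-resizable worker pool is one way to realize that thread control. This sketch is illustrative only; the `ConsumerPool` type, its buffer size, and the resize mechanism are assumptions, not the production design.

```go
package main

import "sync"

// ConsumerPool throttles internal-network traffic by bounding how many
// messages are processed concurrently; Resize changes that bound live.
type ConsumerPool struct {
	mu      sync.Mutex
	jobs    chan func()
	stopped []chan struct{} // one stop signal per running worker
}

func NewConsumerPool(workers int) *ConsumerPool {
	p := &ConsumerPool{jobs: make(chan func(), 1024)}
	p.Resize(workers)
	return p
}

// Submit enqueues a message handler for some worker to run.
func (p *ConsumerPool) Submit(fn func()) { p.jobs <- fn }

// Resize grows or shrinks the worker count; shrinking to zero pauses
// consumption entirely, e.g. during a traffic peak.
func (p *ConsumerPool) Resize(n int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for len(p.stopped) < n { // grow: start another worker
		stop := make(chan struct{})
		p.stopped = append(p.stopped, stop)
		go func() {
			for {
				select {
				case <-stop:
					return
				case fn := <-p.jobs:
					fn()
				}
			}
		}()
	}
	for len(p.stopped) > n { // shrink: signal the last worker to exit
		last := len(p.stopped) - 1
		close(p.stopped[last])
		p.stopped = p.stopped[:last]
	}
}

// Workers reports the current concurrency.
func (p *ConsumerPool) Workers() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.stopped)
}
```

Resizing to zero is exactly the "temporarily suspend the internal network consumer" operation mentioned earlier; Kafka retains the messages, so consumption simply resumes when workers are added back.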

Alright, that’s all for today’s lesson. I believe you now have a solid understanding of how to achieve system isolation, and I hope you can put this solution into practice in your own production systems.

Summary #

System isolation requires a lot of time and effort to polish. This lesson discussed key features that can affect system stability. Let’s review them as a whole.

To achieve system isolation, we set up an interface gateway between the public network service and the intranet service. Only through the gateway can the intranet interface service be accessed. During periods of high traffic, we employ the method of circuit breaking for intranet interface interactions to protect the intranet. All data required by the public network is pushed to the local cache of the mall through intranet scripts before the start of the event to ensure business operation.

At the same time, successful transactions and synchronous information from the public network are delivered to the intranet through a distributed, scalable, and replayable message queue, enabling the adjustment of consumer threads based on internal load to achieve controllable message consumption. Through this, we achieve synchronization and interaction between the two systems.

I have created a mind map of the key knowledge from this lesson for your reference:

Thought question #

What methods can be used to periodically check for unsynchronized data between two systems?

Feel free to discuss and exchange ideas with me in the comments section. See you next class!