
44 A Double 11 Rush-Buying Performance Bottleneck Tuning Retrospective #

Hello, I’m Liu Chao. Today, let’s talk about the Double 11 shopping rush. Because the scenario is complex, this lesson mainly takes stock of the performance bottlenecks that commonly appear in each part of the business and gives the corresponding optimization approaches. I won’t elaborate every optimization in full or dig deep into each implementation; you can ask questions in the comments based on what you have learned and accumulated in this column, and I will answer them one by one. Let’s get to the point.

Every year, Double 11 gives many development teams a headache. Because this occasion is special, companies usually prepare a large number of promotional activities, and the resulting bursts of momentary high-concurrency requests are a serious test for the system.

I still remember the first time our company’s online mall took part in Double 11. The discounts were generous and the purchase volume was huge: the TPS of the order submission interface peaked at 100,000. During the first wave of the rush, backend monitoring showed every server metric above 70%, the CPU constantly at 400% (on a 4-core machine), and database disk I/O constantly at 100%. Because the instantaneous log volume was enormous, our backend monitoring could not return real-time data for a while, and a stream of abnormal alerts began to appear.

An even more serious problem occurred during the second wave of the rush. Since the backend services had clearly been under heavy pressure during the first wave, we scaled them out horizontally. This did not relieve the pressure at all; on the contrary, the system crashed quickly during the second wave.

This event exposed many problems. First, the lack of rate limiting meant that an unexpectedly high request volume dragged the system down. Second, the feature that hands out purchase quotas through Redis-based distributed locks threw a large number of exceptions. Third, we misjudged the effect of horizontally scaling the services: the real bottleneck in the first wave was the database, so adding service instances only increased the pressure on the database and made things worse. Finally, when a service went down, we lost the business requests that were waiting to be processed asynchronously.

Next, using this case as a background, I will focus on explaining how to optimize performance bottlenecks in the shopping rush business.

Purchase Process #

Before discussing specific performance issues, let’s first understand a typical purchase process. This will help us better understand the performance bottlenecks and optimization process of a purchase system.

  • After logging in, the user enters the product details page. Before the sale starts, the page shows a countdown and the purchase button is disabled.
  • When the countdown ends, the user clicks to purchase and waits in a queue to obtain a purchase qualification. If no qualification is obtained, the purchase ends there for that user; otherwise, the user proceeds to the order submission page.
  • The user fills in the order information and clicks “Submit Order”. The system checks the inventory, creates the order, and puts the corresponding inventory into a locked state; the user then pays for the order.
  • After payment succeeds, the third-party payment platform issues a payment callback. Through the callback, the system updates the order status, deducts the actual stock in the database, and notifies the user that the purchase was successful.

(Figure: the typical flash-sale purchase flow described above)

Performance Bottlenecks in Flash Sale Systems #

After familiarizing ourselves with a typical flash sale business process, let’s take a look at the potential performance bottlenecks in a flash sale.

1. Product Details Page #

If you have ever participated in a flash sale, you may have encountered a situation where the product details page is almost impossible to open just before the sale starts.

This is because most users continuously refresh the flash sale product page before the sale starts, especially within the last minute of the countdown. The number of requests to view the product details page increases dramatically during this time. If the product details page is not optimized, it can easily become the first bottleneck in the entire flash sale system.

To address this issue, a usual approach is to pre-generate the entire flash sale product page as a static page and push it to CDN nodes. Additionally, the browser can cache the static resource files of the page. By using both CDN caching and local browser caching, we can optimize the performance of the product details page.
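
As a small illustration of the caching side, here is a minimal, assumed sketch of setting Cache-Control headers on the pre-generated static page so that CDN nodes and browsers can serve it without hitting the origin; the max-age values are illustrative only, not the original system’s settings.

import javax.servlet.http.HttpServletResponse;

public class StaticPageCacheHeaders {
    // Mark the pre-generated static detail page as cacheable by shared caches (CDN)
    // and browsers; the concrete values are assumptions for illustration.
    public void applyCacheHeaders(HttpServletResponse response) {
        response.setHeader("Cache-Control", "public, max-age=60, s-maxage=300");
    }
}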

2. Flash Sale Countdown #

In the product details page, there is a flash sale countdown timer that relies on the server-side time. The initialization time needs to be obtained from the server, and when a user clicks to purchase, the server needs to determine if the flash sale time has arrived.

If the product details page retrieved the latest time from the backend on every refresh, it would inevitably overload the backend server. Instead, we can initialize the timer with the client-side time and periodically re-synchronize it with the server, using a randomized refresh interval so that the synchronization requests are not concentrated. This also keeps users who keep refreshing the page from hammering the server’s time-synchronization interface.
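
Here is a minimal sketch of the jittered time-sync idea. It is written in Java to stay consistent with the other examples in this lesson, even though this logic normally lives in front-end JavaScript; the 30 to 60 second refresh window and the time endpoint are assumptions.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class CountdownSync {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private volatile long serverTimeOffsetMs = 0;   // server time minus local time

    public void scheduleNextSync() {
        // a random delay between 30 and 60 seconds spreads the sync requests out
        long delay = ThreadLocalRandom.current().nextLong(30, 61);
        scheduler.schedule(() -> {
            long serverNow = fetchServerTime();      // e.g. GET /api/time (assumed endpoint)
            serverTimeOffsetMs = serverNow - System.currentTimeMillis();
            scheduleNextSync();                      // re-schedule with a fresh random delay
        }, delay, TimeUnit.SECONDS);
    }

    public long currentServerTime() {
        return System.currentTimeMillis() + serverTimeOffsetMs;
    }

    private long fetchServerTime() {
        // placeholder for the HTTP call to the server's time interface
        return System.currentTimeMillis();
    }
}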

3. Getting Purchase Qualification #

You may wonder why there is a step to obtain purchase qualification even though the stock quantity already restricts users during the flash sale.

After entering the order submission page, users still need to fill in order information such as the shipping address and contact details, and during this step many of them hesitate or even abandon the purchase. If entering this page guaranteed a successful purchase, we could only let in exactly as many users as there is stock; once some of those users abandoned their orders, that stock might never be bought by anyone else, greatly reducing the flash sale’s sales volume.

By adding a step to obtain purchase qualification, we can allow a larger number of users than the available stock to enter the order submission page. This ensures a sufficiently large number of users to submit orders, maximizing the sales volume of the flash sale products.

The concurrency of obtaining purchase qualifications will be very high, and the control has to work across the whole distributed deployment. Typically, we can control the issuance of purchase qualifications with a Redis distributed lock.
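
Below is a minimal sketch of issuing qualifications behind a Redisson distributed lock. The key names, the quota counter, and the timeouts are assumptions for illustration, not the original system’s design.

import org.redisson.Redisson;
import org.redisson.api.RAtomicLong;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;
import java.util.concurrent.TimeUnit;

public class QualificationService {
    private final RedissonClient redisson;

    public QualificationService() {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        this.redisson = Redisson.create(config);
    }

    // Returns true if the user wins a purchase qualification.
    public boolean tryAcquireQualification(long skuId, long userId) throws InterruptedException {
        RLock lock = redisson.getLock("seckill:qualification:lock:" + skuId);
        // wait at most 50 ms for the lock; auto-release after 2 s in case the holder crashes
        if (!lock.tryLock(50, 2000, TimeUnit.MILLISECONDS)) {
            return false;
        }
        try {
            RAtomicLong quota = redisson.getAtomicLong("seckill:qualification:quota:" + skuId);
            if (quota.get() <= 0) {
                return false;            // all qualifications already handed out
            }
            quota.decrementAndGet();     // hand one qualification to this user
            return true;
        } finally {
            lock.unlock();
        }
    }
}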

4. Order Submission #

Due to the high volume of requests at the entrance of the flash sale, it may consume a lot of bandwidth. To avoid affecting the submission of orders, I recommend using a separate subdomain for order submission, distinct from the flash sale subdomain, and binding them to different network servers.

When a user clicks to submit an order, we first need to verify the stock. If the stock is sufficient, the purchased quantity is deducted from the cached stock before the order is generated. If both the stock check and the deduction went straight to the database, each operation would create high momentary concurrency, put pressure on the database, and become a performance bottleneck. Similar to obtaining the purchase qualification, we can optimize the stock deduction by using distributed locks.
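
As a variation on the lock-based approach, the check and the deduction can also be collapsed into a single atomic step on the cache. Below is a minimal sketch using a Lua script executed through Jedis; the key naming and the deduction amount are assumptions for illustration.

import redis.clients.jedis.Jedis;
import java.util.Collections;

public class StockService {
    // check the cached stock and deduct it in one atomic Redis call
    private static final String DEDUCT_SCRIPT =
        "local stock = tonumber(redis.call('GET', KEYS[1]) or '0') " +
        "if stock >= tonumber(ARGV[1]) then " +
        "  return redis.call('DECRBY', KEYS[1], ARGV[1]) " +
        "else return -1 end";

    public boolean deductCachedStock(Jedis jedis, long skuId, int quantity) {
        Object result = jedis.eval(DEDUCT_SCRIPT,
                Collections.singletonList("seckill:stock:" + skuId),
                Collections.singletonList(String.valueOf(quantity)));
        return ((Long) result) >= 0;   // -1 means insufficient stock, so the order is rejected
    }
}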

Since we have already cached the stock, querying and freezing the stock during order submission will not create a bottleneck for the database. However, after this step there is an order idempotency check. To keep system performance up, we can also use distributed locks to optimize this check.
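
For the idempotency check, here is a minimal sketch using Redisson’s RBucket.trySet, which has SET NX semantics; the order-token scheme and the TTL are assumptions for illustration.

import org.redisson.api.RBucket;
import org.redisson.api.RedissonClient;
import java.util.concurrent.TimeUnit;

public class OrderIdempotency {
    // Returns true only for the first submission carrying this token.
    public boolean firstSubmission(RedissonClient redisson, String orderToken) {
        RBucket<String> bucket = redisson.getBucket("order:token:" + orderToken);
        // trySet succeeds only if the key does not exist yet; duplicate submissions are rejected
        return bucket.trySet("1", 30, TimeUnit.MINUTES);
    }
}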

Order information is typically saved in a database table. In a single-table single-database scenario, when encountering a large number of requests, especially during a momentary high concurrency, disk I/O, database connection requests, bandwidth, and other resources may become performance bottlenecks. In such cases, we can consider sharding the order table to multiple databases to improve system concurrency.
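
As a rough illustration of the sharding idea, the sketch below routes an order to one of several databases and tables by user ID; the shard counts and the naming convention are assumptions, not the original system’s actual layout.

public class OrderShardRouter {
    private static final int DB_COUNT = 4;      // number of order databases (assumed)
    private static final int TABLE_COUNT = 8;   // order tables per database (assumed)

    // Route by user ID so that one user's orders always land in the same shard,
    // which keeps "my orders" queries on a single database.
    public static String routeOrderTable(long userId) {
        long db = userId % DB_COUNT;
        long table = (userId / DB_COUNT) % TABLE_COUNT;
        return "order_db_" + db + ".t_order_" + table;
    }
}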

5. Payment Callback Business Operations #

After a user completes the payment for an order, there is usually a callback from a third-party payment platform to update the order status.

In addition to updating the order status, the callback may also need to deduct stock in the database. If our stock queries and deductions are cache-based, order submission only deducts the cached stock; to reduce concurrent pressure on the database, we can defer the real database deduction to the payment callback, after the user has completed payment.

Furthermore, there may be a service to send SMS notifications for successful order purchases and some platforms may offer cumulative points services.

After the payment callback, we can use asynchronous submission to handle other business operations besides order status update, such as stock deduction, points accumulation, and SMS notifications. Typically, we can implement asynchronous submission of these operations using message queues (MQ).
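
Below is a minimal sketch of handing these follow-up operations to MQ after the callback, shown with a Kafka producer. The topic names and the payload are assumptions; the original system could equally use RocketMQ or RabbitMQ.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class PaymentCallbackHandler {
    private final KafkaProducer<String, String> producer;

    public PaymentCallbackHandler(Properties kafkaProps) {
        // kafkaProps must configure the bootstrap servers and key/value serializers
        this.producer = new KafkaProducer<>(kafkaProps);
    }

    public void onPaymentSuccess(String orderId) {
        // update the order status synchronously (not shown), then hand the rest to MQ
        producer.send(new ProducerRecord<>("stock-deduction", orderId, orderId));
        producer.send(new ProducerRecord<>("member-points", orderId, orderId));
        producer.send(new ProducerRecord<>("sms-notification", orderId, orderId));
    }
}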

Performance Bottleneck Optimization #

After understanding the potential performance bottlenecks in various business processes, let’s discuss further optimization for performance issues that may still arise after implementing conventional optimizations for the online mall.

1. Optimization of Flow Control Implementation #

Flow control is a commonly used fallback strategy. Whether for the countdown time-sync interface or the purchase entrance, the system should cap the maximum number of concurrent requests so that an unexpectedly concentrated burst of traffic cannot bring the system down.

Usually, we implement flow control for high-concurrency interfaces at the gateway layer. If we use Nginx as a reverse proxy, we can configure flow control directly in Nginx. Nginx’s request limiting is based on the leaky bucket algorithm, which smooths the rate at which requests are processed.

Nginx provides two flow-control modules: ngx_http_limit_conn_module and ngx_http_limit_req_module. The former limits the number of concurrent connections for a given key (here, a single client IP), while the latter limits the rate of requests for a given key. The following are example configurations for the two:

http {
    # limit each client IP (keyed by $binary_remote_addr) to at most 1 concurrent connection
    limit_conn_zone $binary_remote_addr zone=addr:10m;

    server {
        location / {
            limit_conn addr 1;
        }
    }
}

http {
    # allow each client IP at most 1 request per second
    limit_req_zone $binary_remote_addr zone=one:10m rate=1r/s;

    server {
        location / {
            # absorb bursts of up to 5 extra requests; anything beyond that is rejected
            limit_req zone=one burst=5 nodelay;
        }
    }
}

In addition to flow control at the gateway level, we can also implement interface-level flow control in the service layer, for example with Zuul RateLimit or Guava RateLimiter.
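
As an example of service-layer flow control, here is a minimal sketch using Guava’s RateLimiter; the permit rate and the rejection behavior are assumptions for illustration.

import com.google.common.util.concurrent.RateLimiter;

public class SubmitOrderController {
    // token-bucket style limiter shared by all request threads of this service instance
    private static final RateLimiter LIMITER = RateLimiter.create(2000.0);

    public String submitOrder(String orderRequest) {
        if (!LIMITER.tryAcquire()) {
            return "SYSTEM_BUSY";      // shed excess load instead of queueing it
        }
        // ... normal order submission path ...
        return "OK";
    }
}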

2. Traffic Shaping #

When a large number of requests enter the backend service system instantly, the first step is to acquire a purchase qualification through the Redis distributed lock. At this point, we may encounter a large number of “JedisConnectionException: Could not get connection from pool” exceptions.

This is a Redis connection exception. Our Redis cluster was deployed in sentinel mode, which is essentially a master-slave setup, so all of the lock operations went to the single master instance. Operating a single Redis instance under such high concurrency easily turns it into a performance bottleneck.

You might think of sharding the cluster. However, for distributed locks, cluster sharding only adds overhead, because we would then need Redisson’s RedLock algorithm, which requires attempting to lock every instance in the cluster.

Later, we replaced the Jedis client with Redisson. Jedis uses blocking read/write I/O and synchronous method calls, whereas Redisson is built on the Netty framework and offers non-blocking I/O and asynchronous method calls.

However, under extremely high instantaneous concurrency, similar problems can still occur. In that case, we can put a waiting queue in front of the distributed lock to spread out the burst of qualification requests, much like traffic shaping. A request’s key is placed into the queue and the requesting thread blocks; when the key is taken out of the queue, the thread is woken up to go on and acquire the purchase qualification.
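
A minimal sketch of this queue-based shaping follows: request threads block on a ticket in a bounded queue while a single worker drains the queue and performs the real lock acquisition. The queue size, timeouts, and class names are assumptions for illustration.

import java.util.concurrent.*;

public class QualificationGate {
    private final BlockingQueue<CompletableFuture<Boolean>> queue = new ArrayBlockingQueue<>(10_000);
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    public QualificationGate() {
        worker.submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                CompletableFuture<Boolean> ticket = queue.take();   // wake up one waiting request
                ticket.complete(acquireWithDistributedLock());      // the real Redis lock call runs here, serially
            }
            return null;
        });
    }

    // Called by request threads; returns false immediately if the queue is full (rate limiting).
    public boolean tryGetQualification(long timeout, TimeUnit unit) throws Exception {
        CompletableFuture<Boolean> ticket = new CompletableFuture<>();
        if (!queue.offer(ticket)) {
            return false;                 // overflow requests are rejected instead of hammering Redis
        }
        return ticket.get(timeout, unit); // block until the worker processes this request
    }

    private boolean acquireWithDistributedLock() {
        // placeholder: acquire the Redisson lock and decrement the remaining quota here
        return true;
    }
}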

3. Data Loss Issues #

Data can be lost at any point, whether because a service goes down or because an asynchronous message never reaches the message queue (MQ). For example, after the third-party payment callback has successfully written an order, if the application service crashes at that exact moment and the follow-up message has not yet been stored in MQ, that request data cannot be recovered even after we restart the service.

A retry mechanism is one way to recover lost messages. In the callback example above, we can write an asynchronous-message status record into the database together with the order, and only then return a success result to the third-party payment platform. Once the asynchronous business processing succeeds, we update that message’s status in the database table.

When we restart the service, the system queries the database for asynchronous messages whose status has not been updated. If any exist, it regenerates the corresponding MQ business messages so that the various business consumers can process the lost request data.
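
A minimal sketch of this recovery step is shown below, assuming a local message table named async_message and a Kafka producer; the table schema, topic name, and status values are all illustrative assumptions.

import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.springframework.jdbc.core.JdbcTemplate;

public class AsyncMessageRecovery {
    // Re-publish any asynchronous messages that were recorded in the local message table
    // but never confirmed as processed. Table and column names are assumed for illustration.
    public void republishPendingMessages(JdbcTemplate jdbc, KafkaProducer<String, String> producer) {
        List<Map<String, Object>> pending =
                jdbc.queryForList("SELECT id, payload FROM async_message WHERE status = 'PENDING'");
        for (Map<String, Object> row : pending) {
            String id = row.get("id").toString();
            String payload = String.valueOf(row.get("payload"));
            // consumers must be idempotent, because re-publishing may create duplicates
            producer.send(new ProducerRecord<>("order-async-topic", id, payload));
        }
    }
}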

Summary #

Reducing the number of database operations during the rush-buying process and shortening the rush-buying process are the core points of designing and optimizing the rush-buying system.

The performance bottleneck of a rush-buying system is mainly in the database. Even if we scale the services horizontally, the database still cannot respond to and process that many requests when traffic arrives instantaneously. We can shard the rush-buying tables across databases and tables to raise the database’s processing capacity and with it the system’s concurrent processing capability.

In addition, we can spread out momentary high-concurrency requests to flatten the traffic peak, which is the most commonly used approach. A queue lets requests wait in line and enter the backend service in an ordered, limited way before finally reaching the database. When the queue is full, overflowing requests are discarded, which is rate limiting. With rate limiting and peak shaving in place, we can effectively keep the system from crashing and maintain stability.

Reflection Question #

After submitting an order, the system enters the payment stage, where the inventory is frozen. Generally, we give users a certain amount of waiting time. However, this can easily lead to malicious users locking up inventory, causing users who have successfully ordered a product to be unable to pay for it. How do you think we should optimize the design of this business operation?

First of all, we need to address the issue of users maliciously locking inventory and preventing others from purchasing the product. One possible solution is to implement a time-limited reservation system. After a user submits an order, the inventory will be reserved for a certain period of time, such as 10 minutes. During this time, the inventory will be marked as reserved and cannot be locked by other users. However, if the user fails to complete the payment within the specified time, the inventory reservation will be automatically released and made available to other users. This can prevent malicious locking of inventory while ensuring a fair purchasing process.

Additionally, we can implement measures to detect and prevent suspicious or abnormal user behavior. For example, if a user repeatedly engages in malicious inventory locking or exhibits other suspicious activities, the system can automatically flag and investigate the user’s account. This can help deter potential malicious behaviors and ensure a better user experience for legitimate users.

Furthermore, clear communication and transparency can help optimize the design of this business operation. It is important to clearly communicate to users the reservation period and the consequences of not completing the payment within that time. Providing reminders and notifications to users during the reservation period can also help improve the conversion rate from order submission to successful payment.

In summary, optimizing the design of this business operation involves implementing a time-limited reservation system, detecting and preventing suspicious user behavior, and improving communication and transparency with users. By balancing the needs of preventing malicious inventory locking and ensuring a fair purchasing process, we can create a better user experience for all customers.