18 Traffic Splitting: How to Relieve Traffic Pressure Through Architectural Design #

Hello, I’m Xu Changlong.

Today, I will use live interaction as an example to show you how to handle traffic pressure in scenarios that are both read-heavy and write-heavy. Services like this are typically real-time interactive services with strict timeliness requirements, which often means we cannot relieve the pressure on core data with a read cache. To reduce the load on these interactive servers, we instead have to start from the architecture and apply some flexible splitting and design changes.

In practice, these designs are usually combined to provide the overall service. To help you understand them better, I will walk through concrete live-interaction scenarios. Broadly, live scenarios fall into two categories: those where the user volume is predictable and those where it is not. The designs for the two differ significantly, so let's look at each in turn.

Services with Predictable User Volume: Game Room Creation #

I believe many players of competitive games have had the experience of creating a room before playing online. The core of this design is a cap on the maximum number of rooms each server can open, which limits how many users a single server serves at the same time.

Let’s analyze how resource allocation is done from the server-side perspective. After creating a room, users can invite other players to join the game and compete against each other using the room ID. Both the room owner and the invited players will be assigned to the same service cluster for interaction by the scheduling service using the room’s identifier.

Here, let me point out that creating a room does not necessarily require active input from the game users. It can be set up so that a room is automatically allocated when the user starts the game. This approach not only allows for early estimation of user count but also facilitates better planning and control of our service resources.

How do we evaluate how many people can be online simultaneously on a server?

We can use load testing to determine how many users a single server can keep online at once, which lets us accurately estimate the required bandwidth and server resources. From this we can work out how many resources a cluster (made up of several servers) needs and how many concurrent interacting users it can hold. The scheduling service can then allocate resources accordingly and assign new room owners to idle service clusters.

The resulting implementation is shown in the diagram below:

Image

As shown in the diagram, during the room-creation phase the client first requests the scheduling service, which decides which regional server cluster it should enter. The scheduling server periodically receives online-user counts from the various server groups, which lets it judge how many users to assign to each regional cluster. Once the client receives the scheduling result, it uses the provided token to apply for room creation in the corresponding region.

After the room is created, the scheduling service maintains the room's listing and information for that local cluster and provides it to other players who want to join the game. Joining players connect to the room's regional server and interact with the room owner and the other players in real time.
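
To make the scheduling step more concrete, here is a minimal sketch in Go of how a scheduler might pick the regional cluster with the most spare capacity for a new room. The `Cluster` and `Scheduler` types and the capacity numbers are illustrative assumptions, not part of any specific implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is a hypothetical view of one regional server group,
// refreshed periodically from the clusters' own online-user reports.
type Cluster struct {
	Region   string
	Online   int // current online users
	Capacity int // capacity estimated from load testing
}

// Scheduler assigns new rooms to the cluster with the most spare capacity.
type Scheduler struct {
	clusters []*Cluster
	rooms    map[string]string // roomID -> region
}

func (s *Scheduler) CreateRoom(roomID string) (string, error) {
	var best *Cluster
	for _, c := range s.clusters {
		if c.Online >= c.Capacity {
			continue // cluster is full
		}
		if best == nil || c.Capacity-c.Online > best.Capacity-best.Online {
			best = c
		}
	}
	if best == nil {
		return "", errors.New("no idle cluster available")
	}
	best.Online++                 // reserve a slot for the room owner
	s.rooms[roomID] = best.Region // later joiners are routed here by roomID
	return best.Region, nil
}

func main() {
	s := &Scheduler{
		clusters: []*Cluster{
			{Region: "east", Online: 900, Capacity: 1000},
			{Region: "north", Online: 200, Capacity: 1000},
		},
		rooms: map[string]string{},
	}
	region, _ := s.CreateRoom("room-42")
	fmt.Println("room-42 assigned to", region) // -> north
}
```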

Not only is this room allocation design used in competitive games, but it is also employed in many other scenarios, such as interactive online classrooms. With this design, we can accurately control resources, ensuring that the number of users does not exceed our server’s design capacity.

Services with Unpredictable User Volume #

However, many scenarios are unpredictable, and we cannot control how many users will enter a given server to interact.

For a nationwide live stream, we cannot know in advance how many users will join. Because of this, many live streaming services first predict the user volume from the anchor's past performance and, based on that estimate, schedule the broadcast onto relatively idle server clusters in advance. They also prepare scheduling tools ahead of time, such as limiting exposure to delay user entry into the live room, which buys more time for the scheduler to scale up dynamically.

Since this type of service cannot predict its user count, the server-grouping approach described above does not apply; a higher level of scheduling is needed.

Let’s analyze the scenarios. For live streaming, common user interactions include chatting, answering questions, liking, giving rewards, and shopping. Considering the different characteristics of these forms, we will analyze them one by one based on their key points.

Chatting: Merging Messages #

Chat messages are generally short. To improve throughput, user chat content is usually placed in a distributed queue for transmission, which helps buffer the write pressure.

In addition, when many users repeatedly send the same content (for example, during waves of likes or identical chat messages), real-time big-data analysis can be used to examine users' input, compress and consolidate the large amount of repetitive content, and filter out useless messages.

Image

The compressed and consolidated chat content is then distributed to multiple chat-content delivery servers. Users' long connections in the live room receive a push notification that new messages are available, after which the client bulk-fetches the data from the designated delivery server cluster and replays it in chronological order. Note that this method only suits rooms with intense activity; if the user volume is small, real-time interaction over the long connection is sufficient.
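
As a rough illustration of the merging idea (my own sketch, not any platform's actual pipeline), the Go snippet below collapses identical messages received within one aggregation window into a single "content xN" entry before they would be handed to the delivery servers:

```go
package main

import "fmt"

// mergeWindow collapses duplicate chat messages from one aggregation
// window (e.g. one second of queue output) into "content xN" entries,
// preserving the order in which each distinct message first appeared.
func mergeWindow(messages []string) []string {
	counts := map[string]int{}
	order := []string{}
	for _, m := range messages {
		if counts[m] == 0 {
			order = append(order, m)
		}
		counts[m]++
	}
	merged := make([]string, 0, len(order))
	for _, m := range order {
		if counts[m] > 1 {
			merged = append(merged, fmt.Sprintf("%s x%d", m, counts[m]))
		} else {
			merged = append(merged, m)
		}
	}
	return merged
}

func main() {
	window := []string{"666", "666", "great class!", "666", "great class!"}
	fmt.Println(mergeWindow(window)) // [666 x3 great class! x2]
}
```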

Answering Questions: Instantaneous Peak in Data Retrieval #

In addition to high-traffic interactions like chat, there are also some special interactions, such as answering questions. The teacher in a live room sends out a question, and the question message is broadcast to all users. After receiving it, each client retrieves the question data from the server.

If 100,000 users are online, it is very likely that all 100,000 will try to retrieve the question data at nearly the same instant. Handling that burst requires a significant investment in servers and bandwidth, yet the cost-effectiveness is poor.

In theory, we can make the question data static and absorb this traffic with a CDN. Even so, to avoid a sudden peak, it is recommended that the client add a random delay of a few seconds before sending the retrieval request. This spreads out the server load considerably and still provides a good user experience.
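
For example, here is a minimal client-side sketch in Go of adding a random delay of a few seconds before the request goes out; the function name and the 5-second upper bound are assumptions for illustration:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// fetchQuestionWithJitter waits a random 0-5 seconds before fetching,
// so 100,000 clients do not hit the CDN/origin at the same instant.
func fetchQuestionWithJitter(questionID string) {
	delay := time.Duration(rand.Intn(5000)) * time.Millisecond
	time.Sleep(delay)
	// In a real client this would be an HTTP GET to the static question URL.
	fmt.Printf("fetching question %s after %v\n", questionID, delay)
}

func main() {
	fetchQuestionWithJitter("q-1001")
}
```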

Bear in mind that if this retrieval fails, the client should not retry aggressively, as that may overwhelm the server. If retries are necessary, it is recommended to use an annealing-style (backoff) algorithm to control their timing, so that a temporary malfunction does not trigger a flood of requests that crashes the server.
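
The retry control mentioned here can be sketched as exponential backoff with random jitter; the code below is an illustrative version of that idea rather than any specific library's API:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries fn with exponentially growing waits plus random
// jitter, so a temporarily failing server is not flooded with retries.
func retryWithBackoff(fn func() error, maxAttempts int) error {
	wait := 500 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err := fn(); err == nil {
			return nil
		}
		if attempt == maxAttempts {
			break
		}
		jitter := time.Duration(rand.Int63n(int64(wait)))
		time.Sleep(wait + jitter)
		wait *= 2 // back off further on each failure
	}
	return errors.New("all retries failed")
}

func main() {
	attempts := 0
	err := retryWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("temporary failure")
		}
		return nil
	}, 5)
	fmt.Println("attempts:", attempts, "err:", err)
}
```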

For a live-teaching scenario, there are two further techniques to relieve server load. The first is to have clients pre-download the day's quiz questions before class, which reduces the load from real-time retrieval.

The second technique applies to quick-answer questions: when the teacher releases a question, it is marked to take effect 5 seconds later, and only then is it displayed. This gives every user in the room time to receive the question first, so everyone sees it "at the same moment" rather than at inconsistent times.

For questions that are not formal exams, after the user answers, the client can grade the answer locally first, showing the correct answer and explanation immediately, and then submit the user's results to the server asynchronously and gradually over the course of the live stream. This guarantees the server is not overwhelmed by a momentary burst of user traffic.

Liking: Merging Client Interactions #

For the liking scenario, let me explain from the perspectives of the client and server.

Let's first look at the client side. In many cases, the client does not need to submit every user interaction in real time, because many interactions are mechanical, repetitive actions that do not have strict real-time requirements.

For example, if a user taps like 100 times in a short period, the client can merge these interactions into batched messages (for example, "liked 10 times within the last 3 seconds"). I'm sure you can apply this interaction-merging technique to more scenarios, such as when a user sends 100 gifts in a row.

By using this method, server load can be significantly reduced while the liveliness of the room is preserved and a lot of network resources are saved, a win-win.
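
Here is a minimal sketch of such client-side merging in Go, assuming a 3-second flush window as in the example above; the `LikeBuffer` type is hypothetical:

```go
package main

import (
	"fmt"
	"time"
)

// LikeBuffer accumulates taps locally and reports them as a single
// "liked N times" message once per flush window (3 seconds here).
type LikeBuffer struct {
	count chan int
}

func NewLikeBuffer(window time.Duration, send func(n int)) *LikeBuffer {
	b := &LikeBuffer{count: make(chan int, 1024)}
	go func() {
		ticker := time.NewTicker(window)
		defer ticker.Stop()
		total := 0
		for {
			select {
			case n := <-b.count:
				total += n
			case <-ticker.C:
				if total > 0 {
					send(total) // one merged message instead of N requests
					total = 0
				}
			}
		}
	}()
	return b
}

func (b *LikeBuffer) Like() { b.count <- 1 }

func main() {
	buf := NewLikeBuffer(3*time.Second, func(n int) {
		fmt.Printf("user liked %d times within the window\n", n)
	})
	for i := 0; i < 100; i++ {
		buf.Like()
	}
	time.Sleep(4 * time.Second) // wait for the flush in this demo
}
```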

Likes: Server-Side Tree-like Multi-layer Aggregation Architecture #

Now let's look at how to design the server side to relieve request pressure in the like scenario.

If our cluster's QPS exceeds 100,000 and the data layer can no longer withstand the pressure, how do we handle such highly concurrent reads and writes? Weibo built a similar system to absorb the traffic from users' like requests. The approach suits counters with low consistency requirements, as shown in the diagram below:

Image

This approach randomly distributes users' like traffic across multiple write-cache services. Each first-level write cache sums the requests locally in real time, absorbing the bulk of user requests, and periodically submits its aggregated updates to the second-level write cache.

Once a second-level cache has collected from all of the first-level services in its shard, the summarized value is synchronized to the core cache service, where the final result is accumulated. The result is then replicated to multiple query node services through master-slave replication to serve user queries.
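
To make the first-level write cache concrete, here is a rough sketch (with hypothetical names, not Weibo's actual code) of a service that sums like increments locally and periodically flushes the partial totals to the next layer:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// WriteCache is one node in the first aggregation layer: it sums like
// increments per item in memory and periodically pushes the partial sums
// upstream (to a second-level cache or the core cache).
type WriteCache struct {
	mu     sync.Mutex
	counts map[string]int64
	flush  func(partial map[string]int64)
}

func NewWriteCache(interval time.Duration, flush func(map[string]int64)) *WriteCache {
	c := &WriteCache{counts: map[string]int64{}, flush: flush}
	go func() {
		for range time.Tick(interval) {
			c.mu.Lock()
			partial := c.counts
			c.counts = map[string]int64{} // start a fresh window
			c.mu.Unlock()
			if len(partial) > 0 {
				flush(partial) // one upstream write per window, not per like
			}
		}
	}()
	return c
}

func (c *WriteCache) AddLike(itemID string, n int64) {
	c.mu.Lock()
	c.counts[itemID] += n
	c.mu.Unlock()
}

func main() {
	cache := NewWriteCache(time.Second, func(partial map[string]int64) {
		fmt.Println("flushing partial sums upstream:", partial)
	})
	for i := 0; i < 10000; i++ {
		cache.AddLike("post-1", 1)
	}
	time.Sleep(1500 * time.Millisecond)
}
```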

As a brief digression: Weibo is a heavy user of Redis, but as the like data grew, caching like counts in Redis wasted a lot of memory (you can review the discussion of the jemalloc allocator in the previous lesson), so they implemented their own like-counting service to save memory.

Rewards & Shopping: Server-Side Sharding and Real-time Sharding Expansion #

In the scenarios of rewards and shopping, we need to provide transactional consistency services for inventory and amount.

Because transactional consistency is required, we cannot serve these requests through a multi-layer buffering scheme. Moreover, such services are both read-heavy and write-heavy. We can instead implement them through data sharding, as shown in the diagram below:

Image

The diagram is easy to follow: we partition the data by user ID using a hash function. A gateway routes different user IDs to different shard services according to the range obtained by taking the modulus of the user's UID, and the service in each shard then performs the real-time calculations and updates for those requests in memory.
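
A minimal sketch of the gateway's routing rule, assuming a fixed shard count; the hash function and shard addresses are illustrative:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor maps a UID to one of shardCount shard services; every request
// for the same user always lands on the same shard, which keeps that
// user's inventory/balance updates on a single node's memory.
func shardFor(uid string, shardCount int) int {
	h := fnv.New32a()
	h.Write([]byte(uid))
	return int(h.Sum32() % uint32(shardCount))
}

func main() {
	shards := []string{"shard-0:9000", "shard-1:9000", "shard-2:9000", "shard-3:9000"}
	for _, uid := range []string{"1001", "1002", "1003"} {
		fmt.Printf("uid %s -> %s\n", uid, shards[shardFor(uid, len(shards))])
	}
}
```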

This makes load splitting easy and convenient to implement. The disadvantage is that hash allocation can easily create individual hotspots, and when a shard can no longer handle its traffic, we need to scale out.

But with this modulo approach, if an individual server fails or the cluster is resized, the hash mapping changes and requests end up on the wrong shard. There are many solutions to this, such as consistent hashing, which allows local expansion without affecting the whole cluster; however, it is not universally applicable and is difficult to control manually, so it needs supporting tooling.

In addition to this, I recommend another method called the tree-like hot migration slicing method, which is similar to virtual buckets.

For example, we divide the complete data set into 256 parts, each of which is a bucket, with 16 servers each handling 16 buckets. When one server becomes overloaded, we can add two new servers that subscribe to it as master-slave replicas and synchronize the 16 buckets of data held by the overloaded server, migrating that data off it.

Once synchronization and migration succeed, the incoming traffic for this server is split and forwarded to the two subscribing servers, each of which now serves 8 of the buckets. The original server can then be taken offline and reclaimed.

Because the switch is a full migration, each of the two new services also ends up holding 8 buckets of data that do not belong to it. Each new server can then iterate over its stored data and delete what it does not own; alternatively, that data can be filtered out while synchronizing from the 16-bucket server. This method works for any stateful sharded data service, such as Redis or MySQL.
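
The bucket-based routing can be sketched as a table from bucket ID to server that the proxy consults; splitting a hot server then only means rewriting the table entries for its buckets once the new replicas are in sync. The 256-bucket layout below follows the example in the text, but the code itself is an illustrative assumption:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const bucketCount = 256

// BucketTable maps each of the 256 virtual buckets to a server address.
// The proxy looks up the bucket for a key and forwards the request there.
type BucketTable struct {
	servers [bucketCount]string
}

func (t *BucketTable) ServerFor(key string) string {
	h := fnv.New32a()
	h.Write([]byte(key))
	return t.servers[h.Sum32()%bucketCount]
}

// SplitServer reassigns the buckets currently owned by `from` to two new
// servers (each taking half), after their data has been synchronized
// from `from` via master-slave replication.
func (t *BucketTable) SplitServer(from, newA, newB string) {
	toggle := false
	for i, s := range t.servers {
		if s == from {
			if toggle {
				t.servers[i] = newA
			} else {
				t.servers[i] = newB
			}
			toggle = !toggle
		}
	}
}

func main() {
	t := &BucketTable{}
	// Initially 16 servers handle 16 buckets each.
	for i := 0; i < bucketCount; i++ {
		t.servers[i] = fmt.Sprintf("server-%02d", i/16)
	}
	fmt.Println("bucket 48 owned by", t.servers[48]) // server-03
	// server-03 is overloaded: hand its 16 buckets to two 8-bucket servers.
	t.SplitServer("server-03", "server-16", "server-17")
	fmt.Println("bucket 48 now owned by", t.servers[48]) // server-16 or server-17
}
```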

The key point of this approach is that the client never requests a shard directly; requests go through a proxy service instead. Only with the proxy in between can we dynamically update the routing and schedule traffic, achieving smooth and seamless forwarding.

Finally, how do we let the client know which shard to request in order to find the data? I will share two common methods with you:

The first method is for the client itself to compute the shard. For example: hash(uid) % 100 = bucket ID, and the bucket ID is then looked up in the configuration to find the corresponding shard.

The second method is for the data server that receives the request to forward it to the shard where the data resides. For example, the client requests shard A, which determines from the data algorithm and shard configuration that the data belongs to shard B; shard A forwards the request to shard B, and once B finishes processing, the result is returned to the client (whether A or B returns it depends on whether the client follows a redirect or the server forwards the request).
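
Method one can be sketched as a simple client-side lookup against a bucket-to-shard configuration; the configuration layout and shard names below are hypothetical:

```go
package main

import "fmt"

// buildConfig is a stand-in for the bucket-to-shard configuration the
// client would download from a config service; here buckets are spread
// round-robin over four hypothetical shard addresses.
func buildConfig(shards []string, buckets int) map[int]string {
	cfg := make(map[int]string, buckets)
	for b := 0; b < buckets; b++ {
		cfg[b] = shards[b%len(shards)]
	}
	return cfg
}

// shardForUID implements method one from the text: hash(uid) % 100 gives
// the bucket ID, and the bucket ID is looked up in the configuration.
func shardForUID(uid uint64, cfg map[int]string) string {
	bucket := int(uid % 100)
	return cfg[bucket]
}

func main() {
	cfg := buildConfig([]string{"shard-a", "shard-b", "shard-c", "shard-d"}, 100)
	fmt.Println("uid 987654321 ->", shardForUID(987654321, cfg))
}
```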

Service Degradation: Distributed Queue Consolidation Buffer #

Even with so many optimizations in place, our service still cannot handle excessively high spikes in traffic.

In such situations, we can degrade the service by merging updates or throttling at the gateway through a queue. Although this sacrifices some real-time behavior, many metrics are in reality not as important as we imagine. Take Weibo's like counts: if the client cannot reach the server to submit a like, the data is stored temporarily on the client side, and when users view the count they see a slightly stale figure rather than a real-time one.

The difference between 105,000 likes and 103,000 likes is not significant. When the server becomes idle later, the counts catch up and eventually become consistent. As a degradation strategy, this saves a significant amount of server resources, which makes it a worthwhile method.
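
As a rough sketch of queue-based throttling at a gateway (illustrative only): writes are accepted into a bounded queue immediately and drained to the backend at a fixed rate, so spikes are flattened at the cost of delayed, eventually consistent counts.

```go
package main

import (
	"fmt"
	"time"
)

// throttledWriter accepts like updates into a bounded queue and drains
// them to the backend at a fixed rate, flattening traffic spikes at the
// cost of delayed (eventually consistent) counts.
type throttledWriter struct {
	queue chan string
}

func newThrottledWriter(rate time.Duration, apply func(item string)) *throttledWriter {
	w := &throttledWriter{queue: make(chan string, 100000)}
	go func() {
		for item := range w.queue {
			apply(item)
			time.Sleep(rate) // drain no faster than the backend can handle
		}
	}()
	return w
}

// Submit returns false when the buffer is full; the caller can then
// degrade further, e.g. keep the count on the client and retry later.
func (w *throttledWriter) Submit(item string) bool {
	select {
	case w.queue <- item:
		return true
	default:
		return false
	}
}

func main() {
	w := newThrottledWriter(10*time.Millisecond, func(item string) {
		fmt.Println("applied", item)
	})
	for i := 0; i < 5; i++ {
		w.Submit(fmt.Sprintf("like:post-1:user-%d", i))
	}
	time.Sleep(100 * time.Millisecond)
}
```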

Conclusion #

In this lesson, we learned how to mitigate traffic impact through architecture and design. Different scenarios require different techniques for splitting traffic.

We looked at managing user resources through room allocation, splitting and buffering the flood of broadcast interactions, smoothing out peaks when clients retrieve quiz questions, merging repeated like actions on the client, absorbing like traffic with a tree-like multi-layer aggregation of services, and using sharding plus scheduling for scenarios that require strong consistency.

Since different scenarios have different requirements for consistency, the resulting designs also vary.

To achieve a dynamically scalable high-concurrency live streaming system, we also need solid infrastructure support, including the following:

  • Distributed services: distributed queues, real-time computing, distributed storage.
  • Dynamic containers: server scheduling system, automated operations and maintenance, periodic stress testing, dynamically scalable services using Kubernetes.
  • Scheduling services: using services such as HttpDNS to temporarily adjust user traffic and dynamically allocate resources.

Thought Question #

Since CDNs can cache our static data, how do they recognize when our local static data is updated?

Feel free to discuss and exchange ideas with me in the comments section. See you in the next class!