40 Information Stream Design II - How to Do the Pull Model in a General Information Stream System #

Hello, I’m Tang Yang.

In the previous lesson, I introduced how to implement an information flow system using the push mode, and you saw the problems the push mode runs into: push delay, high storage cost, and poor scalability when a user has a very large number of followers. We can take measures to mitigate these issues, such as choosing a high-performance database storage engine to speed up writes and periodically deleting cold data to reduce storage cost, but for users with huge follower counts these measures only alleviate the delay in pushing their Weibo; they do not eliminate it.

You may now be wondering: is there a solution that solves this problem once and for all? Of course there is. You can try implementing the Weibo information flow system with the pull mode. So how exactly do we do that?

How to Design an Information Flow System Using the Pull Model #

The so-called pull model means that the information flow is generated by having the user actively pull the microblogs of the people they follow, then sorting and aggregating them in reverse chronological order.

When you implement a microblog information flow system this way, you will find that the user's inbox is no longer needed: the information flow data no longer comes from the user's inbox, but from aggregating the outboxes of everyone the user follows. So when a user posts a microblog, it only has to be written to their own outbox; there is no need to push it into every follower's inbox. And when a user retrieves their information flow, we query the outboxes of the people they follow.

Let me express this logic in SQL form to make it easier for you to understand. Suppose User A follows Users B, C, and D. When User B sends a microblog, they perform the following operation:

insert into outbox(userId, feedId, create_time) values('B', $feedId, $current_time); -- Write to B's outbox

When User A wants to retrieve their information flow, they need to aggregate the contents of the outboxes of Users B, C, and D:

select feedId from outbox where userId in (select userId from follower where fanId = 'A') order by create_time desc

As you can see, the implementation of the pull model is not complicated, and it has several obvious advantages compared to the push model.

Firstly, the pull model completely solves the problem of push delay. High-profile users no longer need to push their microblogs to the followers’ inboxes, so there is no longer a delay issue.

Secondly, storage cost is greatly reduced. In the push model, if a celebrity has 120 million followers, their microblog needs to be replicated 120 million times and written to the storage system. In the pull model, only the outbox needs to be maintained, and the microblog data no longer needs to be replicated, thus reducing the cost.

Finally, it provides better scalability. For example, suppose the microblog platform adds a grouping feature and you put Users A and B into a separate group: the microblogs posted by Users A and B then form a new information flow. How can this information flow be implemented? It is simple: query all the users in this group (here, Users A and B), then query those users' outboxes, and finally sort and aggregate the outbox data in reverse chronological order:

List<Long> uids = getFromGroup(groupId); // Get all the users in the group
List<List<Long>> ids = new ArrayList<List<Long>>();
for (Long id : uids) {
  ids.add(getOutboxByUid(id)); // Get the content ID list from each user's outbox
}
return merge(ids); // Merge all the ID lists and sort them in reverse chronological order
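The merge step above is left abstract in the snippet. As a minimal sketch (not the course's actual implementation), assuming the feed IDs come from a time-ordered ID generator so that a larger ID means a newer microblog, and that each outbox list is already sorted newest-first, merge can be a standard k-way merge:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class FeedMerger {

  // Merge several outbox ID lists (each sorted newest-first) into one
  // newest-first list, assuming larger IDs were generated later.
  public static List<Long> merge(List<List<Long>> ids) {
    // Each heap entry is {listIndex, offsetInList}; the head is the list
    // whose current element is the newest (largest) ID.
    PriorityQueue<int[]> heap = new PriorityQueue<>(
        Comparator.comparingLong((int[] e) -> ids.get(e[0]).get(e[1])).reversed());
    for (int i = 0; i < ids.size(); i++) {
      if (!ids.get(i).isEmpty()) {
        heap.offer(new int[]{i, 0});
      }
    }
    List<Long> result = new ArrayList<>();
    while (!heap.isEmpty()) {
      int[] top = heap.poll();
      List<Long> list = ids.get(top[0]);
      result.add(list.get(top[1]));
      if (top[1] + 1 < list.size()) {
        heap.offer(new int[]{top[0], top[1] + 1});
      }
    }
    return result;
  }
}

In practice you would usually also cap the result at one page of the feed, so the merge can stop early once enough IDs have been collected.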

Because the business already caps the number of accounts a user can follow, the pull model can indeed solve the problems of the push model. Still, it is not a flawless solution. In my opinion, the pull model has two main issues.

Firstly, unlike the push model, where retrieving the information flow only requires reading the inbox, the pull model has to aggregate data from many outboxes, so the cost of querying and aggregating is much higher. On the microblog platform the maximum number of followings is 2,000; if a user follows the full 2,000 accounts, generating their feed means querying the microblogs posted by all 2,000 accounts and then aggregating the results.

So how can we make sure these queries and the aggregation complete within milliseconds? The answer is caching: we can store the IDs of the microblogs each user posts in the cache. However, caching every microblog of every user would drive hardware costs very high, so we need to analyze users' browsing behavior to see whether the cache storage cost can be optimized.

In practice, our analysis of users' browsing behavior showed that 97% of users only read microblogs posted in the last 5 days; users rarely scroll back to anything older. Therefore we only cache the IDs of the microblogs each user posted in the last 5 days. Suppose we deploy 6 cache nodes to store these IDs: each aggregation queries the cache nodes in parallel for the followed users' microblog IDs and sorts them in the application server's memory. That is at most 6 parallel cache requests per feed read, so the result can be returned within 5 milliseconds.
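To make the "query in parallel, then sort in memory" step concrete, here is a rough sketch of what the aggregation on the application server could look like. The cache client is represented by a plain function, and the thread pool size and page-size handling are assumptions for illustration, not the course's actual code:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PullFeedAggregator {

  // Hypothetical cache lookup: given a followed user's ID, return the IDs of
  // the microblogs that user posted in the last 5 days, newest first.
  private final Function<Long, List<Long>> recentOutboxCache;
  private final ExecutorService pool = Executors.newFixedThreadPool(16);

  public PullFeedAggregator(Function<Long, List<Long>> recentOutboxCache) {
    this.recentOutboxCache = recentOutboxCache;
  }

  // Build one page of the feed by querying all followees' cached IDs in
  // parallel and sorting the merged result in application memory.
  public List<Long> buildFeed(List<Long> followeeIds, int pageSize) {
    List<CompletableFuture<List<Long>>> futures = followeeIds.stream()
        .map(uid -> CompletableFuture.supplyAsync(() -> recentOutboxCache.apply(uid), pool))
        .collect(Collectors.toList());

    List<Long> all = new ArrayList<>();
    for (CompletableFuture<List<Long>> f : futures) {
      all.addAll(f.join()); // wait for each parallel cache query
    }
    // Assuming time-ordered IDs, sorting by ID descending gives newest first.
    all.sort(Comparator.reverseOrder());
    return all.subList(0, Math.min(pageSize, all.size()));
  }
}

The thread pool here only fans the cache lookups out in parallel; in a real system the cache client would usually offer a batched multi-get, which is simpler and cheaper than one request per followed user.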

Secondly, the bandwidth cost on the cache nodes is relatively high. Suppose the information flow receives 100,000 read requests per second; since every request fans out across the cache nodes, each of the 6 nodes is queried 100,000 times per second. With an average of 90 followings per user, each node has to return the data of about 90 / 6 = 15 followed users per request. If each person posts on average 2 microblogs per day, that is 10 microblogs in 5 days, so 15 users correspond to 150 microblog IDs. At 8 bytes per ID, 150 IDs come to roughly 1 KB of data per request, so a single cache node's outbound bandwidth is about 1 KB * 100,000 = 100 MB/s, which basically saturates the machine's network card. So how can we optimize the cache bandwidth?

As I mentioned in Lecture 14, deploying multiple cache replicas improves cache availability, and replicas can also share the bandwidth load. Once replicas are deployed, a request first queries a replica, and only requests that miss the replica go on to the main cache. Suppose the original cache carried 100 MB/s of bandwidth and we deploy four replica groups, each with a 60% hit rate: the main cache's bandwidth drops to 100 MB/s * 40% = 40 MB/s, and each replica group carries 100 MB/s / 4 = 25 MB/s. This way, the bandwidth of every cache group falls within an acceptable range.
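As a sketch of how the replica read path could work (plain maps stand in for the cache nodes here; this illustrates the idea rather than any specific cache library's API): each read picks one replica group at random, and on a miss it falls back to the main cache and back-fills the replica.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

public class ReplicatedCacheReader {

  // Hypothetical cache handles: each maps a key to the cached value, or null on a miss.
  private final List<Map<String, List<Long>>> replicas;
  private final Map<String, List<Long>> primary;
  private final Function<String, List<Long>> loadFromDb; // fallback when even the main cache misses

  public ReplicatedCacheReader(List<Map<String, List<Long>>> replicas,
                               Map<String, List<Long>> primary,
                               Function<String, List<Long>> loadFromDb) {
    this.replicas = replicas;
    this.primary = primary;
    this.loadFromDb = loadFromDb;
  }

  public List<Long> get(String key) {
    // Spread read traffic across the replica groups.
    Map<String, List<Long>> replica = replicas.get(ThreadLocalRandom.current().nextInt(replicas.size()));
    List<Long> value = replica.get(key);
    if (value != null) {
      return value; // replica hit: the main cache sees none of this traffic
    }
    value = primary.get(key);
    if (value == null) {
      value = loadFromDb.apply(key); // miss everywhere: load from storage
      primary.put(key, value);
    }
    replica.put(key, value); // back-fill the replica so later reads hit it
    return value;
  }
}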

After optimizing the above aspects, the basic design of an information flow system based on the pull model is almost complete. You can refer to this solution when designing your own information flow system. Additionally, using cache replicas to handle traffic is a common cache design approach that you can also consider using when necessary in your projects.

What Is the Solution for Combining Push and Pull Modes? #

However, some students may say: I have already implemented an information flow system based on the push mode during the initial stage of system construction. It would be too costly to rebuild the system using the pull mode. Is there a compromise solution based on the push mode?

Actually, when I was working at NetEase Weibo, the information flow of NetEase Weibo was implemented based on the push mode. After the number of followers for a user increased significantly, we carried out modifications to the original system and implemented a solution that combines both push and pull modes, which can essentially solve the problems of the push mode. So, how did we do it?

The core of the solution is that when a major V user posts a Weibo, it is no longer pushed to all of their followers, but only to their active followers. There are several key points to get right when implementing this.

Firstly, how do we decide who counts as a major V user? In other words, whose Weibo should be pushed to all followers when posted, and whose should only be pushed to active followers? In my opinion, follower count is the right criterion: for example, a user with more than 500,000 followers can be treated as a major V whose posts are pushed only to active followers.
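A minimal sketch of this routing decision at post time could look as follows; the 500,000 threshold comes from the text above, while the lookup functions and their names are placeholders I'm assuming for illustration:

import java.util.List;
import java.util.function.Function;

public class PushRouter {

  private static final long MAJOR_V_THRESHOLD = 500_000L;

  private final Function<Long, Long> followerCountOf;          // userId -> follower count
  private final Function<Long, List<Long>> allFollowersOf;     // userId -> all follower IDs
  private final Function<Long, List<Long>> activeFollowersOf;  // userId -> active follower IDs

  public PushRouter(Function<Long, Long> followerCountOf,
                    Function<Long, List<Long>> allFollowersOf,
                    Function<Long, List<Long>> activeFollowersOf) {
    this.followerCountOf = followerCountOf;
    this.allFollowersOf = allFollowersOf;
    this.activeFollowersOf = activeFollowersOf;
  }

  // Decide which inboxes a newly posted Weibo should be pushed to.
  public List<Long> pushTargets(long authorId) {
    if (followerCountOf.apply(authorId) > MAJOR_V_THRESHOLD) {
      return activeFollowersOf.apply(authorId); // major V: push to active followers only
    }
    return allFollowersOf.apply(authorId);      // ordinary user: push to every follower
  }
}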

Secondly, how do we mark active users? An active user can be defined as one who has performed some action on Weibo in the past few days, such as refreshing the information flow, posting a Weibo, forwarding or commenting on a Weibo, or following someone. Once a user performs any of these actions, we mark him or her as active.

For each major V user, we maintain an active-follower list that contains the active users we have marked. When a user transitions from the inactive state to the active state, we check which of the accounts that user follows are major V users and add the user to those major Vs' active-follower lists. The list has a fixed length; if the number of active followers exceeds it, the earliest-added followers are evicted from the list. This keeps the push fan-out efficient.
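This bookkeeping, marking a user active and inserting them into each followed major V's fixed-length active-follower list, could be sketched roughly as follows. The list length and the names are assumptions for illustration; a production system would keep these lists in a shared store rather than in process memory:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ActiveFollowerTracker {

  private static final int MAX_ACTIVE_FOLLOWERS = 100_000; // assumed fixed list length

  // major V userId -> fixed-length list of recently active follower IDs (newest at the head)
  private final Map<Long, Deque<Long>> activeFollowers = new ConcurrentHashMap<>();

  // Called when a user performs an action (refreshing the feed, posting,
  // forwarding, commenting, following) and transitions from inactive to active.
  public void onUserBecameActive(long userId, List<Long> majorVsTheUserFollows) {
    for (Long majorV : majorVsTheUserFollows) {
      Deque<Long> list = activeFollowers.computeIfAbsent(majorV, k -> new ArrayDeque<>());
      synchronized (list) {
        list.remove(userId);      // avoid duplicates if the user is already listed
        list.addFirst(userId);
        if (list.size() > MAX_ACTIVE_FOLLOWERS) {
          list.removeLast();      // evict the earliest-added follower
        }
      }
    }
  }
}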

Finally, when a user has been evicted from the active-follower lists, or has just transitioned from inactive back to active, they were not in the major V users' active-follower lists during that time and therefore did not receive those Weibo pushes in real time. We need to asynchronously write the major V users' recently posted Weibo into that user's inbox to keep their information flow complete.
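The backfill can run as an asynchronous task when a user re-enters the active set: pull the recent microblog IDs from each followed major V's outbox and append them to the user's inbox. A rough sketch, with the storage accesses passed in as hypothetical functions:

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiConsumer;
import java.util.function.Function;

public class InboxBackfiller {

  private final ExecutorService pool = Executors.newFixedThreadPool(4);
  private final Function<Long, List<Long>> recentOutboxOf;  // major V userId -> recently posted microblog IDs
  private final BiConsumer<Long, List<Long>> appendToInbox; // (userId, feedIds) -> write into the user's inbox

  public InboxBackfiller(Function<Long, List<Long>> recentOutboxOf,
                         BiConsumer<Long, List<Long>> appendToInbox) {
    this.recentOutboxOf = recentOutboxOf;
    this.appendToInbox = appendToInbox;
  }

  // Asynchronously copy each major V's recent microblogs into the user's inbox
  // so that the feed is complete when the user starts reading again.
  public CompletableFuture<Void> backfill(long userId, List<Long> majorVsTheUserFollows) {
    return CompletableFuture.runAsync(() -> {
      for (Long majorV : majorVsTheUserFollows) {
        appendToInbox.accept(userId, recentOutboxOf.apply(majorV));
      }
    }, pool);
  }
}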

Combining push and pull can make up for the shortcomings of the push mode to a large extent, but it also brings extra maintenance costs: the system has to track users' online status and maintain an additional set of active-follower data, which increases the storage cost.

Therefore, this method is generally suitable for medium-sized projects. When the number of followers is around one million and the number of active followers is around 100,000, it is possible to achieve relatively low information dissemination delay and information flow acquisition delay. However, as the number of followers continues to increase and traffic keeps rising, both the storage of active followers and the delay in push will become bottlenecks. Therefore, it is better to switch to the pull mode to support the business.

Summary of the Course #

That’s all for this lesson. In this lesson, I have introduced a solution for implementing an information flow system based on the pull mode and the combined push-pull mode. Here are a few key points you need to understand:

  • In the pull mode, we only need to store each user's outbox; the user's information flow is produced by aggregating the outbox data of the people they follow.
  • The pull mode has a relatively high aggregation cost, and the cache nodes can hit bandwidth bottlenecks. We can trade off by shrinking the amount of data that has to be fetched (for example, caching only recent microblog IDs) and by deploying cache replicas to absorb the concurrency.
  • The core of the combined push-pull mode is to push only to active followers. It requires maintaining users' online status and the active-follower lists, so it incurs additional storage costs that you need to weigh.

The pull mode and the combined push-pull mode are better suited to business scenarios with very large follower counts, such as Weibo, because both keep the message push delay relatively controllable. As you can see, across these two lessons we flexibly used technologies such as database sharding, caching, message queues, and ID generators to implement information flow systems based on the push mode, the pull mode, and the combined push-pull mode. When designing your own solution, you should make full use of each technology's strengths, weigh the characteristics of your business, and ultimately strike a balance between technology and business that meets user needs while keeping the system highly performant and highly available.