39 Information Stream Design I How to Do the Push Model in a General Information Stream System

39 Information Stream Design I - How to Do the Push Model in a General Information Stream System #

Hello, I’m Tang Yang.

In the previous two lessons, I discussed with you how to design and implement a common module in Internet systems - the counting system. Its business logic is actually very simple, with at most three interfaces: get count, increment count, and reset count. So when considering a solution, there are relatively few aspects to consider, and basically using caching can achieve a solution that balances performance, availability, and robustness. However, the logic of large-scale business systems can be very complex, and when designing a solution, it usually requires flexible use of various technologies to handle high concurrency and high traffic. So next, I will show you how to design the most complex and highest concurrency information feed system in a community system. In this way, you can understand how to apply the components you previously learned.

The earliest information feed system originated from microblogs. We know that microblogs distribute content based on the following relationship, which means that if user A follows user B, user A needs to see user B’s latest content in their own information feed in real-time. This is the basic logic of a microblog system, and it is also the key to enabling fast circulation and propagation of information. Since microblog feeds are usually displayed in reverse chronological order, we commonly refer to the information feed system as a TimeLine. So what points should we consider when designing an information feed system?

What are the focus points in designing an information flow system? #

Firstly, we need to focus on real-time data. This means that when someone you follow posts a Weibo message, it should appear in your information flow within a short period of time.

Secondly, we need to consider how to support high concurrent access. The information flow is the main module of Weibo, and it is the first module that users see when they enter Weibo. Therefore, it has the highest concurrent request volume, which can reach hundreds of thousands of requests per second.

Lastly, the performance of information flow retrieval directly affects user experience. There is a lot of data that needs to be aggregated in the Weibo information flow system. If you open the client and think about which data needs to be aggregated, it mainly includes Weibo data, user data, and other information such as whether the Weibo has been liked, commented on, forwarded, or followed. Aggregating so much data requires multiple queries to cache, databases, and counters. How can we ensure that these query operations are completed and the Weibo information flow is displayed within 100ms under hundreds of thousands of requests per second? This is the most complex aspect of the Weibo information flow system and the biggest technical challenge.

So, how do we design an information flow system that can support high concurrency and large traffic? Generally, there are two approaches: one is based on push mode, and the other is based on pull mode.

How to implement an information flow system based on push mode #

What is push mode? Push mode refers to the practice of actively pushing a Weibo (a Chinese microblogging platform) post to followers after a user sends it, thus achieving the distribution of Weibo posts and the aggregation of Weibo information flow.

If we consider the Weibo system as an email system, the Weibo posts sent by users can be regarded as entering a “sent” mailbox, and the user’s information flow can be regarded as their “inbox”. In push mode, when a user publishes a Weibo post, in addition to writing the post to their own “sent” mailbox, a copy of the Weibo post is also written to the “inbox” mailboxes of their followers.

Suppose User A has three followers, B, C, and D. If we express what the system does when User A publishes a Weibo post using SQL statements, it would look like the following:

insert into outbox(userId, feedId, create_time) values("A", $feedId, $current_time); // Write to A's "sent" mailbox

insert into inbox(userId, feedId, create_time) values("B", $feedId, $current_time); // Write to B's "inbox"

insert into inbox(userId, feedId, create_time) values("C", $feedId, $current_time); // Write to C's "inbox"

insert into inbox(userId, feedId, create_time) values("D", $feedId, $current_time); // Write to D's "inbox"

When we want to query the information flow of User B, we only need to execute the following SQL statement:

select feedId from inbox where userId = "B";

If you want to improve the performance of reading the information flow, you can store the data of the “inbox” in a cache, and retrieve it directly from the cache each time you need to access the information flow.

Problems and solutions of push mode #

You see, according to this approach, a complete Weibo information flow system can be implemented, which also conforms to our common sense. However, there are some problems with this solution.

Firstly, there is message delay. As we mentioned when discussing system notifications of unread messages, we cannot use the method of iterating through all users and adding unread messages to them, because iterating through all users once has a high latency, and push mode has the same problem. For celebrities, they have a large number of fans. If they have to write the Weibo into the inbox of tens of millions of people while sending the Weibo, the response time of sending Weibo will be very slow, and users will not be able to accept it. Therefore, we usually use message queues to eliminate the peak of writing, but even so, because there are too many messages written into the inbox, you may still see the content posted by celebrities several hours later, which greatly affects the user experience.

img

In push mode, what you need to focus on is the writing performance of Weibo because multiple database writes will be generated for each Weibo posted by users. In order to minimize the delay of Weibo writing, we can ensure it from two aspects.

On the one hand, in message processing, you can start multiple threads to process Weibo writing messages in parallel.

On the other hand, since the message flow can use cache to improve reading performance, we should try our best to ensure the performance of data writing to the database, and if necessary, use database storage engines with better writing performance.

For example, when I was at NetEase Weibo, we used push mode to implement the Weibo information flow. At that time, in order to improve the insert performance of the database, we used TokuDB as the storage engine for MySQL. The core of this engine architecture is an index structure called the Fractal Tree Indexes. We know that when the database writes data, it will cause random writes to the disk, resulting in disk seeking, which affects the performance of data writing. The Fractal Tree structure, similar to LSM mentioned in Lecture 11, can transform random writes into sequential writes, thus improving the writing performance. In addition, compared with InnoDB, TokuDB has better data compression performance. According to official tests, TokuDB can compress 4TB of data stored in InnoDB to 200G. This is also a great benefit for businesses with large amounts of written data. However, compared with InnoDB, TokuDB’s deletion and query performance are slightly worse, but you can use cache to accelerate query performance, and the deletion frequency of Weibo is not high, so this has limited impact on the message flow in push mode.

Secondly, the storage cost is high. In this solution, we generally design the table structure like this: First, design a Feed table that primarily stores basic information about microblogs. This includes microblog ID, creator ID, creation time, microblog content, microblog status (deleted or normal), and so on. It uses the microblog ID for sharding and partitioning.

Another table is the user’s outbox and inbox table, also known as the Timeline table. It has three fields: user ID, microblog ID, and creation time. It uses the user ID for sharding and partitioning.

img

Since the push mode requires maintaining a copy of the inbox data for each user, the amount of data storage is enormous. Just imagine, Xie Na’s fans have exceeded 120 million currently. So, if we adopt the push mode, every time Xie Na sends a microblog, it will generate over 120 million rows of data. How terrifying! Our solution is: besides choosing a storage engine with a higher compression rate, we can also periodically clean up the data. Microblog data has a obvious temporal nature, and users are more interested in recently posted data, typically not reviewing microblogs from a long time ago. Therefore, you can periodically clean up users’ inboxes, such as only retaining data from the past month.

In addition, scalability issues are often encountered in the push mode. In Weibo, there is a grouping feature that allows you to categorize the people you follow into different groups. For example, you can categorize the people you follow into “Celebrities”, “Technology”, “Travel”, etc. and group Yang Mi into the “Celebrities” category, and put InfoQ in the “Technology” category. So, after introducing grouping, what impact does it have on the push mode? First, a user has more than one inbox. For example, there is a global inbox, and a separate inbox for each group. After a microblog is published, it also needs to be copied to more inboxes.

If Yang Mi posts a microblog, it needs to be inserted not only into my inbox but also into my “Celebrities” inbox. This not only increases the pressure of message distribution but also increases the storage cost because each inbox needs to be stored separately.

Finally, handling unfollowing and deleting microblogs becomes more complicated. For example, when Yang Mi deletes a microblog, if we want to delete this microblog from the inboxes of all her fans, it would bring additional distribution pressure. Therefore, it is better not to do this.

And if you unfollow someone, you need to delete all their microblogs from your inbox. If they have posted a large number of microblogs, even if you don’t log in for a long time, you would still need to perform a large number of deletion operations on your inbox, which is not worth it. So, the strategy you can adopt is: when reading your own feed, check whether each microblog has been deleted and whether you are still following the author of that microblog. If not, then don’t display the content of that microblog. After using this strategy, you can minimize unnecessary write operations to the database.

So, after saying all this, what kind of business scenarios is the push mode suitable for? In my opinion, it is more suitable for scenarios where a user has a relatively limited number of followers, such as WeChat Moments. You can think of it as when I add a friend on WeChat, I follow them and they follow me back, so the upper limit for friends is also the upper limit for followers (which is usually 5000 in Moments). A limited number of followers ensures that messages can be pushed to all followers as quickly as possible, and the additional storage cost is relatively limited. If the number of followers is limited in your business, then when implementing an information feed based on follower relationships, you can also use the push mode to implement it.

Summary of the Lesson #

The above is the content of this lesson. In this lesson, I introduced the solution of implementing information flow through the push mode, as well as the problems and solutions that this mode may have. The key points you need to understand are:

The push mode is when a user sends a Weibo post, it is actively written into the inbox of their followers.

The main problems with push mode are the delay of pushed messages, storage costs, scalability of the solution, and special handling for unfollowing and deleting Weibo posts.

Push mode is more suitable for scenarios with a limited number of followers.

As you can see, push mode is not suitable for platforms like Weibo, which have tens of millions of followers, because the high message push latency and storage costs are difficult to accept. Therefore, we either use the pull mode to implement, or use a combination of push and pull mode. How are these two solutions implemented? What pitfalls do they have in implementation? And how to solve them? I will focus on introducing them in the next lesson.