38 Counting System Design II - How to Design the Unread Count System at 500,000 QPS #

Hello, I am Tang Yang.

In the previous lesson, I introduced how to design a general counting system that supports high concurrency access and stores large amounts of data. By using caching technology, message queue technology, and making deep modifications to Redis, we can support storing trillions of counts and processing millions of read requests per second. However, there is a special type of count that cannot be fully supported by the mentioned solutions, and that is the unread count.

The unread count is also a common module in a system. Taking a microblogging system as an example, you can see multiple scenarios where unread counts are used, such as:

When someone @mentions you, comments on your post, likes your post, or sends you a private message, you will receive corresponding unread notifications.
In earlier versions of microblogging platforms, there were system notifications, which means the system would send messages to all users to notify them of new versions or interesting marketing activities. If a user hasn’t read these messages, the system will show them the number of unread notifications.
When browsing the information feed, if you haven’t refreshed the page for a long time, there will be a prompt at the top of the feed showing the number of unread messages during that period.

So how do we record unread counts for the first requirement? In fact, this requirement can be implemented using the general counting system mentioned in the previous lesson, as the scenarios for both are very similar.

You can add a memory area in the counting system to store multiple unread counts, with the user ID as the key. When someone @mentions you, increase your unread @ count; when someone comments on your post, increase your unread comment count; and so on. When you click on the unread count to enter the notification page and view the messages @mentioning or commenting on you, reset these unread counts to zero. I believe that through the previous lessons, you are already very familiar with the design of this type of system, so I won’t elaborate further.
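A minimal in-memory sketch of such a per-user counter area might look like the following. The class name `NotificationUnreadCounters`, the `Type` enum, and the method names are hypothetical, and a `ConcurrentHashMap` merely stands in for the counting system's storage from the previous lesson.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical sketch: per-user unread counters keyed by user ID. */
final class NotificationUnreadCounters {

    /** Notification kinds taken from the text; the enum itself is an assumption. */
    enum Type { MENTION, COMMENT, LIKE, PRIVATE_MESSAGE }

    // One small array of counters per user, indexed by Type.ordinal().
    private final Map<Long, AtomicLongArray> countersByUser = new ConcurrentHashMap<>();

    /** Called when someone @mentions, comments on, likes, or messages the user. */
    long increment(long userId, Type type) {
        return countersByUser
                .computeIfAbsent(userId, id -> new AtomicLongArray(Type.values().length))
                .incrementAndGet(type.ordinal());
    }

    /** Returns the current unread count, or 0 for a user with no counters yet. */
    long get(long userId, Type type) {
        AtomicLongArray counters = countersByUser.get(userId);
        return counters == null ? 0L : counters.get(type.ordinal());
    }

    /** Called when the user opens the notification page for this type. */
    void reset(long userId, Type type) {
        AtomicLongArray counters = countersByUser.get(userId);
        if (counters != null) {
            counters.set(type.ordinal(), 0L);
        }
    }
}
```

Each event touches exactly one user's counters, which is why this one-to-one case fits the general counting system.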

How is the unread count for system notifications implemented? Can we reuse the general counting system? The answer is no, because doing so runs into several problems.

How to Design Unread Count for System Notifications #

Let’s look at a specific example. If your system only has three users, A, B, and C, you can add a memory area in the general counter system and store the unread notification data of these three users using their user IDs as keys. When a new notification is sent, we increment the unread count for each user in a loop. Here’s the pseudocode for this logic:

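// Pseudocode: getAllUserIds() and incrUnreadCount() are placeholders.
List<Long> userIds = getAllUserIds(); // loads every user ID in the system

for (Long id : userIds) {
  incrUnreadCount(id); // one counter write per user
}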

This solution may seem simple and feasible, but it has two fatal problems as the number of users in the system increases.

First, retrieving the full list of users is a time-consuming operation. It’s equivalent to scanning the user database once. Not only does it put a heavy load on the database, but the response time for querying the full user data is also very long, which is unacceptable for online businesses. If your user database is sharded, you would need to scan all the shards’ tables, making the response time even longer. However, there is a compromise. Before sending system notifications, you can retrieve the full list of user IDs from an offline data warehouse and store them in a local file. Then, you can iterate through all the user IDs to increment their unread count.

Although this seems like a viable technical solution, incrementing the unread count for everyone would take a very long time. Let’s do the math. If your system has one hundred million users, and incrementing the unread count for one user takes 1 ms, the whole pass takes 100,000,000 * 1 / 1000 = 100,000 seconds, which is more than a day. Even if you start 100 threads to set the counts concurrently, it would still take about 1,000 seconds, or roughly 17 minutes, which is a delay users would find hard to accept.

In addition, using this approach requires storing an unread count value for every user in the system, but in a system, only a small percentage of users are active. Most users are inactive and may have never opened the system notifications. Therefore, recording unread counts for these users would be a waste.

From the above, you can see why we can’t use a general counter system to implement the unread count for system notifications, right? So what is the correct approach?

You need to understand that system notifications are actually stored in one large list that is shared by all users. However, each person has read a different portion of that list, so each person has a different unread count. Therefore, you can record the ID of the last message each person has read from this list and count how many messages come after that ID. That number is the unread count.

There are several key points to consider when implementing this solution:

When a user visits the system notifications page, reset the unread count to 0 by setting the ID of the last notification they have seen to the ID of the latest system notification.
If the ID of the last notification seen is empty, it indicates a new user, so return an unread count of 0.
For inactive users, for example, users who haven’t logged in or used the system in the past month, you can clear the ID of the last notification they have seen to save memory space.
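The key points above can be sketched as follows. This is only an illustration under stated assumptions: notification IDs are positive and increase over time, the shared list fits in memory, and the names `SystemNotificationUnread`, `publish`, `markAllRead`, and `forget` are hypothetical.

```java
import java.util.Map;
import java.util.NavigableSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentSkipListSet;

/** Hypothetical sketch: unread system notifications derived from a per-user offset. */
final class SystemNotificationUnread {

    // The single list shared by all users; IDs are assumed to increase over time.
    private final NavigableSet<Long> notificationIds = new ConcurrentSkipListSet<>();

    // Per-user offset: the ID of the last notification that user has seen.
    private final Map<Long, Long> lastReadIdByUser = new ConcurrentHashMap<>();

    /** Publishing costs one write, no matter how many users exist. */
    void publish(long notificationId) {
        notificationIds.add(notificationId);
    }

    /** Unread count = number of notifications newer than the user's offset. */
    int unreadCount(long userId) {
        Long lastReadId = lastReadIdByUser.get(userId);
        if (lastReadId == null) {
            return 0; // new user, or an inactive user whose offset was cleared
        }
        // tailSet(..., false) is the view of IDs strictly greater than lastReadId.
        return notificationIds.tailSet(lastReadId, false).size();
    }

    /** Visiting the notification page moves the offset to the newest notification. */
    void markAllRead(long userId) {
        // Assumption: real IDs are positive, so 0 works as "nothing published yet".
        long newest = notificationIds.isEmpty() ? 0L : notificationIds.last();
        lastReadIdByUser.put(userId, newest);
    }

    /** Drops the offset of an inactive user to reclaim memory. */
    void forget(long userId) {
        lastReadIdByUser.remove(userId);
    }
}
```

Note that `publish` never loops over users, which is what removes the fan-out delay discussed earlier. The `size()` call on the skip-list view walks the unread tail, which is acceptable here because the list of system notifications is short.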

This is a relatively general solution that saves memory and minimizes the delay in retrieving the unread count. It also applies to another business scenario: marking all users at once, such as the red dot shown in the following screenshot from Weibo.

This red dot is similar to system notifications in that it is a way to notify all users. If you notify each user individually, the delay would be unacceptable. Therefore, you can use a similar solution to the one for system notifications.

First, we store a timestamp for each user, representing the most recent time they have clicked on the red dot. When a user clicks on the dot, you set this timestamp to the current time. We also maintain a global timestamp that represents the latest time the dot was marked. If you perform a background operation to mark the dot for all users, you update this timestamp to the current time. When determining whether to display the red dot, we only need to compare the timestamp of the user and the global timestamp. If the user’s timestamp is smaller than the global timestamp, it means that there have been new dots since the user last clicked on it, so the red dot should be displayed. Otherwise, the red dot should not be displayed.
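The timestamp comparison can be sketched like this. The names `RedDotBadge`, `markForEveryone`, `click`, and `shouldShow` are hypothetical, timestamps are passed in as epoch milliseconds, and treating a user with no stored timestamp as "clicked at time 0" is an assumption of this sketch rather than something the text prescribes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: red-dot badge decided by comparing two timestamps. */
final class RedDotBadge {

    // Latest time the dot was lit for everyone; 0 means it has never been lit.
    private final AtomicLong globalMarkedAtMillis = new AtomicLong(0L);

    // Most recent time each user clicked the dot.
    private final Map<Long, Long> clickedAtMillisByUser = new ConcurrentHashMap<>();

    /** Background operation: light the dot for all users with a single write. */
    void markForEveryone(long nowMillis) {
        globalMarkedAtMillis.set(nowMillis);
    }

    /** The user clicked the dot, so remember when. */
    void click(long userId, long nowMillis) {
        clickedAtMillisByUser.put(userId, nowMillis);
    }

    /** Show the dot only if it was lit after the user's last click. */
    boolean shouldShow(long userId) {
        // Assumption: a user with no recorded click is treated as clicked at time 0.
        long clickedAt = clickedAtMillisByUser.getOrDefault(userId, 0L);
        return clickedAt < globalMarkedAtMillis.get();
    }
}
```

In production the caller would typically pass `System.currentTimeMillis()`; taking the time as a parameter just keeps the sketch deterministic.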

The common feature of these two scenarios is that all users share one bounded piece of stored data. Each person keeps only their own offset into that data, and the unread count is derived from the offset.

As you can see, the implementation of the unread count for system notifications is not very complicated. It avoids any operation over the entire user base. If you have a similar red dot requirement in your system, I recommend adapting the above solution to your actual needs.

The last requirement to consider is the unread count for a microblog’s information flow. In today’s social systems, the follow relationship has become a standard feature, and an information flow based on the follow relationship is an important way to aggregate information. Therefore, designing an unread count system for information flows is a problem you must tackle.

How to Design the Unread Count for the Feed #

The complexity of the unread count in the feed is mainly due to the following reasons.

First, the Weibo feed is built on follow relationships, and so is its unread count. In other words, if someone you follow publishes a new post, your unread count as a follower increases by 1. If every Weibo user were like me, an ordinary user with only a few hundred followers, this would be simple: when you publish a post, the system can easily increase the unread count of each of your followers by 1. However, for Weibo celebrities with millions or even tens of millions of followers it becomes troublesome, because increasing the unread counts may take several hours. For example, if you follow a celebrity such as Yang Mi and want to see her posts in real time, learning about a post several hours after she published it is unacceptable. Therefore, the delay of the unread count is the first thing you need to consider when designing the scheme.

Second, the request volume for the feed’s unread count is extremely high. This is because the interface is polled by the client rather than triggered by the user. In other words, even if the user opens the Weibo client and does nothing, this interface is still requested. A few years ago, the unread count interface was already receiving nearly 500,000 requests per second, and with the growth of Weibo the number has become even higher. Because this is a non-core interface of Weibo, we cannot dedicate a large number of machines to it. Therefore, supporting such high traffic with limited resources is the challenge of this scheme.

Lastly, unlike system notifications, the feed does not have shared storage. Each person follows a different set of users, so each person’s feed list is different. Therefore, the scheme used for the system notification unread count cannot be applied here.

So, how should you design a feed unread count system that can handle hundreds of thousands of requests per second? Here’s what you can do:

First, record the number of posts each user publishes in a general counter.

Then, record the snapshot of the number of posts from all the users that one person follows in Redis or Memcached. When the user clicks on the unread messages and resets the count to 0, refresh the snapshot with the number of posts from all the users that the user follows.

In this way, the difference between the total number of posts from all the users that the user follows and the total number of posts recorded in the snapshot will be the user’s unread count in the feed.

For example, let’s say User A follows User B, C, and D, as shown in the above diagram. User B has posted 10 Weibos, User C has posted 8 Weibos, and User D has posted 14 Weibos. In the most recent check of unread messages by User A, the number of posts in the snapshot for these three users was 6, 7, and 12 respectively. Therefore, User A’s unread count would be (10-6) + (8-7) + (14-12) = 7.
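The snapshot scheme, including the worked example above, can be sketched in a few lines. This is a simplified illustration: in-memory maps stand in for the general counter and for Redis or Memcached, the follow list is passed in by the caller, and the names `FeedUnreadCounter`, `recordPost`, `unreadCount`, and `reset` are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: feed unread count as "current post counts minus snapshot". */
final class FeedUnreadCounter {

    // Stand-in for the general counter: total posts published per author.
    private final Map<Long, Long> postCountByAuthor = new ConcurrentHashMap<>();

    // Stand-in for Redis/Memcached: per reader, the post counts seen at the last reset.
    private final Map<Long, Map<Long, Long>> snapshotByReader = new ConcurrentHashMap<>();

    /** Publishing touches only the author's own counter, however many followers exist. */
    void recordPost(long authorId) {
        postCountByAuthor.merge(authorId, 1L, Long::sum);
    }

    /** Sums (current count - snapshot count) over everyone the reader follows. */
    long unreadCount(long readerId, List<Long> followingIds) {
        Map<Long, Long> snapshot = snapshotByReader.get(readerId);
        if (snapshot == null) {
            return 0L; // no snapshot (never reset, or evicted from the cache): report 0
        }
        long unread = 0L;
        for (long authorId : followingIds) {
            long current = postCountByAuthor.getOrDefault(authorId, 0L);
            // Assumption: an author missing from the snapshot contributes 0 unread posts.
            long seen = snapshot.getOrDefault(authorId, current);
            unread += Math.max(0L, current - seen);
        }
        return unread;
    }

    /** The reader opened the feed: refresh the snapshot to the current post counts. */
    void reset(long readerId, List<Long> followingIds) {
        Map<Long, Long> snapshot = new HashMap<>();
        for (long authorId : followingIds) {
            snapshot.put(authorId, postCountByAuthor.getOrDefault(authorId, 0L));
        }
        snapshotByReader.put(readerId, snapshot);
    }
}
```

With B, C, and D at 6, 7, and 12 posts when User A last reset, and at 10, 8, and 14 posts now, `unreadCount` returns 7, matching the calculation above. A write costs one counter increment, while a read costs one lookup per followed account, which stays cheap because the number of accounts a person follows is bounded.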

This scheme is simple to design and operates completely in memory, making it efficient enough to handle high concurrency. In fact, the Weibo team supports nearly 500,000 requests per second with just 16 ordinary servers, which demonstrates the excellent performance of this scheme. Thus, it fully meets the requirements of the unread count in the feed.

Of course, this scheme also has some drawbacks. For example, it needs to store a snapshot of each user’s follow relationships, so if that snapshot is not updated promptly when the relationships change, the unread count can be inaccurate. The snapshots are also kept entirely in cache, so when the cache fills up some data is evicted, and the unread count of the affected users drops to 0. Fortunately, users do not demand much accuracy from an unread count (few can tell 10 unread posts from 11), so these drawbacks are acceptable.

By sharing this case study on the design of an unread count system, I would like to give you some advice:

Caching is a powerful tool for improving system performance and handling high concurrency. Thanks to caching, a high-traffic interface like the Weibo feed unread count can be served by just over a dozen servers.
Look for solutions to the key challenges in a system design, just as we solved the delay problem for the system notification unread count.
Analyze the business scenario rationally, and clarify what can be compromised and what cannot. This will greatly improve the design of your system. For example, for users who have not logged in for a long time, we record their unread count as 0. Through such compromises, we can greatly reduce memory consumption and costs.

Summary of the Lesson #

The above is the content of this lesson. In this lesson, I introduced the design of the unread message system. The key points you need to understand are:

For one-to-one relationships like unread comments, @mentions, and likes, you can use the general counting method described in the previous lesson.

For scenarios where all users share one bounded store, such as system notifications and marking all users with a red dot, you can implement the unread solution by recording each user’s last operation time or offset.

Lastly, the unread count for the feed is the most complex. It is implemented by recording a snapshot of the post counts of the users each person follows.

As you can see, although these three types of requirements are all related to unread counts, they differ in scenario and scale. Therefore, as I just mentioned, when designing a solution, you need to analyze the scenario behind the requirement, such as the data scale and request volume, and look for characteristics that can be exploited (such as the limited shared storage in the system notification scenario or the limited number of accounts each user follows in the feed scenario). Based on that, you can develop a targeted solution. Avoid blindly applying previous experience to different scenarios, as it may degrade performance or even endanger the stability of the system.