12 One Billion Keys: Which Collection Should Be Used for Counting? #

In the business scenarios of web and mobile applications, we often need to store information where a key corresponds to a collection of data. Let me give you a few examples.

  • User login information in a mobile app: Each day corresponds to a series of user IDs or mobile device IDs.
  • User comment list for products on an e-commerce website: Each product corresponds to a series of comments.
  • Check-in information for users on a mobile app: Each day corresponds to a series of check-in records of users.
  • Web page visit information on an application website: Each webpage corresponds to a series of click events.

As we know, Redis collection types are characterized by one key corresponding to a series of data, which makes them very suitable for storing and accessing this kind of information. However, beyond simply storing the data, in these scenarios we often also need to run statistics over the collections. For example:

  • In a mobile application, we need to count the number of new users each day and the number of users retained the next day.
  • In the product comments on an e-commerce website, we need to count the latest comments in the comment list.
  • In check-in records, we need to count the number of users who checked in continuously within a month.
  • In web page visit records, we need to count the number of unique visitors (UV).

In general, we face large numbers of users and high traffic volume, such as millions or tens of millions of users, or even billions of access records. Therefore, we must choose a collection type that can efficiently handle a large amount of data (e.g., billions) for statistics.

To choose the appropriate collection type, we need to understand the common statistical patterns for collections. In this lesson, I will introduce four of them: aggregation statistics, sorting statistics, binary state statistics, and cardinality statistics. Using the four scenarios above, I will discuss which collection types handle each pattern most efficiently while also saving memory. Once you have mastered today's content, you will be able to quickly select the appropriate collection type whenever you encounter a collection-statistics problem.

Aggregation Statistics #

Let's first look at the first pattern: aggregate statistics.

Aggregation statistics refers to computing aggregate results over multiple sets, including:

  • counting the elements common to multiple sets (intersection statistics);
  • comparing two sets to find the elements unique to one of them (difference statistics);
  • counting all the elements across multiple sets (union statistics).

In the scenarios mentioned earlier, counting a mobile app's daily new users and next-day retained users corresponds to aggregate statistics.

To complete this statistical task, we can use a set to record the user IDs of all users who have logged into the app. At the same time, another set is used to record the user IDs of users who have logged into the app each day. Then, we can perform aggregate statistics on these two sets. Let’s take a closer look at the specific operations.

It is relatively simple to record the user IDs of all users who have ever logged into the app. We can use the Set type directly, with the key set to user:id and the value being a Set collection containing the user IDs of all users who have logged in. We can call this set the cumulative user Set.

It is important to note that the cumulative user Set does not contain date information, so we cannot directly calculate the number of new users each day. Therefore, we also need to record the user IDs of users who log in each day in a new set. We call this set the daily user Set, which has two characteristics:

  • The key is user:id plus the current date, for example, user:id:20200803;
  • The value is a Set collection that records the user IDs of the users who logged in that day.

When calculating the number of new users each day, we only need to calculate the difference between the daily user Set and the cumulative user Set.

I will explain with a concrete example.

Suppose our mobile app was launched on August 3, 2020, so there are no users before August 3. At this time, the cumulative user Set is an empty set, and the user IDs of users who log in on that day will be recorded in the Set with the key user:id:20200803. Therefore, the user IDs in the Set user:id:20200803 are the new users of that day.
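For example, as users log in on August 3, we can record their IDs with SADD (the user IDs below are illustrative placeholders):

SADD user:id:20200803 1001 1002 1003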

Then, we calculate the union of the cumulative user Set and the Set user:id:20200803, and the result is saved in the cumulative user Set with the key user:id, as shown below:

SUNIONSTORE  user:id  user:id  user:id:20200803

At this time, the cumulative user Set user:id contains the user IDs of August 3. When counting on August 4, we record the user IDs of users who log in on August 4 in the Set user:id:20200804. Next, we execute the SDIFFSTORE command to calculate the difference between the cumulative user Set and the Set user:id:20200804. The result is saved in the Set with the key user:new, as shown below:

SDIFFSTORE  user:new  user:id:20200804 user:id

As you can see, the user IDs in this difference set exist in the Set user:id:20200804, but not in the cumulative user Set. So, the Set user:new records the new users on August 4.

When calculating the retained users on August 4, we only need to calculate the intersection of the two Sets user:id:20200803 and user:id:20200804. This will give us the user IDs that exist in both sets, which are the users who logged in on August 3 and were retained on August 4. The command executed is as follows:

SINTERSTORE user:id:rem user:id:20200803 user:id:20200804

When you need to perform aggregate calculations on multiple sets, the Set data type is a very good choice. However, I want to remind you that there is a potential risk here.

The computational complexity of set difference, union, and intersection is relatively high. With large amounts of data, executing these calculations directly can block the Redis instance. So let me share a small suggestion: you can pick one slave in the master-slave cluster and dedicate it to aggregate calculations, or read the data to the client and complete the aggregation on the client side. Either way, you avoid the risk of blocking the master and the other slave instances.
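For example, if a slave is dedicated to aggregation, a minimal sketch is to run the read-only SDIFF command there (rather than SDIFFSTORE, which writes and is rejected by a read-only slave) and let the application consume the returned difference directly:

SDIFF user:id:20200804 user:id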

Sorting Statistics #

Next, let's talk about how to handle sorting requirements for collections, using the latest-comment list of an e-commerce website as the example scenario.

The latest comment list contains the most recently posted comments, which requires the collection elements to be ordered, that is, the elements can be arranged in a specific sequence. A collection that maintains the order of its elements is called an ordered collection.

Among the four commonly used collection types in Redis (List, Hash, Set, Sorted Set), List and Sorted Set are ordered collections.

List is sorted based on the order in which elements are added to the list, while Sorted Set can be sorted based on the “weight” of the elements, and we can decide the weight value for each element. For example, we can determine the weight value based on the time when the element is inserted into the Sorted Set, where elements inserted earlier have a smaller weight value, and elements inserted later have a larger weight value.

Both options seem to satisfy the requirement, so how do we choose?

Let’s first talk about using List. Each product corresponds to a List, which contains all the comments for that product, and these comments are saved in the List in the order of the time they are posted. Whenever a new comment is posted, we use the LPUSH command to insert it at the front of the List.
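For example, here is a minimal sketch (the key product1 and the single-letter comments are illustrative placeholders) that builds the List used below, leaving the newest comment A at the head:

LPUSH product1 F E D C B A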

When there is only one page of comments, we can clearly see the latest comments. However, in practical applications, websites usually display the latest comments in pages. Once pagination is involved, List may encounter some problems.

Assume that the current comment List is {A, B, C, D, E, F} (where A is the latest comment, and so on, and F is the earliest comment), we can use the following command to retrieve the three latest comments, A, B, C, for the first page:

LRANGE product1 0 2
1) "A"
2) "B"
3) "C"

Then, we can use the following command to retrieve the next three comments, D, E, F, for the second page:

LRANGE product1 3 5
1) "D"
2) "E"
3) "F"

However, let’s say that before displaying the second page, a new comment G is posted, and the LPUSH command inserts comment G at the front of the comment List, so the List becomes {G, A, B, C, D, E, F}. At this point, when using the above command to retrieve the second page of comments, we would find that comment C is displayed again, that is, C, D, E.
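Continuing the sketch, the new comment is inserted at the head like this:

LPUSH product1 G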

LRANGE product1 3 5
1) "C"
2) "D"
3) "E"

The key reason is that List sorts elements by their position in the List. When a new element is inserted at the head, all existing elements move back by one position: an element originally at position 1 is now at position 2. Therefore, compared with before the insertion, the same positions in the List now hold different elements, and reading by position with LRANGE returns stale results.

Compared to List, Sorted Set does not have this problem because it sorts and retrieves data based on the actual weight of the elements.

We can assign a weight value to each comment based on its posting time, and then save the comments in a Sorted Set. The ZRANGEBYSCORE command of Sorted Set can return the elements in sorted order based on their weight values. In this way, even if the elements in the set are frequently updated, Sorted Set can accurately retrieve data in the desired order using the ZRANGEBYSCORE command.

Let’s assume that the newer the comment, the greater the weight value. If the weight value for the latest comment is N, we can use the following command to obtain the 10 latest comments:

ZRANGEBYSCORE comments N-9 N
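To make this concrete, here is a minimal sketch (the key name and the weights are illustrative). The six comments are inserted with weights 1 to 6 (6 being the newest), the first page of three comments is read by weight range, and then the new comment G arrives with weight 7. Because the second page is requested by an explicit weight range rather than by position, the insertion does not affect it:

ZADD comments 1 "F" 2 "E" 3 "D" 4 "C" 5 "B" 6 "A"
ZREVRANGEBYSCORE comments 6 4
1) "A"
2) "B"
3) "C"
ZADD comments 7 "G"
ZREVRANGEBYSCORE comments 3 1
1) "D"
2) "E"
3) "F"

ZREVRANGEBYSCORE is simply ZRANGEBYSCORE with the results returned from high weight to low, which matches a newest-first comment list.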

Therefore, when facing scenarios that require displaying the latest lists or rankings, and when the data is frequently updated or pagination is needed, it is recommended to consider using Sorted Set.

Binary State Statistics #

Now let's analyze the third scenario: binary state statistics. Binary state means that the elements of the collection take only two values, 0 and 1. In the check-in scenario, we only need to record whether a user checked in (1) or not (0), which is a typical binary state.

When calculating check-in statistics, one bit is enough to represent one user's check-in on one day. For a month (at most 31 days), 31 bits are sufficient, and for a whole year, only 365 bits are needed; there is no need for a complicated data structure. This is where Bitmap comes in. Bitmap is an extended data type provided by Redis. Let me explain how it works.

Bitmap is implemented on top of the String type, which is stored as a binary byte array. Redis uses each bit of the byte array to represent the binary state of one element, so you can think of a Bitmap as an array of bits.

Bitmap offers the GETBIT/SETBIT operations, which read and write a specific bit of the bit array at a given offset. Note that offsets in a Bitmap start from 0, that is, the minimum offset is 0. SETBIT writes the value you pass (0 or 1) to the bit at the given offset. Bitmap also provides the BITCOUNT operation, which counts the number of bits set to 1 in the bit array.

Now, let’s see how Bitmap can be used for check-in statistics. I will provide an example to illustrate the steps.

Assume we want to track the check-in status of user ID 3000 in August 2020. Here are the steps:

  1. Record that the user checked in on August 3rd (the third day of the month, hence offset 2, since offsets start from 0):
SETBIT uid:sign:3000:202008 2 1
  2. Check whether the user checked in on August 3rd:
GETBIT uid:sign:3000:202008 2
  3. Count the number of days the user checked in during August:
BITCOUNT uid:sign:3000:202008

With these steps, we can know the check-in status of the user in August. Simple, right? Now, let’s consider another question: If we record the check-in status of 100 million users for 10 days, can you find the total number of users who checked in continuously for those 10 days?

Before introducing a specific method, we need to understand that Bitmap supports bitwise operations like AND, OR, and XOR on multiple Bitmaps. The result of the operation is stored in a new Bitmap.

I will use the bitwise AND operation as an example to explain. In the example below, we have three Bitmaps, bm1, bm2, and bm3. We perform a bitwise AND operation on the corresponding bits of the three Bitmaps and save the result in a new Bitmap (in the example, this result Bitmap is named “resmap”).

Image: Bitmap AND operation
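In command form, this corresponds to Redis's BITOP operation (bm1, bm2, bm3, and resmap are the illustrative key names from the figure); BITOP also supports OR and XOR:

BITOP AND resmap bm1 bm2 bm3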

Now, let’s get back to the question we had before - counting the check-in status of 100 million users for 10 consecutive days. You can use each day’s date as the key and create a 100 million-bit Bitmap for each day. Each bit in the Bitmap corresponds to a user’s check-in status for that day.

Next, perform the AND operation on the 10 Bitmaps. The result is another Bitmap, in which the bit positions corresponding to users who checked in on all 10 days hold the value 1. Finally, you can use BITCOUNT to count the number of 1s in this Bitmap, which gives you the total number of users who checked in continuously for 10 days.
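As a minimal sketch (the date-based key naming scheme is an illustrative assumption), ANDing the 10 daily Bitmaps and counting the 1s looks like this:

BITOP AND sign:10days sign:20200801 sign:20200802 sign:20200803 sign:20200804 sign:20200805 sign:20200806 sign:20200807 sign:20200808 sign:20200809 sign:20200810
BITCOUNT sign:10days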

Now, let’s calculate the memory overhead of recording the check-in status for 10 days. Since we use a 100 million-bit Bitmap for each day, it roughly occupies 12MB of memory (10^8/8/1024/1024). The memory overhead for the Bitmaps of 10 days is approximately 120MB, which is not too high. However, in practical applications, it’s recommended to set an expiration time for the Bitmaps so that Redis automatically deletes unnecessary check-in records to save memory.
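For example, to have Redis discard a day's Bitmap automatically after 30 days (the key name and TTL are illustrative):

EXPIRE sign:20200801 2592000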

Therefore, if you only need to track the binary state of data, such as whether a product exists or if a user is present, you can use Bitmap because it only requires one bit to represent 0 or 1. When dealing with massive amounts of data, Bitmap can effectively save memory space.

Cardinality Statistics #

Finally, let's take a look at the last statistical scenario: cardinality statistics. Cardinality statistics refers to counting the number of unique elements in a set. In the scenario introduced earlier, it corresponds to counting the Unique Visitors (UV) of a webpage.

Counting webpage UV has one special requirement: deduplication. Multiple visits by the same user within one day should be counted as a single visit. In Redis, the Set type deduplicates by default, so when we encounter a deduplication requirement, a Set is the first thing that comes to mind.

Let’s consider an example using the Set data type.

When user1 visits page1, you can add this information to the Set using the following command:

SADD page1:uv user1

When user1 visits the page again, the Set's deduplication ensures the repeated visit is not recorded, so user1 is still counted as one unique visitor. When you need the UV, you can use the SCARD command, which returns the number of elements in a Set.
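For example:

SCARD page1:uv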

However, if page1 becomes extremely popular and the UV reaches millions, then a Set would need to store millions of user IDs. For an e-commerce website running large promotions, there may be thousands of such pages, and if each page uses a Set like this, it would consume a significant amount of memory.

Alternatively, you can use the Hash data type to record the UV.

For example, you can use page1:uv as the Hash key and record each visiting user's ID as a field whose value is 1, using the HSET command (which sets the value of a field in a Hash). After user1 visits page1, you record the visit as follows:

HSET page1:uv user1 1

Even if user1 visits the page multiple times, executing HSET repeatedly only sets user1's value to 1 again, so the user is still counted as one unique visitor. When counting the UV, you can use the HLEN command to get the total number of fields in the Hash.
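For example:

HLEN page1:uv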

However, similar to the Set data type, when there are many pages, the Hash data type would also consume a significant amount of memory. So, is there a way to achieve the counting and save memory at the same time?

This is where HyperLogLog, provided by Redis, comes in.

HyperLogLog is a data structure used for cardinality estimation, and its main advantage lies in the fact that when there are a large number of elements in the set, the space required to calculate the cardinality is always fixed and very small.

In Redis, each HyperLogLog only requires 12 KB of memory and can estimate the cardinality of nearly 2^64 elements. As you can see, compared to the Set and Hash data types that consume more memory as the number of elements increases, HyperLogLog is very memory-efficient.

When counting the UV, you can use the PFADD command (used to add new elements to a HyperLogLog) to add each user who visits the page to the HyperLogLog.

PFADD page1:uv user1 user2 user3 user4 user5

Next, you can directly obtain the UV value of page1 using the PFCOUNT command, which returns the estimation result of the HyperLogLog.

PFCOUNT page1:uv

As for the specific implementation principle of HyperLogLog, you don't need to dig into it, because it doesn't affect daily usage; if you're interested, you can look into it in your spare time.

However, there is one important thing to note: the cardinality estimation of HyperLogLog is based on probabilities, so the estimation result it provides has some margin of error, with a standard error rate of 0.81%. This means that if you use HyperLogLog to estimate a UV of 1 million, the actual UV may be 1.01 million. Although the error rate is not high, if you need precise statistical results, it is better to continue using the Set or Hash data types.

Summary #

In this lesson, we learned about four typical statistical patterns for collection types: aggregation statistics, sorting statistics, binary state statistics, and cardinality statistics. We used four typical scenarios to understand them: counting new and retained users, the latest comment list, user check-in counts, and unique visitor (UV) counts. To help you grasp the material, I have summarized the support and trade-offs of Set, Sorted Set, Hash, List, Bitmap, and HyperLogLog in the following table. I hope you will save it and review it from time to time.
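| Statistical pattern | Suitable types | Notes |
| --- | --- | --- |
| Aggregation statistics | Set, Sorted Set, Bitmap | Only Set supports difference calculation; Bitmap supports AND, OR, and XOR across bitmaps |
| Sorting statistics | Sorted Set (preferred), List | List positions shift when elements are inserted, so paginated reads may repeat elements; Sorted Set sorts by element weight |
| Binary state statistics | Bitmap | One bit per data point; very memory-efficient at scale |
| Cardinality statistics | HyperLogLog | Fixed 12 KB of memory with a 0.81% standard error; use Set or Hash when exact results are required |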

As we can see, both Set and Sorted Set support multiple aggregation statistics. However, only Set supports difference set calculation. Bitmap can perform multiple aggregation calculations between bitmaps, including AND, OR, and XOR operations.

When sorting statistics are needed, although the elements in a List are ordered, once a new element is inserted the positions of the existing elements change, so results read by position may be inaccurate. A Sorted Set, on the other hand, sorts by the weight of its elements and can always retrieve results in the desired order accurately. Therefore, I recommend prioritizing Sorted Set.

If the data we record has only two values, 0 and 1, Bitmap is a good choice, mainly because it uses just one bit per data point, which saves a great deal of memory.

For cardinality statistics, if the number of collection elements reaches the billions and exact results are not required, I recommend using HyperLogLog.

Of course, Redis has many application scenarios, and the summary in this table may not cover all scenarios. I suggest you try to draw your own table and add other scenarios you have encountered. With long-term accumulation, you will surely be able to apply collection types more flexibly to suitable practical projects.

One Question per Lesson #

As usual, I have a small question for you. In this lesson, we learned about four typical statistical patterns, as well as each collection type's support, pros, and cons. Have you encountered other statistical scenarios, and which collection types did you use for them?

Please feel free to share your thoughts and answers in the comments section, and let’s discuss. If you have friends or colleagues who need help with these statistical problems, please feel free to share today’s content with them. See you in the next lesson.