22 HyperLogLog: An Outstanding Cardinality Counting Algorithm #

Why Use HyperLogLog? #

In real-world development, we may run into a problem: which data type should we use to count the number of unique visitors for a large website?

If we use Redis sets for counting, a site with millions of visits per day quickly becomes a problem. The visit records cannot simply be cleared, because our operations staff may need to consult them at any time, so over time the space occupied by the statistics grows and grows, eventually exceeding the maximum space we can afford.

For example, if we use the IP address as the criterion for uniqueness, we have to store every distinct IP address. Taking IPv4 as an example, storing an address such as “110.110.110.110” as text takes up to 15 bytes. With 10 million unique IP addresses, the space occupied is 15 bytes * 10,000,000, which is around 143 MB. And that is the statistics for just one page: with 10,000 such pages, we would need more than 1 TB of space to store the data. Moreover, as IPv6 becomes widespread, this footprint will only grow. Sets clearly cannot carry this workload; we need a different data type, HyperLogLog.
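
As a quick sanity check on those numbers, here is a minimal sketch of the arithmetic (our own illustration, not Redis code):

public class SetMemoryEstimate {
    public static void main(String[] args) {
        long bytesPerIp = 15;              // "110.110.110.110" stored as text
        long uniqueIps = 10_000_000L;      // unique visitors for one page
        long pages = 10_000L;              // number of pages tracked

        double perPageMb = bytesPerIp * uniqueIps / 1024.0 / 1024.0;
        double totalTb = perPageMb * pages / 1024.0 / 1024.0;
        // Prints roughly: one page: 143 MB, all pages: 1.36 TB
        System.out.printf("one page: %.0f MB, all pages: %.2f TB%n", perPageMb, totalTb);
    }
}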

Introduction to HyperLogLog #

HyperLogLog (referred to as HLL below) is a data structure added in Redis 2.8.9. It is used for high-performance cardinality (distinct-count) statistics, and despite producing an estimate rather than an exact count, its error rate is extremely low.

HLL has the following characteristics:

  • It can count a large amount of data with very little memory: only 12 KB of space is needed to count up to 2^64 elements.
  • The count carries a certain error, but the error rate is very low overall, with a standard error of 0.81%.
  • The error can be reduced further by the auxiliary correction factors used in the calculation.

Basic Usage #

HLL has only three commands, but they are very practical. Let’s take a look at each one below.

Add elements #

127.0.0.1:6379> pfadd key "redis"
(integer) 1
127.0.0.1:6379> pfadd key "java" "sql"
(integer) 1

Syntax:

pfadd key element [element ...]

This command adds one or more elements to the HLL structure. It returns 1 when the call changes the HLL's internal state, and 0 when every element was effectively already counted.

Count unique elements #

127.0.0.1:6379> pfadd key "redis"
(integer) 1
127.0.0.1:6379> pfadd key "sql"
(integer) 1
127.0.0.1:6379> pfadd key "redis"
(integer) 0
127.0.0.1:6379> pfcount key
(integer) 2

From the result of pfcount, we can see that the HLL structure stored at key contains 2 unique values: redis and sql. The duplicate redis was counted only once, so the result here is accurate.

Syntax:

pfcount key [key ...]

This command counts one or more HLL structures. When several keys are given, it returns the approximate cardinality of their union, which Redis computes by merging the HLLs internally into a temporary structure.
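
Here is a minimal Jedis sketch of that multi-key behavior; the key names and values are just for illustration:

import redis.clients.jedis.Jedis;

public class PfcountUnionExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            jedis.pfadd("k", "java", "sql");
            jedis.pfadd("k2", "redis", "sql");
            // Counting both keys returns the cardinality of their union: 3
            System.out.println(jedis.pfcount("k", "k2"));
        }
    }
}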

Merge one or more HLLs into a new structure #

Merge k and k2 into a new structure k3; the code is as follows:

127.0.0.1:6379> pfadd k "java" "sql"
(integer) 1
127.0.0.1:6379> pfadd k2 "redis" "sql"
(integer) 1
127.0.0.1:6379> pfmerge k3 k k2
OK
127.0.0.1:6379> pfcount k3
(integer) 3

Related syntax:

pfmerge destkey sourcekey [sourcekey ...]

Use Cases of pfmerge #

When we need to combine the visit data of two or more similar pages, we can use pfmerge to perform the operation.
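
For instance, per-page unique-visitor counters can be merged into a site-wide one. A minimal Jedis sketch, where the key names page:home:uv, page:blog:uv, and site:uv are hypothetical:

import redis.clients.jedis.Jedis;

public class MergePageViewsExample {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            // Hypothetical per-page unique-visitor counters keyed by visitor IP
            jedis.pfadd("page:home:uv", "110.110.110.110", "111.111.111.111");
            jedis.pfadd("page:blog:uv", "110.110.110.110", "112.112.112.112");

            // Merge both pages into one site-wide counter
            jedis.pfmerge("site:uv", "page:home:uv", "page:blog:uv");

            // The visitor shared by both pages is counted once: prints 3
            System.out.println("site UV: " + jedis.pfcount("site:uv"));
        }
    }
}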

Code Implementation #

Next, we will exercise the three basic HLL commands from Java. The code is as follows:

import redis.clients.jedis.Jedis;

public class HyperLogLogExample {
    public static void main(String[] args) {
        // try-with-resources closes the connection when we are done
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            // Add elements; the duplicate "redis" is only counted once
            jedis.pfadd("k", "redis", "sql");
            jedis.pfadd("k", "redis");
            // Count unique elements
            long count = jedis.pfcount("k");
            System.out.println("k: " + count);
            // Merge k into a new HLL k2 and count the result
            jedis.pfmerge("k2", "k");
            System.out.println("k2: " + jedis.pfcount("k2"));
        }
    }
}

The output of the above code is as follows:

k: 2
k2: 2

Principles of HLL Algorithm #

The HyperLogLog algorithm is based on the paper “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”. To understand how HLL works, we need to start with Bernoulli trials, which here amount to flipping coins: a single trial consists of flipping a fair coin as many times as it takes for a head to appear, and the trial succeeds once that head shows up, no matter how many flips it took.

We use k for the number of flips a trial takes until its first head, n for the number of trials, and k_max for the largest k observed across all trials. Estimation then gives a relationship between n and k_max: n ≈ 2^(k_max). However, when the number of trials is small, this estimate carries a large error. For example, consider the following 3 trials:

  • Trial 1: the first head appears on flip 3, so k = 3; n = 1
  • Trial 2: the first head appears on flip 2, so k = 2; n = 2
  • Trial 3: the first head appears on flip 6, so k = 6; n = 3

For these three trials, k_max = 6 and n = 3, but clearly 3 ≠ 2^6 under the estimation formula. To solve this problem, HLL introduces a bucketing algorithm and the harmonic mean to bring the estimate closer to the real value.
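
To see how noisy the raw n ≈ 2^(k_max) estimate is, here is a minimal simulation sketch (our own illustration, not part of Redis): it runs n trials and prints the single-estimator guess. Rerunning it shows the estimate jumping around by factors of two, which is exactly the error the next two ideas are meant to smooth out.

import java.util.Random;

public class BernoulliEstimate {
    // One Bernoulli trial: flip a fair coin until the first head; return the flip count k.
    static int flipsUntilHead(Random rnd) {
        int k = 1;
        while (rnd.nextBoolean()) {
            k++; // tails: flip again
        }
        return k;
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        int n = 100_000; // the true number of trials we want to estimate
        int kMax = 0;
        for (int i = 0; i < n; i++) {
            kMax = Math.max(kMax, flipsUntilHead(rnd));
        }
        // The raw estimate 2^k_max is often off by a factor of 2 or more
        System.out.println("true n = " + n + ", raw estimate = " + (1L << kMax));
    }
}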

The bucketing algorithm divides the original data into m equal parts, calculates an average within each part, and then multiplies that average by m, which reduces the error caused by randomness and improves the accuracy of the estimate. In simple terms, the data is split into multiple parts and estimated over many rounds instead of relying on a single round of calculation.

Using the harmonic mean means applying the harmonic-mean formula instead of the plain arithmetic mean, because it is far less sensitive to outliers.

For example, suppose Xiao Ming’s monthly salary is 1,000 yuan and Xiao Wang’s is 100,000 yuan. Taking the arithmetic mean gives (1000 + 100000) / 2 = 50,500 yuan, which badly misrepresents the typical salary. The harmonic mean, 2 / (1/1000 + 1/100000) ≈ 1,980 yuan, is clearly a more representative average here.
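
The same arithmetic as a minimal sketch in code (the harmonicMean helper is our own, not a library API):

public class HarmonicMeanExample {
    // Harmonic mean of the inputs: n divided by the sum of reciprocals
    static double harmonicMean(double... values) {
        double reciprocalSum = 0;
        for (double v : values) {
            reciprocalSum += 1.0 / v;
        }
        return values.length / reciprocalSum;
    }

    public static void main(String[] args) {
        System.out.println((1000.0 + 100000.0) / 2);   // arithmetic mean: 50500.0
        System.out.println(harmonicMean(1000, 100000)); // harmonic mean: ~1980.2
    }
}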

Putting these pieces together: when HLL inserts data in Redis, each stored value is hashed into a binary value; part of the hash selects a bucket, and the bucket keeps only the position of the first 1 bit seen in the remaining bits (the counterpart of k_max in the coin flips). In this way, a small, fixed amount of space can summarize a huge amount of data, and at query time the bucket values are combined with the harmonic mean to produce the count. This is the basic principle of the HLL algorithm. If you want a deeper understanding of the algorithm and its reasoning process, you can refer to the original paper, which is referenced at the end of this text.
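
To make the idea concrete, here is a heavily simplified, illustrative sketch of an HLL-style structure in Java. It is not Redis’s implementation: it uses byte-sized registers instead of Redis’s 6-bit ones, a simple stand-in hash mixer, and it omits the small- and large-range corrections of the full algorithm.

public class MiniHll {
    static final int BUCKET_BITS = 14;             // Redis also uses 2^14 = 16384 buckets
    static final int M = 1 << BUCKET_BITS;
    final byte[] registers = new byte[M];

    // splitmix64 finalizer: a simple stand-in for a real 64-bit hash function
    static long hash(long x) {
        x ^= x >>> 30; x *= 0xbf58476d1ce4e5b9L;
        x ^= x >>> 27; x *= 0x94d049bb133111ebL;
        return x ^ (x >>> 31);
    }

    void add(long value) {
        long h = hash(value);
        int bucket = (int) (h & (M - 1));          // low 14 bits choose the bucket
        long rest = h >>> BUCKET_BITS;
        int rank = Long.numberOfTrailingZeros(rest) + 1; // position of the first 1 bit
        if (rank > registers[bucket]) {
            registers[bucket] = (byte) rank;       // keep the maximum, like k_max
        }
    }

    double estimate() {
        double sum = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);                // harmonic-mean combination
        }
        double alpha = 0.7213 / (1 + 1.079 / M);   // bias-correction constant
        return alpha * M * M / sum;
    }

    public static void main(String[] args) {
        MiniHll hll = new MiniHll();
        for (long i = 0; i < 1_000_000; i++) {
            hll.add(i);
        }
        // Even this toy version estimates 1,000,000 within a few percent
        System.out.println("estimate: " + (long) hll.estimate());
    }
}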

Summary #

When large-scale distinct-count statistics are needed, ordinary set types cannot meet our requirements. In that case, we can use HyperLogLog, available since Redis 2.8.9, for the statistics. Its advantage is that it needs only 12 KB of space to count up to 2^64 elements; its disadvantage is a standard error of 0.81%. HyperLogLog provides three commands: pfadd for adding elements, pfcount for counting unique elements, and pfmerge for merging HLLs.

References #
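
  • Philippe Flajolet, Éric Fusy, Olivier Gandouet, Frédéric Meunier. “HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm”. Proceedings of the 2007 Conference on Analysis of Algorithms (AofA ’07), DMTCS Proceedings, 2007.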