29 Practical Deployment Filter Installation, Usage, and Principle Analysis

29 Practical Deployment Filter Installation, Usage, and Principle Analysis #

We have previously mentioned that HyperLogLog can be used for cardinality estimation, but it does not provide a method to determine if a value exists. So how can we check if a value exists in a huge dataset?

Using traditional methods such as SQL queries is inefficient and resource-intensive due to the large amount of data. Therefore, we need an excellent algorithm and functionality to fulfill this requirement. This is what we are going to discuss today - Bloom Filters.

Enabling Bloom Filters #

In Redis, we cannot directly use Bloom Filters, but we can introduce them through the modules provided by Redis 4.0 and later versions. This article provides two ways to enable Bloom Filters.

Method 1: Compilation #

1. Download and install Bloom Filters

git clone https://github.com/RedisLabsModules/redisbloom.git
cd redisbloom
make # compile redisbloom

If the compilation is successful, a redisbloom.so file will be generated in the root directory.

2. Start the Redis server

> ./src/redis-server redis.conf --loadmodule ./src/modules/RedisBloom-master/redisbloom.so

Here, --loadmodule means loading the extension module, followed by the directory of the redisbloom.so file.

Method 2: Docker #

docker pull redislabs/rebloom # pull the Docker image
docker run -p6379:6379 redislabs/rebloom # run the container

Start Verification #

After starting the service, we need to check if the Bloom Filter has been enabled. To do this, we can simply connect to the server using redis-cli and see if the command prompt shows bf.add, as shown in the figure below:

image.png

If the command prompt appears, it means that the Redis server has successfully enabled Bloom Filters.

Using Bloom Filters #

Bloom Filters have a limited number of commands, mainly including the following:

  1. bf.add: Add an element
  2. bf.exists: Check if an element exists
  3. bf.madd: Add multiple elements
  4. bf.mexists: Check if multiple elements exist
  5. bf.reserve: Set the accuracy of the Bloom Filter

Here is an example of how to use it:

127.0.0.1:6379> bf.add user xiaoming
(integer) 1
127.0.0.1:6379> bf.add user xiaohong
(integer) 1
127.0.0.1:6379> bf.add user laowang
(integer) 1
127.0.0.1:6379> bf.exists user laowang
(integer) 1
127.0.0.1:6379> bf.exists user lao
(integer) 0
127.0.0.1:6379> bf.madd user huahua feifei
1) (integer) 1
2) (integer) 1
127.0.0.1:6379> bf.mexists user feifei laomiao
1) (integer) 1
2) (integer) 0

As we can see from the above results, there is no error. Let’s take a look at the usage of bf.reserve, which is used to set the accuracy of the Bloom Filter:

127.0.0.1:6379> bf.reserve user 0.01 200
(error) ERR item exists # The key already exists, so an error will be reported
127.0.0.1:6379> bf.reserve userlist 0.9 10
OK

As illustrated, this command must be executed before any elements are stored, otherwise an error will occur. It has three parameters: key, error_rate, and initial_size.

Where:

  • error_rate: The allowed error rate of the Bloom Filter. The lower the value, the larger the space the filter occupies. This is because this value determines the size of the bit array used to store the results. The larger the space occupied (more information stored), the lower the error rate. The default value is 0.01.
  • initial_size: The size of the elements to be stored in the Bloom Filter. If the actual value stored is greater than this value, the accuracy will decrease. The default value is 100.

The following section will explain the specific reasons why error_rate and initial_size affect accuracy.

Code Implementation #

Next, we will use a Java client to implement Bloom Filter operations. Since Jedis does not directly support Bloom Filter operations, we will use Jedis to execute Lua scripts to implement Bloom Filters. Here is the code:

import redis.clients.jedis.Jedis;
import utils.JedisUtils;

import java.util.Arrays;

public class BloomExample {
    private static final String _KEY = "user";

    public static void main(String[] args) {
        Jedis jedis = JedisUtils.getJedis();
        for (int i = 1; i < 10001; i++) {
            bfAdd(jedis, _KEY, "user_" + i);
            boolean exists = bfExists(jedis, _KEY, "user_" + i);
            if (!exists) {
                System.out.println("Data not found for i=" + i);
                break;
            }
        }
        System.out.println("Execution completed");
    }

    /**
     * Add an element
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfAdd(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.add', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value), ...

[Message clipped]

import redis.clients.jedis.Jedis;
import utils.JedisUtils;

import java.util.Arrays;

public class BloomExample {
    private static final String _KEY = "userlist";

    public static void main(String[] args) {
        Jedis jedis = JedisUtils.getJedis();
        for (int i = 1; i < 100001; i++) {
            bfAdd(jedis, _KEY, "user_" + i);
            boolean exists = bfExists(jedis, _KEY, "user_" + (i + 1));
            if (exists) {
                System.out.println("Found " + i);
                break;
            }
        }
        System.out.println("Execution completed");
    }

    /**
     * Add an element
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfAdd(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.add', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value),
                Arrays.asList());
        if (result.equals(1L)) {
            return true;
        }
        return false;
    }

    /**
     * Check if an element exists
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfExists(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.exists', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value),
                Arrays.asList());
        if (result.equals(1L)) {
            return true;
        }
        return false;
    }
}

But after executing for a while, the result turns out to be:

Execution completed

No errors whatsoever. It’s strange, so we added a 0 after the loop count, and after executing for a long time, we found that the result remained the same, with no errors.

This is because, for a Bloom filter, if it says an element doesn’t exist, it definitely doesn’t, but if it says an element exists, it might not necessarily exist.

So we modified the program, changed to a different key value, and modified the condition to search for existing data. The code is as follows:

import redis.clients.jedis.Jedis;
import utils.JedisUtils;

import java.util.Arrays;

public class BloomExample {
    private static final String _KEY = "userlist";

    public static void main(String[] args) {
        Jedis jedis = JedisUtils.getJedis();
        for (int i = 1; i < 100001; i++) {
            bfAdd(jedis, _KEY, "user_" + i);
            boolean exists = bfExists(jedis, _KEY, "user_" + (i + 1));
            if (exists) {
                System.out.println("Found " + i);
                break;
            }
        }
        System.out.println("Execution completed.");
    }

    /**
     * Add an element
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfAdd(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.add', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value),
                Arrays.asList());
        if (result.equals(1L)) {
            return true;
        }
        return false;
    }

    /**
     * Check if an element exists
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfExists(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.exists', KEYS[1], KEYS[2])";
        Object result = jedis.eval(luaStr, Arrays.asList(key, value),
                Arrays.asList());
        if (result.equals(1L)) {
            return true;
        }
        return false;
    }
}

This time, after executing for a while, we found the following information:

Found 344
Execution completed

This indicates that there’s an error in the loop execution after a while, and the code execution matches the expectation of the Bloom filter.

Principle #

Based on Redis’ implementation of the Bloom filter, it relies on a bit array in its data structure. Instead of directly storing the data in the data structure, which would consume a lot of space, it uses several different unbiased hash functions to evenly store the hash values of an element in the bit array. In other words, each time an element is added, its position is calculated using several unbiased hash functions, and the corresponding positions are set to 1 to complete the addition operation.

When checking an element, it queries whether the values at the hash positions of this element are all 1. If they’re all 1, it means the value exists; if at least one value is 0, it means the value doesn’t exist. Since the position is calculated through hashing, even if the position is 1, it can’t determine which element set it to 1. Therefore, when the Bloom filter queries for the existence of a value, it might not actually exist, but when it queries for the nonexistence of a value, it definitely doesn’t exist.

And when the bit array stores a sparse value, the accuracy of the query is higher, but as more values are stored in the bit array, the error rate also increases.

The relationship between the bit array and the keys is shown in the following diagram:

image.png

Use cases for Bloom Filter #

Some classic use cases for Bloom filters include:

  • Spam email filtering
  • URL deduplication in web crawlers
  • Determining whether an element exists among billions of data

Bloom filters are also widely used in the database domain, for example, HBase, Cassandra, LevelDB, and RocksDB internally use Bloom filters.

Conclusion #

Through this article, we have learned how to enable the Bloom filter using Redis’ modules feature introduced after version 4.0. We have also learned the three important operations of the Bloom filter: bf.add for adding elements, bf.exists for checking element existence, and bf.reserve for setting the accuracy of the Bloom filter. bf.reserve has two important parameters: error rate and array size. A lower error rate and larger array size require more storage space, but relatively speaking, the error rate for queries is lower. How to set them depends on the user’s actual situation. We also know the characteristics of the Bloom filter: when it queries the existence of a value, the value might not actually exist, but when it queries the nonexistence of a value, the value definitely doesn’t exist.