21 Cursor Iterators and Filter Scan

21 Cursor Iterators and Filter Scan #

A “Bloodshed” Caused by a Problem #

There was once an incident where our Redis server stored massive amounts of data, including user login information stored in the form of user_token_id. The operations team wanted to retrieve the login information for all current users, and tragedy struck: because our engineers used keys user_token_* to query the corresponding users, Redis became unresponsive and unavailable, ultimately affecting other online businesses with a series of issues and triggering a flood of system alert messages. Furthermore, the downtime was directly proportional to the amount of stored data - the larger the data set, the longer the system remained unresponsive, resulting in a longer outage duration.

So how can we avoid this problem?

The Solution to the Problem #

Before Redis 2.8, we could only use the keys command to query the data we wanted, but this command had two drawbacks:

  1. This command does not have pagination functionality, meaning that we can only retrieve all key values that meet the given condition at once. If the query result is very large, the output information will also be extensive.
  2. The keys command is a traversal query, so its query time complexity is O(n). Therefore, the larger the data set, the longer the query time.

Fortunately, in Redis 2.8, Scan was introduced to solve these problems. Let’s take a look at how to use Scan.

Using the Scan Command #

First, let’s simulate a large amount of data by using the Pipeline to add 100,000 data entries. The Java code implementation is as follows:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import utils.JedisUtils;

public class ScanExample {
    public static void main(String[] args) {
        // Add 100,000 data entries
        initData();
    }
    public static void initData(){
        Jedis jedis = JedisUtils.getJedis();
        Pipeline pipe = jedis.pipelined();
        for (int i = 1; i < 100001; i++) {
            pipe.set("user_token_" + i, "id" + i);
        }
        // Execute the command
        pipe.sync();
        System.out.println("Data insertion completed");
    }
}

Now, let’s query the data with user IDs starting with 9999*. The usage of the Scan command is as follows:

127.0.0.1:6379> scan 0 match user_token_9999* count 10000
1) "127064"
2) 1) "user_token_99997"
127.0.0.1:6379> scan 127064 match user_token_9999* count 10000
1) "1740"
2) 1) "user_token_9999"
127.0.0.1:6379> scan 1740 match user_token_9999* count 10000
1) "21298"
2) 1) "user_token_99996"
127.0.0.1:6379> scan 21298 match user_token_9999* count 10000
1) "65382"
2) (empty list or set)
127.0.0.1:6379> scan 65382 match user_token_9999* count 10000
1) "78081"
2) 1) "user_token_99998"
   2) "user_token_99992"
127.0.0.1:6379> scan 78081 match user_token_9999* count 10000
1) "3993"
2) 1) "user_token_99994"
   2) "user_token_99993"
127.0.0.1:6379> scan 3993 match user_token_9999* count 10000
1) "13773"
2) 1) "user_token_99995"
127.0.0.1:6379> scan 13773 match user_token_9999* count 10000
1) "47923"
2) (empty list or set)
127.0.0.1:6379> scan 47923 match user_token_9999* count 10000
1) "59751"
2) 1) "user_token_99990"
   2) "user_token_99991"
   3) "user_token_99999"
127.0.0.1:6379> scan 59751 match user_token_9999* count 10000
1) "0"
2) (empty list or set)

From the above execution results, we can see two issues:

  1. The query result is empty, but the cursor value is not 0, indicating that the traversal is not yet complete.
  2. The count is set to 10000, but the number of results returned each time is not consistently 10000, and the value is not fixed. This is because count only limits the number of dictionary slots traversed by the server in a single iteration (approximately), and does not determine the maximum count of results returned.

Syntax:

scan cursor [MATCH pattern] [COUNT count]

Where:

  • cursor: The cursor position, an integer value, starting from 0 and ending with 0. When the query result is empty but the cursor value is not 0, it means that the traversal is not yet complete.
  • MATCH pattern: Regular expression pattern for matching fields.
  • COUNT count: A hint for the server to limit the number of dictionary slots traversed in a single iteration. It’s not the maximum count of returned results. The default value is 10.

Code Implementation #

In this article, we use Java code to implement the Scan query functionality. The code is as follows:

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.ScanParams;
import redis.clients.jedis.ScanResult;
import utils.JedisUtils;

public class ScanExample {
    public static void main(String[] args) {
        Jedis jedis = JedisUtils.getJedis();
        // Define the match and count parameters
        ScanParams params = new ScanParams();
        params.count(10000);
        params.match("user_token_9999*");
        // Cursor
        String cursor = "0";
        while (true) {
            ScanResult<String> res = jedis.scan(cursor, params);
            if (res.getCursor().equals("0")) {
                // Represents the last one
                break;
            }
            cursor = res.getCursor(); // Set the cursor
            for (String item : res.getResult()) {
                // Print the query result
                System.out.println("Query result: " + item);
            }
        }
    }
}

The results of the above program execution are as follows:

Query Result: user_token_99997
Query Result: user_token_9999
Query Result: user_token_99996
Query Result: user_token_99998
Query Result: user_token_99992
Query Result: user_token_99994
Query Result: user_token_99993
Query Result: user_token_99995
Query Result: user_token_99990
Query Result: user_token_99991
Query Result: user_token_99999

Scan command details #

Scan is a series of commands, including Scan itself and the following 3 commands:

  1. HScan: iterates over a hash’s dictionary cursor
  2. SScan: iterates over a set’s cursor
  3. ZScan: iterates over a sorted set’s cursor

Let’s take a look at how to use these commands.

HScan usage #

127.0.0.1:6379> hscan myhash 0 match k2* count 10
1) "0"
2) 1) "k2"
   2) "v2"

Syntax:

hscan key cursor [MATCH pattern] [COUNT count]

SScan usage #

127.0.0.1:6379> sscan myset 0 match v2* count 20
1) "0"
2) 1) "v2"

Syntax:

sscan key cursor [MATCH pattern] [COUNT count]

ZScan usage #

127.0.0.1:6379> zscan zset 0 match red* count 20
1) "0"
2) 1) "redis"
   2) "10"

Syntax:

zscan key cursor [MATCH pattern] [COUNT count]

Scan explanation #

Official description of the Scan command:

Scan guarantees #

The SCAN command, and the other commands in the SCAN family, are able to provide to the user a set of guarantees associated to full iterations.

  • A full iteration always retrieves all the elements that were present in the collection from the start to the end of a full iteration. This means that if a given element is inside the collection when an iteration is started, and is still there when an iteration terminates, then at some point SCANreturned it to the user.
  • A full iteration never returns any element that was NOT present in the collection from the start to the end of a full iteration. So if an element was removed before the start of an iteration, and is never added back to the collection for all the time an iteration lasts, SCAN ensures that this element will never be returned.

However because SCAN has very little state associated (just the cursor) it has the following drawbacks:

  • A given element may be returned multiple times. It is up to the application to handle the case of duplicated elements, for example only using the returned elements in order to perform operations that are safe when re-applied multiple times.
  • Elements that were not constantly present in the collection during a full iteration, may be returned or not: it is undefined.

Official documentation link:

https://redis.io/commands/scan

The meaning of the translation is: Scan and its related commands guarantee the following query rules.

  • It can return all elements that appear in the collection from the start to the end of the query if these elements are not deleted and match the query condition.
  • It guarantees that elements that were deleted before the start of the query will not be returned.

However, Scan commands have the following drawbacks:

  • A given element may be returned multiple times, so the client needs to deduplicate the results.
  • Elements that were not constantly present in the collection during a full iteration may or may not be returned. This behavior is undefined.

Summary #

From this article, we can understand the following four commands related to Scan:

  1. Scan: used to retrieve all data in the current database.
  2. HScan: used to retrieve data from a hash type.
  3. SScan: used to retrieve data from a set type.
  4. ZScan: used to retrieve data from a sorted set.

Scan has the following characteristics:

  1. Scan can match keys.
  2. Scan queries using a cursor, which does not cause Redis to hang.
  3. Scan provides the count parameter to specify the number of iterations.
  4. Scan returns the cursor to the client, allowing the client to continue iterating.
  5. Scan may return duplicate data, so the client needs to deduplicate.
  6. If a single query returns an empty value and the cursor is not 0, it means the iteration has not finished.
  7. Scan guarantees that elements deleted before the start of the query will not be queried.
  8. Scan does not guarantee that elements modified during the iteration can be queried.