28 How Pika Implements Large-capacity Redis Based on SSD #

When using Redis in applications, as business data increases (for example, in e-commerce businesses with growing user and product counts), Redis needs to store more and more data. You might think of using a Redis sharded cluster to distribute the data across multiple instances. However, this approach has a problem: if the total amount of data to be stored is large while each instance stores only a modest amount, the cluster will need a large number of instances, making cluster management more complex and increasing costs.

You might then suggest increasing the memory capacity of individual Redis instances to create instances with large memory, allowing each instance to store more data. In this way, when storing the same amount of data, the number of large memory instances required will be reduced, saving costs.

This is a good idea, but it is not a perfect solution. Large-memory instances can cause a series of potential problems during instance recovery and master-slave synchronization, such as longer recovery times, high failover costs, and replication buffer overflow.

What can be done then? I recommend using Solid State Drives (SSDs). SSDs are inexpensive (roughly one-tenth the per-GB cost of memory), offer large capacity, and provide fast read and write speeds. We can use SSDs to build large-capacity Redis instances. Pika, a key-value database developed jointly by the 360 DBA team and the Infrastructure team, meets exactly this requirement.

When Pika was initially designed, it had two goals: first, to enable a single instance to store large amounts of data while avoiding potential problems during instance recovery and master-slave synchronization; second, to maintain compatibility with Redis data types and support smooth migration of applications using Redis to Pika. Therefore, if you have been using Redis and want to use SSD to expand the capacity of a single instance, Pika is a good choice.

In this lesson, I will talk to you about Pika. Before introducing Pika, let me explain in detail the potential problems of implementing large-capacity Redis instances based on big memory. Only by understanding these problems can we choose a more suitable solution. Additionally, I will guide you step by step to analyze how Pika achieves the two design goals we discussed earlier and solves these problems.

Potential Issues with Large Memory Redis Instances #

Redis uses memory to store data, and increasing the memory capacity can lead to two potential issues: low efficiency in generating and recovering RDB snapshots, and increased time and buffer overflow during full synchronization between master and slave nodes. Let’s explain each of these in detail.

First, let’s look at the impact on RDB snapshots. The relationship between memory size and RDB snapshots is very direct: as the memory capacity of the instance increases, the RDB file will also increase in size. As a result, the fork time during RDB file generation will also increase, causing the Redis instance to block. Furthermore, the time it takes to recover using the RDB file will also increase, resulting in an extended period where Redis cannot provide services.

Next, let’s consider the impact on master-slave synchronization.

The first step in the synchronization between master and slave nodes is to perform a full synchronization. Full synchronization involves the master node generating an RDB file and transmitting it to the slave nodes, which then load the file. Now imagine that if the RDB file is very large, it will undoubtedly increase the time for full synchronization, resulting in reduced efficiency. Moreover, it may also lead to buffer overflow during replication. Once the buffer overflows, the master and slave nodes will start full synchronization again, affecting the normal operation of the business application. If we increase the capacity of the replication buffer, it will consume valuable memory resources.

In addition, if the master fails and a master-slave switch occurs, all other slave nodes need to perform another full synchronization with the new master. If the RDB file is large, it will also increase the time for master-slave switching, which will affect the availability of the business.

So, how does Pika solve these two issues? This requires mentioning the key modules in Pika, namely RocksDB, binlog mechanism, and Nemo. These modules are all important components of the Pika architecture. Therefore, next, let’s have a look at the overall architecture of Pika.

Architecture of Pika #

The overall architecture of the Pika key-value database consists of five components: the network framework, the Pika thread module, the Nemo storage module, RocksDB, and the binlog mechanism, as shown in the following diagram:

These five components implement different functions, and I will introduce them one by one.

Firstly, the network framework is mainly responsible for receiving and sending low-level network requests. Pika’s network framework encapsulates the low-level network functions provided by the operating system. When Pika communicates over the network, it can directly invoke the functions encapsulated by the network framework.

Secondly, the Pika thread module adopts a multi-threaded model to handle client requests. It includes a dispatch thread, a group of worker threads, and a thread pool.

The dispatch thread is dedicated to listening to the network port. Once a connection request is received from a client, it establishes a connection with the client and hands over the connection to a worker thread for processing. The worker thread is responsible for receiving specific command requests sent by the client on the connection and encapsulating the command requests into tasks. These tasks are then handed over to the threads in the thread pool, where the actual data storage and retrieval operations are performed, as shown in the following diagram:

When using Pika in practical applications, we can increase the number of worker threads and the number of threads in the thread pool to improve the request processing throughput of Pika and meet the performance requirements of data processing at the business level.
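The dispatch/worker/pool flow described above can be sketched in Python (an illustrative model only; Pika itself is implemented in C++, and all names and structures below are hypothetical stand-ins, not Pika's actual code):

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

# Collected results of the storage operations performed by pool threads.
results = queue.Queue()

def handle_command(cmd):
    # Stand-in for the actual data storage/retrieval work done by a
    # thread-pool thread.
    return f"processed {cmd}"

def worker(conn_commands, pool):
    # A worker thread receives commands on its connection, wraps each
    # one as a task, and hands it to the thread pool.
    for cmd in conn_commands:
        future = pool.submit(handle_command, cmd)
        results.put(future.result())

with ThreadPoolExecutor(max_workers=4) as pool:
    # The dispatch thread would assign each incoming connection to a
    # worker thread; here we simulate two connections.
    conns = [["SET k1 v1", "GET k1"], ["HSET h f v"]]
    workers = [threading.Thread(target=worker, args=(c, pool)) for c in conns]
    for t in workers:
        t.start()
    for t in workers:
        t.join()

print(results.qsize())  # 3 commands processed
```

Increasing `max_workers` and the number of worker threads corresponds to the tuning knob mentioned above for raising Pika's request throughput.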

The Nemo module is easy to understand. It implements compatibility between Pika and Redis data types. This means that when we migrate Redis services to Pika, we don’t need to modify the code in the business application that operates Redis. We can continue to apply the experience of managing Redis, which reduces the learning cost of Pika. The specific data type conversion mechanism implemented by the Nemo module is what we need to focus on, and I will provide a detailed introduction below.

Lastly, let’s take a look at the data storage capability provided by RocksDB based on SSD. It allows Pika to store more data without requiring large amounts of memory and avoids the use of memory snapshots. In addition, Pika uses the binlog mechanism to record write commands for command synchronization between master and slave nodes, avoiding potential issues with large memory instances during the synchronization process.

Next, let’s take a closer look at how Pika uses RocksDB and the binlog mechanism.

How does Pika store more data on SSD? #

To store data on SSD, Pika utilizes the widely adopted persistent key-value database RocksDB. The implementation mechanism of RocksDB is complex, but for the purpose of understanding and learning about Pika, it is sufficient to grasp the basic data read-write mechanism of RocksDB. Let me explain this basic mechanism.

Here, I will use an image to provide a specific introduction to the basic process of writing data in RocksDB.

[Figure: the basic write process in RocksDB]

When Pika needs to store data, RocksDB uses two small blocks of memory (Memtable1 and Memtable2) to alternately cache the written data. The size of a Memtable is configurable, usually ranging from a few MB to tens of MB. Data written to RocksDB goes into Memtable1 first. Once Memtable1 is full, RocksDB flushes its contents to the underlying SSD as a file and switches to Memtable2 to cache newly written data. Once Memtable1 has been flushed and Memtable2 fills up in turn, RocksDB swaps back and uses Memtable1 again to cache new writes.

From this analysis, you can see that RocksDB first caches the data using Memtable and then quickly writes the data to the SSD. Even if the data volume is large, all the data can be saved to the SSD. Moreover, the Memtable itself has a relatively small capacity, so even if RocksDB uses two Memtables, it does not occupy too much memory. Therefore, when Pika stores large capacity data, it does not require too much memory space.

When Pika needs to read data, RocksDB first queries the Memtable to see if the data to be read is present. This is because the newest data is always written to the Memtable first. If there is no data to be read in the Memtable, RocksDB will then query the data files stored on the SSD, as shown in the image below:

[Figure: the RocksDB read path, checking the Memtable before SSD files]
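The alternating-memtable write path and memtable-first read path can be captured in a toy sketch (illustrative only; this is not the real RocksDB API, and `TinyRocks` is a hypothetical name):

```python
class TinyRocks:
    """Toy model of RocksDB's alternating-memtable read/write path."""

    def __init__(self, memtable_limit=2):
        self.limit = memtable_limit
        self.memtables = [{}, {}]   # Memtable1 and Memtable2
        self.active = 0             # index of the memtable taking writes
        self.ssd_files = []         # immutable files on SSD, newest first

    def put(self, key, value):
        mt = self.memtables[self.active]
        mt[key] = value
        if len(mt) >= self.limit:
            # The full memtable is flushed to an SSD file as sorted data,
            # and writes switch to the other memtable.
            self.ssd_files.insert(0, dict(sorted(mt.items())))
            self.memtables[self.active] = {}
            self.active = 1 - self.active

    def get(self, key):
        # Reads check the memtables first (newest data), then SSD files.
        for mt in self.memtables:
            if key in mt:
                return mt[key]
        for f in self.ssd_files:
            if key in f:
                return f[key]
        return None

db = TinyRocks()
db.put("a", "1"); db.put("b", "2"); db.put("c", "3")
print(db.get("a"), db.get("c"))  # "a" comes from an SSD file, "c" from a memtable
```

The key property shown here is the one the text relies on: memory usage stays bounded by the small memtables, while the bulk of the data accumulates in files on the SSD.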

So far, you have learned that when RocksDB is used to store data, Pika can store a large amount of data on high-capacity SSDs, thus achieving large capacity instances. However, as I mentioned earlier, when Redis uses large memory instances to store a large amount of data, it faces efficiency issues with RDB generation and recovery, as well as efficiency and buffer overflow issues during master-slave synchronization. Will Pika face the same problems when storing a large amount of data?

Actually, it won’t. Let’s analyze it.

On the one hand, Pika saves data files based on RocksDB, so data can be restored directly by reading the data files without the need for memory snapshots. In addition, when the slave performs full synchronization, it can directly copy the data files from the master instead of using memory snapshots. As a result, Pika avoids the problem of low efficiency in generating large memory snapshots.

On the other hand, Pika uses the binlog mechanism to achieve incremental command synchronization, which saves memory and avoids buffer overflow issues. The binlog is a file stored on the SSD, and when Pika receives a write command, it writes the command operation to the binlog file while writing the data to the Memtable. Similar to Redis, after full synchronization, the slave will read the unsynchronized commands from the binlog file to keep its data consistent with the master. During incremental synchronization, the slave sends its replicated offset to the master, and the master sends the unsynchronized commands to the slave to ensure data consistency between the master and slave.

However, compared to Redis's in-memory replication buffer, the binlog has obvious advantages. Because the binlog is a file on the SSD, its size is not capped by memory capacity the way a buffer is. Moreover, when a binlog file grows large, a new one can be created through a rotation operation and the old file kept separately. In this way, even if a Pika instance stores a large amount of data, synchronization will not run into buffer overflow.
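The append-then-replay-from-offset idea behind binlog-based incremental synchronization can be sketched as follows (the file format and function names are hypothetical simplifications, not Pika's actual binlog format):

```python
import os
import tempfile

binlog_dir = tempfile.mkdtemp()
binlog_path = os.path.join(binlog_dir, "binlog.0")

def append_command(cmd):
    # Master side: persist each write command to the binlog file on SSD
    # (in Pika this happens alongside the write to the Memtable).
    with open(binlog_path, "a") as f:
        f.write(cmd + "\n")

def read_from_offset(offset):
    # Slave side: replay everything after the last replicated byte offset.
    with open(binlog_path) as f:
        f.seek(offset)
        data = f.read()
    return data.splitlines(), offset + len(data)

append_command("SET k1 v1")
append_command("SET k2 v2")
cmds, offset = read_from_offset(0)           # slave replays both commands
append_command("DEL k1")
new_cmds, offset = read_from_offset(offset)  # only the new command
print(cmds, new_cmds)
```

Because the "buffer" is a file, the slave can fall arbitrarily far behind without triggering an overflow; it simply resumes reading from its saved offset.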

Now, let’s summarize briefly. Pika uses RocksDB to store a large amount of data on SSD, while avoiding the generation and recovery issues of memory snapshots. Moreover, Pika uses the binlog mechanism for master-slave synchronization, avoiding the impact of large memory usage. Thus, Pika achieves its first design goal.

Next, let’s take a look at how Pika achieves its second design goal, which is compatibility with Redis. After all, if it is not compatible, businesses that originally used Redis would not be able to smoothly migrate to Pika and utilize its advantages in storing large capacity data.

How does Pika achieve Redis data type compatibility? #

Pika uses RocksDB as the underlying storage to store data. However, RocksDB only provides a key-value type for single values, while Redis allows values in key-value pairs to be of collection types.

For Redis String type, it is itself a key-value pair of single values, so we can directly use RocksDB to store it. However, for collection types, we cannot directly save collections as key-value pairs of single values, and we need to perform conversion operations.

To maintain compatibility with Redis, Pika’s Nemo module is responsible for converting Redis collection types into key-value pairs of single values. In simple terms, we can divide Redis collection types into two categories:

  • One category is List and Set types, which also contain only single values in the collection.
  • The other category is Hash and Sorted Set types, which have paired elements in the collection. For Hash collections, the elements are of the field-value type, while for Sorted Set collections, the elements are of the member-score type.

Through conversion operations, the Nemo module converts these four collection types into key-value pairs of single values. So how is the conversion done for each type? Let’s look at them one by one.

First, let’s look at the List type. In Pika, the key of the List collection is embedded in the key of the key-value pair of single values, represented by the key field. The value of the List collection is embedded in the value of the key-value pair of single values, represented by the value field. Because the elements in the List collection are ordered, the Nemo module also adds a sequence field after the key of the key-value pair, indicating the current element’s order in the List. In addition, the Nemo module adds previous sequence and next sequence fields in front of the value, representing the previous and next elements of the current element.

In addition, in front of the key of the key-value pair, the Nemo module adds a value “l” to indicate that the data is of List type, and adds a size field of 1 byte to indicate the size of the List collection key. After the value of the key-value pair, the Nemo module also adds version and ttl fields, representing the current data’s version number and remaining expiration time (to support the expired key feature), as shown below:
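Putting the fields described above together, a List element's encoding might look like the following sketch (the field layout is simplified for illustration and is not Pika's exact binary format):

```python
import struct

def encode_list_element(list_key, sequence, prev_seq, next_seq,
                        element, version, ttl):
    """Simplified sketch of a Nemo-style List element encoding."""
    # Key: type flag 'l', 1-byte key size, the List key, then the
    # element's sequence number (keeps elements ordered in RocksDB).
    key = (b"l" + struct.pack("B", len(list_key)) + list_key.encode()
           + struct.pack(">I", sequence))
    # Value: previous/next sequence links, the element itself,
    # then the version number and remaining expiration time.
    value = (struct.pack(">II", prev_seq, next_seq) + element.encode()
             + struct.pack(">II", version, ttl))
    return key, value

k, v = encode_list_element("mylist", 2, 1, 3, "hello", 0, 0)
print(k.hex())
```

The prev/next sequence links in the value are what let Pika traverse the List in order even though each element is stored as an independent RocksDB key-value pair.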

Now let’s look at the Set collection.

The key and member value of the Set collection are both embedded in the key of the key-value pair of Pika’s single values, represented by the key and member fields respectively. Similarly to the List collection, the key of the key-value pair of single values has a value “s” in front to indicate that the data is of Set type, and it also has a size field to indicate the size of the key. The value of the key-value pair of Pika’s single values only saves the data’s version information and remaining expiration time, as shown below:

For the Hash type, the key of the Hash collection is embedded in the key of the key-value pair of single values, represented by the key field, and the field of the Hash collection element is also embedded in the key immediately after the key field, represented by the field field. The value of the Hash collection element is embedded in the value of the key-value pair of single values, and also carries version information and remaining expiration time, as shown below:
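A minimal sketch of this Hash layout follows (simplified for illustration, not Pika's exact binary format): the Hash key and the field both go into the RocksDB key, while the field's value goes into the RocksDB value.

```python
import struct

def encode_hash_field(hash_key, field, value, version, ttl):
    """Simplified sketch of a Nemo-style Hash field encoding."""
    # Key: type flag 'h', 1-byte key size, the Hash key, then the field.
    key = (b"h" + struct.pack("B", len(hash_key)) + hash_key.encode()
           + field.encode())
    # Value: the field's value plus version and remaining expiration time.
    val = value.encode() + struct.pack(">II", version, ttl)
    return key, val

k, v = encode_hash_field("user:1", "name", "alice", 0, 0)
print(k, v)
```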

Finally, for the Sorted Set type, this type needs to be sorted based on the score value of the collection elements, while RocksDB only supports sorting based on the key of the key-value pair of single values. Therefore, when converting data, the Nemo module embeds Sorted Set collection key, element’s score, and member value all into the key of the key-value pair of single values. At this time, the value of the key-value pair of single values only saves the data’s version information and remaining expiration time, as shown below:
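The point of embedding the score in the RocksDB key is that RocksDB's byte-wise key ordering then yields score ordering for free. The sketch below illustrates this (a simplified, hypothetical layout, not Pika's exact binary format; big-endian packing makes byte comparison match numeric comparison for the non-negative scores used here):

```python
import struct

def encode_zset_member(zset_key, score, member):
    """Simplified sketch of a Sorted Set encoding: key, score, and
    member all live in the RocksDB key."""
    return (b"z" + struct.pack("B", len(zset_key)) + zset_key.encode()
            + struct.pack(">Q", score) + member.encode())

keys = [encode_zset_member("rank", s, m)
        for s, m in [(300, "carol"), (100, "alice"), (200, "bob")]]

# Sorting the encoded keys byte-wise recovers the score order.
prefix_len = 2 + len("rank") + 8  # flag + size byte + key + packed score
members = [k[prefix_len:].decode() for k in sorted(keys)]
print(members)  # → ['alice', 'bob', 'carol']
```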

With the above conversion method, Pika not only supports Redis data types but also preserves the characteristics of these data types, such as preserving the order of elements in the List and sorting elements in the Sorted Set based on the score. Once you understand Pika’s conversion mechanism, you will realize that if you have a business application plan to switch from using Redis to Pika, you don’t have to worry about the need to modify the business application due to incompatible operation interfaces.

Based on the analysis just now, we can learn that Pika can save large amounts of data based on SSD and is compatible with Redis, which are its two advantages. Next, let’s take a look at the other advantages and potential drawbacks of Pika compared to Redis. When using Pika in actual application, you need to pay special attention to its drawbacks, as they may require you to configure the system or optimize parameters.

Other Advantages and Disadvantages of Pika #

Compared to Redis, the most notable feature of Pika is its use of SSD for data storage. The direct benefit of this feature is that a single instance of Pika can store more data, enabling instance data expansion.

In addition, using SSD for data storage provides two additional advantages for Pika.

Firstly, fast instance restart. When data is written to the database, Pika saves it directly to the SSD. Therefore, when a Pika instance restarts, it can read the data directly from the SSD files, without having to reload all the data from the RDB file or replay all the operations from the AOF file, greatly improving the restart speed of Pika instances and enabling quick processing of business application requests.

Furthermore, the risk of full synchronization between master and slave databases is low. Pika achieves incremental synchronization of write commands through the binlog mechanism, which is not limited by the size of the memory buffer. Therefore, even in cases where the data volume is large and the synchronization between the master and slave databases takes a long time, Pika does not need to worry about buffer overflow triggering a full synchronization between the master and slave databases.

However, as I mentioned earlier in the course, “every coin has two sides,” Pika also has its own disadvantages.

Although Pika maintains Redis operation interfaces and supports capacity expansion, storing data on SSD reduces data access performance. This is because operations cannot be executed purely in memory but must reach the underlying SSD, which inevitably affects Pika’s performance. Additionally, the write commands recorded by the binlog mechanism must also be persisted to the SSD, which further reduces Pika’s write performance.

However, Pika’s multi-threaded model allows multiple threads to read and write data simultaneously, which partially compensates for the performance loss caused by accessing data on the SSD. Of course, you can also use high-performance SSDs to improve access performance and reduce the impact of SSD reads and writes on Pika’s performance.

To help you gain a more intuitive understanding of Pika’s performance, I will provide you with a table, which is from the Pika official website.

These data are performance results of basic operations for the String and Hash types in the multi-threaded scenario, based on Pika version 3.2. From the table, we can see that without writing binlog, the performance of Pika’s SET/GET and HSET/HGET operations can reach over 200K OPS. However, once the write binlog operation is added, the performance of SET and HSET operations decreases by approximately 41% to only about 120K OPS.

Therefore, when using Pika, we need to balance between the necessity of expanding a single instance and the potential performance loss. If our primary requirement is to store a large amount of data, then Pika is a good solution.

Summary #

In this lesson, we learned about Pika, a technical solution for scaling single Redis instances based on SSD. Compared with Redis, Pika has obvious advantages: it supports Redis operation interfaces and can store large amounts of data. If you were already using Redis and now need to scale, Pika would undoubtedly be a good choice, requiring little additional effort for code migration or operations management.

However, since Pika stores data on SSD, read and write performance is weaker than Redis. To address this, I have two suggestions to minimize the impact on Pika’s performance caused by reading and writing to the SSD:

  1. Utilize Pika’s multi-threaded model and increase the number of threads to improve Pika’s concurrent request processing capability.
  2. Configure high-performance SSDs for Pika to enhance the SSD’s access performance.

Finally, I want to give you a small tip. Pika itself provides many tools that can help us migrate Redis data to Pika or forward Redis requests to Pika. For example, we can use the aof_to_pika command, specifying the Redis AOF file and Pika connection information, to migrate Redis data to Pika as shown below:

aof_to_pika -i [Redis AOF file] -h [Pika IP] -p [Pika port] -a [authentication information]

You can find information about these tools directly on Pika’s GitHub. Moreover, Pika itself is still under development, so I also recommend that you take a look at the GitHub repository to further understand it. This way, you can stay up-to-date with the latest developments in Pika and apply it better to your business practice.

One question per lesson #

As usual, I have a small question for you. In this lesson, I introduced using SSD as an extension for memory capacity to increase the data storage of Redis instances. I would like to discuss with you whether we can use mechanical hard drives as an instance capacity extension and what are the advantages and disadvantages.

Please share your thoughts and answers in the comments section so that we can exchange and discuss. If you found today’s content helpful, feel free to share it with your friends or colleagues. See you in the next lesson.