11 Why String in Redis Is Not Efficient Anymore

11 Why String in Redis is Not Efficient Anymore #

Starting today, we will enter the “Practical Section”. Over the next 5 lessons, we will learn about “Data Structures”. I will introduce data types and their underlying data structures that save memory and allow for storing and analyzing massive amounts of data. We will also explore specific use cases such as address location queries, time series database read/write operations, and message queue storage and retrieval. Furthermore, I will share specific solutions using Redis data types and module extension capabilities to meet these demands.

Today, let’s first understand the issue of memory consumption with the String data type and explore the solutions for choosing data types that save memory.

Let me share with you a requirement that I encountered in the past.

At that time, we needed to develop an image storage system that could quickly record the Image ID and the corresponding storage ID (referred to as the Image Storage Object ID) within the storage system. Additionally, it was also necessary to be able to quickly retrieve the Image Storage Object ID based on the Image ID.

Due to the large number of images, we decided to represent the Image ID and Image Storage Object ID using 10-digit numbers. For example, if an Image ID is 1101000051, its corresponding ID within the storage system is 3301000051.

photo_id: 1101000051
photo_obj_id: 3301000051

As you can see, the Image ID and Image Storage Object ID have a one-to-one correspondence, which is a typical “key-value” pattern. The term “single value” refers to the value in a key-value pair being a single value rather than a collection, and this aligns perfectly with the storage format provided by the String data type, where one key corresponds to one value.

Moreover, the String data type can store binary byte streams, similar to a “magic treatment”. By converting the data into a binary byte array, it can be stored.

Therefore, our initial solution was to use String to store the data. We saved the Image ID and Image Storage Object ID as the key-value pairs, where the Image Storage Object ID was of type String.

Initially, we stored 100 million images, consuming approximately 6.4GB of memory. However, as the number of images increased, our Redis memory usage also increased, resulting in slow response times due to the generation of RDB by the large memory Redis instance. Clearly, the String data type is not a good choice, and we need to further explore alternative data types that save memory.

In this process, I delved into the underlying structure of the String data type and discovered the reason for its high memory consumption. I gained a new understanding of the String data type, realizing that it is not suitable for all scenarios and has a significant drawback: it consumes a large amount of memory space when storing data.

At the same time, I also carefully studied data structures for sets. I found that sets have a highly efficient underlying implementation that saves memory space. However, the data pattern saved by sets is one where a key corresponds to a series of values, which does not directly apply to storing single value key-value pairs. Therefore, we used a two-level encoding method to achieve storing single value key-value pairs with sets, resulting in a noticeable reduction in Redis instance memory consumption.

In this lesson, I will share with you the knowledge and methods I acquired while solving this problem, including where the memory space is consumed in the String data type, which data structures can save memory, and how to use set types to store single value key-value pairs. If you have encountered high memory consumption issues while using the String data type, you can try today’s solution.

Next, let’s take a look at where the memory of the String data type is consumed.

Why does the String data type have a large memory overhead? #

In the previous case, we stored information for 100 million images, which used approximately 6.4GB of memory. On average, a record consisting of an image ID and image storage object ID used 64 bytes.

However, the actual amount of memory needed for a group of image IDs and their storage object IDs is only 16 bytes.

Let’s analyze the situation. Both the image ID and the image storage object ID are 10-digit numbers, which can be represented by two 8-byte Long types. Since an 8-byte Long type can represent a value up to 2 to the power of 64, it is certainly capable of representing a 10-digit number. So why does the String type use 64 bytes?

In fact, apart from storing the actual data, the String type also requires additional memory to store information such as data length and space usage, which are referred to as metadata. When the amount of actual data being saved is small, the memory overhead of the metadata becomes relatively large, somewhat overshadowing the actual data.

So, how exactly does the String type store data? Let me explain.

When you save a 64-bit signed integer, the String type stores it as an 8-byte Long integer. This storing method is usually referred to as integer encoding.

However, when the data being saved includes characters, the String type uses a structure called the Simple Dynamic String (SDS) to store it, as shown in the following diagram:

  • buf: A byte array that stores the actual data. To indicate the end of the byte array, Redis automatically adds a “\0” at the end, which incurs an additional 1-byte cost.
  • len: Takes up 4 bytes and represents the length of the used portion of buf.
  • alloc: Also takes up 4 bytes and represents the actual allocated length of buf, which is usually greater than len.

As you can see, in SDS, buf holds the actual data, while len and alloc are additional overhead in the SDS structure.

In addition to the memory overhead from SDS, the String type also incurs overhead from the RedisObject structure.

Since Redis has multiple data types and some common metadata needs to be recorded for these different data types (such as last access time and reference count), Redis uses a RedisObject structure to uniformly record this metadata and point to the actual data.

A RedisObject contains 8 bytes of metadata and an 8-byte pointer, which further points to the memory address where the actual data of the specific data type (e.g., the SDS structure of the String type) is stored. You can see a schematic diagram below. I will provide detailed information about the structure of RedisObject in later lessons. For now, it is enough to understand its basic structure and the overhead of metadata.

To save memory space, Redis has designed specialized layouts for Long integers and SDS.

On one hand, when saving a Long integer, the pointer in the RedisObject is directly assigned the value of the integer, eliminating the need for an additional pointer and saving the space overhead of the pointer.

On the other hand, when saving string data and the string length is less than or equal to 44 bytes, the metadata, pointer, and SDS in the RedisObject are laid out contiguously in memory, avoiding memory fragmentation. This layout method is also referred to as the embstr encoding.

Of course, when the string length exceeds 44 bytes, the amount of data in the SDS increases, and Redis no longer lays out the SDS and RedisObject together. Instead, it allocates separate space for the SDS and uses a pointer to point to the SDS structure. This layout method is known as the raw encoding mode.

To help you understand the int, embstr, and raw encoding modes, I have created a diagram shown below:

Well, now that you understand the additional metadata overhead in RedisObject, we can calculate the memory usage of the String type.

Since the image ID and image storage object ID, both 10-digit numbers, can be saved using int encoding in a RedisObject. The metadata portion of each int-encoded RedisObject takes up 8 bytes, and the pointer portion is directly assigned an 8-byte integer. In this case, each ID will use 16 bytes, so the total is 32 bytes. But where did the other 32 bytes go?

I mentioned in [Lesson 2] that Redis uses a global hash table to store all key-value pairs, and each item in the hash table is a dictEntry structure that points to a key-value pair. The dictEntry structure contains three 8-byte pointers: one for the key, one for the value, and one for the next dictEntry. These three pointers add up to 24 bytes, as shown in the following diagram:

However, even though these three pointers occupy only 24 bytes, why does it take up 32 bytes? This brings us to the memory allocator used by Redis, jemalloc.

When jemalloc allocates memory, it finds the next power of 2 that is greater than or equal to the number of bytes requested (let’s say N) as the allocated space to reduce the frequency of allocations.

For example, if you request 6 bytes, jemalloc will actually allocate 8 bytes; if you request 24 bytes, jemalloc will allocate 32 bytes. Therefore, in the scenario we just discussed, the dictEntry structure occupies 32 bytes.

Now that you understand the situation, you should realize that even though there are only 16 bytes of actual information, using the String type to store image ID and image storage object ID requires 64 bytes of memory space. Let’s do the math: if you have 100 million images to store, then you would need 6.4GB of memory space for the 100 million image ID records, with 4.8GB of that memory space being used to save metadata. The additional memory overhead is significant. So, is there a more memory-efficient method?

Which data structure can save memory? #

Redis has a low-level data structure called a compressed list (ziplist), which is a very memory-saving structure.

Let’s review the composition of a compressed list. The list head has three fields: zlbytes, zltail, and zllen, which represent the length of the list, the offset of the list tail, and the number of entries in the list, respectively. The compressed list tail also has a zlend, which indicates the end of the list.

The reason why compressed lists can save memory is that they use a series of continuous entries to store data. The metadata of each entry includes the following parts:

  • prev_len: Represents the length of the previous entry. prev_len has two possible values: 1 byte or 5 bytes. When the value is 1 byte, it means that the length of the previous entry is less than 254 bytes. Although a 1-byte value can represent a range of values from 0 to 255, the default value of zlend in the compressed list is 255. Therefore, 255 is used to represent the end of the entire compressed list, and the value 255 cannot be used in other places that represent length. So, when the length of the previous entry is less than 254 bytes, prev_len takes 1 byte as the value. Otherwise, it takes 5 bytes as the value.
  • len: Represents the length of the entry itself, 4 bytes;
  • encoding: Represents the encoding method, 1 byte;
  • content: Stores the actual data.

These entries are placed one by one in memory without the need for additional pointers to connect them, saving the space occupied by the pointers.

Let’s take an example of saving the ID of a picture storage object to analyze how compressed lists save memory space.

Each entry saves the ID of a picture storage object (8 bytes). In this case, prev_len of each entry only needs 1 byte because the length of the previous entry for each entry is only 8 bytes, which is less than 254 bytes. In this way, the amount of memory occupied by the ID of a picture storage object is 14 bytes (1+4+1+8=14), but actually allocated 16 bytes.

Redis implemented collection types like Lists, Hashes, and Sorted Sets based on compressed lists. The biggest benefit of doing this is saving the overhead of dictEntry. When you use the String type, a key-value pair has a dictEntry, which takes up 32 bytes of space. But when using collection types, a key corresponds to the data of a collection, which can store much more data, but only one dictEntry is used, saving memory.

This solution sounds good, but there is still a problem. When using collection types to store key-value pairs, one key corresponds to the data of a collection. But in our scenario, one image ID only corresponds to the ID of a picture storage object. So, how can we use a collection type to save this single-value key-value pair?

How to save key-value pairs with a collection type? #

When saving key-value pairs with a single value, you can use a two-level coding method based on the Hash type. The two-level encoding mentioned here means splitting a single value data into two parts, using the first part as the key of the Hash collection, and the second part as the value of the Hash collection. In this way, we can save the single value data in the Hash collection.

For example, taking the image ID 1101000060 and the image storage object ID 3302000080, we can use the first 7 digits of the image ID (1101000) as the key of the Hash type, and use the last 3 digits of the image ID (060) and the image storage object ID as the key and value of the Hash type.

According to this design method, I inserted a set of records of image IDs and their storage object IDs into Redis and used the info command to check the memory consumption. I found that after adding a record, the memory usage only increased by 16 bytes, as shown below:

127.0.0.1:6379> info memory
# Memory
used_memory:1039120
127.0.0.1:6379> hset 1101000 060 3302000080
(integer) 1
127.0.0.1:6379> info memory
# Memory
used_memory:1039136

When using the String type, each record consumes 64 bytes, but this method only uses 16 bytes, which is 1/4 of the original memory space, meeting our requirement to save memory space.

However, you may have doubts: “Do we have to use the first 7 digits of the image ID as the key of the Hash type and the last 3 digits as the key of the Hash type value in the two-level encoding method?” In fact, the ID length used in the two-level encoding method is critical.

In Lecture 2, I introduced two underlying implementation structures of the Redis Hash type, namely, ziplist and hashtable.

So, when does the Hash type use ziplist and when does it use hashtable? In fact, the Hash type has two thresholds when it is set to save data using ziplist. Once these thresholds are exceeded, the Hash type will switch to using hashtable to save data.

These two thresholds correspond to the following two configuration items:

  • hash-max-ziplist-entries: represents the maximum number of elements in the hash set when saved using ziplist.
  • hash-max-ziplist-value: represents the maximum length of a single element in the hash set when saved using ziplist.

If the number of elements we write to the Hash set exceeds hash-max-ziplist-entries, or if the size of an individual element exceeds hash-max-ziplist-value, Redis will automatically switch the implementation structure of the Hash type from ziplist to hashtable.

Once it has switched from ziplist to hashtable, the Hash type will continue to use hashtable for storage and will not switch back to ziplist. In terms of saving memory space, hashtable is not as efficient as ziplist.

In order to fully utilize the compact memory layout of ziplist, we generally control the number of elements saved in the Hash set. So, in the two-level encoding we just mentioned, we only use the last 3 digits of the image ID as the key of the Hash set, which also ensures that the number of elements in the Hash set does not exceed 1000. At the same time, we set hash-max-ziplist-entries to 1000, so that the Hash set can continue to use ziplist to save memory space.

Summary #

In this lesson, we have debunked the misconception about String being versatile and suitable for all occasions. When the memory space occupied by the key-value pair itself is not large (such as the mentioned image ID and image storage object ID), the metadata overhead of the String type dominates. This includes the memory overhead of the RedisObject structure, SDS structure, and dictEntry structure.

For such cases, we can use the compressed list (压缩列表) to store the data. Of course, when using a Hash collection type to store key-value pairs with single-value data, we need to split the single-value data into two parts and use them as the key and value of the Hash collection. Just like in the previous example, we used binary encoding to represent the image ID. I hope you can apply this method to your own scenarios.

Finally, I would like to provide you with a small tip: if you want to know the memory overhead when storing key-value pairs using different types, you can enter the length of your key-value pair and the data type used in this website. This way, you can know the actual memory consumption. I recommend using this tool, as it can help you save memory efficiently.

One question per lesson #

As usual, here’s a little question for you: Besides the String and Hash types, do you think there are any other suitable types that can be applied to the example of saving images discussed in this lesson?

Feel free to write down your thoughts and answers in the comments section. Let’s exchange and discuss together. You are also welcome to share today’s content with your friends.