02 Implementation of String in Key Value Pairs Using Char or Structure

02 Implementation of String in Key-Value Pairs Using char or Structure #

Strings are very common in our everyday application development, such as when we need to record user information, product information, status information, and so on. These all require the use of strings.

For Redis, the key in a key-value pair is a string, and sometimes the value is also a string. For example, when we write user information into Redis, we may record the user’s name, gender, city, etc. These are all strings, as shown below:

SET user:id:100 {"name": "zhangsan", "gender": "M", "city":"beijing"}

In addition, the commands and data exchanged between Redis instances and clients are also represented in strings.

Therefore, since the usage of strings is so extensive and crucial, when implementing strings, we should strive to meet the following three requirements:

  • Support a wide range of efficient string operations, such as string concatenation, copying, comparison, and getting the length.
  • Be able to store arbitrary binary data, such as images.
  • Minimize memory overhead as much as possible.

In fact, if you have developed C language programs before, you should know that strings can be implemented using character arrays (char) in the C language. The C language standard library string.h also defines various string manipulation functions, such as the string comparison function strcmp, string length calculation function strlen, string concatenation function strcat, etc. This allows developers to directly call these functions to perform string operations.

So it seems that Redis can completely reuse the implementation of strings in the C language, right?

However, in reality, when using C language strings, we often need to manually check and allocate string space, which increases the workload of code development. Moreover, data such as images cannot be saved as strings, which limits the scope of application.

Therefore, from a system design perspective, how should we design and implement strings?

In fact, Redis has designed a structure called Simple Dynamic String (SDS) to represent strings. Compared to the implementation of strings in the C language, the implementation of SDS improves the efficiency of string operations and can be used to store binary data.

So in today’s class, I will introduce to you the design concepts and implementation techniques of the SDS structure. This way, you can understand the deficiencies of implementing strings with char* and the advantages of SDS, as well as learn the implementation techniques of compact memory structures. If you want to implement string types in your own system software, you can refer to Redis’s design concepts to improve operation efficiency and save memory overhead.

Alright, next let’s understand why Redis did not reuse the implementation methods for strings in the C language.

Why doesn’t Redis use char*? #

In order to answer this question, we need to understand the structure characteristics of the char* string array and what Redis needs in terms of strings. So let’s analyze it in more detail.

Structure of char* #

First, let’s look at the structure of the char* string array.

The structure of the char* string array is very simple. It is a contiguous block of memory that stores each character of the string. For example, the following diagram shows the array structure of the string “redis” using char*:

‘r’ ’e' ’d' ‘i’ ’s' ‘\0’

As shown in the diagram, the last character of the character array is ‘\0’. What is the purpose of this character? In fact, when operating on strings in the C language, the char* pointer only points to the start of the character array, and the ‘\0’ at the end of the character array represents the end of the string.

In this way, the string manipulation functions in the C standard library will check whether there is a ‘\0’ in the character array to determine the end of the string. For example, the strlen function is a string manipulation function that returns the length of a string. This function traverses each character in the character array and counts them until the checked character is ‘\0’. At this point, the strlen function stops counting and returns the number of characters it has counted. The following diagram shows the execution process of the strlen function:

[Diagram: strlen execution process]

Let’s use some code to see how the ‘\0’ termination character affects the length of a string. Here, I created two string variables, a and b, and assigned them the values “red\0is” and “redis\0” respectively. Then, I use the strlen function to calculate the length of these two strings, as shown below:

#include <stdio.h>
#include <string.h>
int main()
{
   char *a = "red\0is";
   char *b = "redis\0";
   printf("%lu\n", strlen(a));
   printf("%lu\n", strlen(b));
   return 0;
}

After executing this code, the output will be 3 and 5, indicating that the lengths of a and b are 3 and 5 respectively. This is because in a, the end character ‘\0’ is after the 3 characters “red”, and in b, the end character is after the 5 characters “redis”.

In other words, the char* string uses ‘\0’ to indicate the end of the string, which actually has a negative impact on storing data. If the data we want to store already contains ‘\0’, the data will be truncated at the ‘\0’ position, which does not meet Redis’s requirement to store arbitrary binary data.

Complexity of operation functions #

In addition to the design issues of the char* string array structure, using ‘\0’ as the end character of a string, although it allows string manipulation functions to determine the end position of the string, it also brings another negative impact, which is an increase in the complexity of operation functions.

Let’s take the strlen function as an example again. This function needs to traverse each character in the character array to obtain the length of the string, so the complexity of this operation function is O(N).

Let’s also look at another commonly used operation function: the strcat function. The strcat function appends a source string src to the end of a target string dest. The code for this function is shown below:

char *strcat(char *dest, const char *src) {
   char *tmp = dest;
   while(*dest)
      dest++;
   while((*dest++ = *src++) != '\0')
   return tmp;
}

From the code, we can see that the strcat function is similar to the strlen function. Both have high complexity and both need to determine the end of the target string by traversing the string. In addition, when appending the source string to the end of the target string, it is necessary to ensure that the target string has enough available space, otherwise appending will not be possible.

Therefore, it requires developers to ensure that the target string has enough space when calling strcat, otherwise dynamic allocation of space is required, which increases the complexity of programming. Once the complexity of operation functions increases, it will affect the efficiency of string operations, which does not meet Redis’s requirement for efficient string operations.

Well, considering the two major drawbacks of using char* to implement strings in the C language, now we need to find a new way to implement strings. So next, let’s learn how Redis designs and considers the implementation of strings.

Design Philosophy of SDS #

Since Redis is developed using the C language, in order to reuse the string manipulation functions provided by the C standard library as much as possible, Redis retains the use of character arrays to store actual data. However, unlike the use of character arrays alone in the C language, Redis specifically designed the Simple Dynamic String (SDS) data structure. Let’s take a look together.

SDS Structure Design #

First, the SDS structure contains a character array buf[] to store the actual data. In addition, the SDS structure also contains three metadata: the length len of the character array, the allocated space length alloc for the character array, and the SDS type flags. Redis defines multiple data types for len and alloc, which can be used to represent different types of SDS, and I will explain them in detail later. The figure below shows the structure of SDS, you can take a look first.

SDS Structure

In addition, if you have looked up the definition of SDS in the Redis source code, you may have noticed that Redis uses typedef to define an alias for the char* type, which is sds, as shown below:

typedef char *sds;

Actually, this is because SDS is essentially a character array, but with additional metadata on top of the character array. When Redis needs to use a character array, it directly uses the sds alias.

At the same time, when creating a new string, Redis calls the SDS creation function sdsnewlen. The sdsnewlen function creates an SDS variable (which is a char* variable) and a new SDS structure, and assigns the buf[] array in the SDS structure to the sds variable. Finally, the sdsnewlen function copies the newly created string to the sds variable. The following code shows the operation logic of the sdsnewlen function, you can take a look:

sds sdsnewlen(const void *init, size_t initlen) {
    void *sh;  // Pointer to the SDS structure
    sds s;     // sds variable, i.e., char* character array

    ...
    sh = s_malloc(hdrlen+initlen+1);   // Create an SDS structure and allocate memory space
    ...
    s = (char*)sh+hdrlen;              // The sds variable points to the buf array in the SDS structure, sh points to the start of the SDS structure, hdrlen is the length of the metadata in the SDS structure
    ...
    if (initlen && init)
        memcpy(s, init, initlen);    // Copy the string to the sds variable s
    s[initlen] = '\0';               // Add \0 at the end of the variable s to indicate the end of the string
    return s;
}

Alright, now that you have understood the definition of the SDS structure, let’s take a look at the improvements in efficiency of SDS operations compared to traditional C language strings.

Efficiency of SDS Operations #

Because the SDS structure records the space occupied by the character array and the space allocated to the character array, it can bring higher operational efficiency compared to traditional C language implementations of strings.

I will take string appending as an example. The function for string appending in Redis is the sdscatlen function in the sds.c file. This function has three parameters: the target string s, the source string t, and the length len to be appended. The source code is shown below:

sds sdscatlen(sds s, const void *t, size_t len) {
    // Get the current length of the target string s
    size_t curlen = sdslen(s);
    // Check if new space needs to be allocated based on the length len to be appended and the current length of the target string s
    s = sdsMakeRoomFor(s,len);
    if (s == NULL) return NULL;
    // Copy the data of length len from the source string t to the end of the target string
    memcpy(s+curlen, t, len);
    // Set the new length of the target string: the original length curlen plus the length of the copied data
    sdssetlen(s, curlen+len);
    // Add \0 at the end of the target string after the copy
    s[curlen+len] = '\0';
    return s;
}

By analyzing the source code of this function, we can see that the implementation of sdscatlen is relatively simple, and its execution process can be divided into three steps:

  • First, get the current length of the target string and call the sdsMakeRoomFor function to check if new space needs to be allocated based on the current length and the length to be appended. This step is mainly to ensure that the target string has enough space to receive the appended string.
  • Second, after ensuring that the space of the target string is sufficient, append the data of length len from the source string to the target string.
  • Finally, set the new length of the target string.

I have created a diagram showing the execution process of sdscatlen, you can take a look.

sdscatlen Execution

So, by now you can see that compared to string operations in the C language, SDS avoids the traversal of strings by recording the length of the character array being used and the allocated space size. This reduces the operation overhead and helps to efficiently perform various string operations such as creation, appending, copying, and comparison. This design philosophy is worth learning.

In addition, SDS encapsulates the space checking and resizing for the target string in the sdsMakeRoomFor function, and directly calls this function in operations involving changes in the string space, such as appending and copying.

This design implementation avoids the situation where the operation fails due to forgetting to resize the target string. For example, when we use the function strcpy(char *dest, const char *src), if the length of src is greater than the length of dest and we don’t check it in the code, it will cause a buffer overflow. So this design philosophy of encapsulating operations is also worth learning.

Now, apart from using metadata to record the length of the character array and encapsulating operations, what other excellent designs and implementations of SDS are worth learning? This is related to the memory-saving requirement I mentioned earlier.

So next, let’s take a look at how SDS achieves memory saving in programming techniques.

Compact String Structure Programming Techniques #

As mentioned earlier, the SDS structure has a metadata field flags that represents the SDS type. In fact, SDS has a total of 5 types: sdshdr5, sdshdr8, sdshdr16, sdshdr32, and sdshdr64. The main difference between these 5 types lies in the data types of the length len and allocation size alloc fields in their data structures.

Since Redis no longer uses the sdshdr5 type, let’s focus on the remaining 4 types. Take sdshdr8 as an example. Its definition is as follows:

struct __attribute__ ((__packed__)) sdshdr8 {
    uint8_t len; /* length of the character array */
    uint8_t alloc; /* allocated space for the character array, excluding the struct and the ending \0 character */
    unsigned char flags; /* SDS type */
    char buf[]; /* character array */
};

We can see that the data types of the length len and allocation size alloc fields are both uint8_t. uint8_t is an 8-bit unsigned integer that occupies 1 byte of memory. When the string type is sdshdr8, it can represent character array lengths (including the ending \0 character) up to 256 bytes (2^8 equals 256).

As for the sdshdr16, sdshdr32, and sdshdr64 types, their len and alloc fields are of uint16_t, uint32_t, and uint64_t data types respectively. This means that they can represent character array lengths up to 2^16, 2^32, and 2^64 respectively. The memory space occupied by these two metadata fields in the sdshdr16, sdshdr32, and sdshdr64 types is 2 bytes, 4 bytes, and 8 bytes respectively.

In fact, the reason why SDS has different structure headers (i.e., different types) is to be able to efficiently store strings of different sizes and thus save memory space effectively. Because when storing strings of different sizes, the memory space occupied by the structure header is also different, so when storing small strings, the structure header occupies relatively less space.

Otherwise, if all SDS had the same size structure header, such as using uint64_t to represent len and alloc, then suppose we need to store a 10-byte string. At this time, the len and alloc fields in the structure header alone would already occupy 16 bytes, which is more than the data being stored. Such a design is not memory-friendly and does not meet Redis’s memory-saving requirements.

Ok, in addition to designing different types of structure headers, Redis also uses special compile optimizations to save memory space. In the sdshdr8 structure definition we introduced earlier, we can see that __attribute__ ((__packed__)) is used between struct and sdshdr8, as shown below:

struct __attribute__ ((__packed__)) sdshdr8

In fact, the purpose of __attribute__ ((__packed__)) is to tell the compiler to allocate memory for the sdshdr8 structure in a compact way without byte alignment. This is because by default, the compiler allocates memory for variables according to 8-byte alignment. In other words, even if a variable is less than 8 bytes in size, the compiler will allocate 8 bytes for it.

To help you understand, let me give you an example. Suppose I define a structure s1 with two member variables, char and int, as shown below:

#include <stdio.h>
int main() {
   struct s1 {
      char a;
      int b;
   } ts1;
   printf("%lu\n", sizeof(ts1));
   return 0;
}

Although char occupies 1 byte and int occupies 4 bytes, if you run this code, you will find that the printed result is 8. This is because by default, the compiler allocates 8 bytes of space for the s1 structure, and as a result, 3 bytes are wasted.

In order to save memory, Redis’s design in this regard can be said to be carefully considered. Therefore, Redis uses the __attribute__ ((__packed__)) attribute to define the structure, so that the actual memory occupied by the structure will be allocated by the compiler.

For example, I define structure s2 using the __attribute__ ((__packed__)) attribute, which also contains two member variables of type char and int, as shown below:

#include <stdio.h>
int main() {
   struct __attribute__((packed)) s2 {
      char a;
      int b;
   } ts2;
   printf("%lu\n", sizeof(ts2));
   return 0;
}

When you run this code, you can see that the printed result is 5, indicating that the compiler uses compact memory allocation and the s2 structure occupies only 5 bytes of space.

In summary, if you want to save memory overhead in data structure development, you can use the __attribute__ ((__packed__)) programming technique.

Summary #

In this lesson, I mainly introduced the design and implementation of strings in Redis. You need to know that the implementation of strings needs to consider efficient operations, the ability to store arbitrary binary data, and the demand for memory saving. The way Redis designs and implements strings is worth learning and referencing.

Therefore, in this lesson, you need to pay attention to three key points:

  • The inadequacy of implementing strings using char* in C language is mainly because the use of “\0” to indicate the end of the string requires traversing the string during operations, which is not efficient. It is also unable to fully represent data that contains “\0”, which cannot meet the requirements of Redis.
  • The design concept and implementation method of strings in Redis. Redis has specially designed the SDS data structure. Based on the character array, it adds metadata such as the length of the character array and the allocated space size. In this way, operations such as appending, copying, and comparing based on string length can directly read the metadata, which improves efficiency. Moreover, SDS does not determine the end of the string based on the “\0” character in the string, but directly treats it as binary data, which can be used to store binary data such as images.
  • SDS represents strings of different sizes by designing different SDS types, and uses the programming technique attribute (( packed )) to achieve a compact memory layout and save memory.

Strings may seem simple, but through today’s lesson, you can see that there are many intricacies in implementing strings. The relationship and differences between the implementation methods of C language strings and SDS are also frequently asked questions in Redis interviews, so I hope you can master the differences between them through today’s lesson.

One question per class #

SDS strings are widely used in the internal implementation of Redis modules. Can you find places in the Redis server and client implementation where SDS strings are used?

Feel free to share your thoughts and processes in the comments. Let’s exchange and discuss together. If you find this helpful, feel free to share today’s content with more friends.