14 How to Store Time Series Data in Redis

When we develop Internet products, we often need to record user behavior data from websites or apps in order to analyze it. This data usually includes the user ID, the action type (such as browsing, logging in, or placing an order), and the timestamp of the action:

UserID, Type, TimeStamp

I previously worked on an IoT project with a similar data storage requirement. We needed to periodically collect real-time status information from thousands of devices, including device ID, pressure, temperature, humidity, and corresponding timestamps:

DeviceID, Pressure, Temperature, Humidity, TimeStamp

A set of data related to timestamps like these is called time series data. The characteristic of this data is that it does not have a strict relational model, and the information can be represented as a key-value relationship (for example, one device ID corresponds to one record). Therefore, it is not necessary to use a relational database like MySQL to store it. Redis, with its key-value data model, is perfectly suited for this data storage requirement. Redis provides two solutions based on its own data structures and extension modules.

In this lesson, I will use the example of calculating device status metrics in an IoT scenario to discuss the practices and pros and cons of different solutions.

As the saying goes, “Know thyself, know thy enemy. A thousand battles, a thousand victories.” So let’s start by examining the characteristics of reading and writing time series data to determine what kind of data type should be used.

Characteristics of reading and writing time series data #

In practical applications, time series data is often continuously written with high concurrency. For example, real-time status values of tens of thousands of devices need to be continuously recorded. At the same time, the main purpose of writing time series data is to insert new data, rather than updating existing data. In other words, once a time series data record is made, it usually does not change because it represents the status value of a device at a certain moment in time (e.g., temperature measurement of a device at a certain moment, once recorded, the value itself will not change).

Therefore, the writing characteristics of this kind of data are simple, that is, fast data insertion. This requires us to choose data types that have low complexity and do not block when inserting data. At this point, you may immediately think of using Redis’ String and Hash types to save data because their insertion complexity is O(1), which is a good choice. However, as I mentioned in [Lesson 11], when recording small data with a String type (such as the temperature value of a device in the previous example), the memory overhead of metadata is relatively large, so it is not suitable for storing large amounts of data.

Now let’s take a look at the characteristics of “reading” time series data.

When querying time series data, we not only query individual records (e.g., querying the operating status information of a device at a certain moment, which corresponds to one record of this device), but also query data within a certain time range (e.g., the status information of all devices from 8 am to 10 am every day).

In addition, there are more complex queries, such as performing aggregation calculations on data within a certain time range. Aggregation calculations here refer to calculations performed on all data that meet the query conditions, including calculating averages, maximum/minimum values, and sums, etc. For example, we want to calculate the maximum pressure of devices during a certain time period to determine if there is a malfunction.

In summary, the “reading” of time series data can be characterized by multiple query patterns.

After understanding the characteristics of reading and writing time series data, let’s look at how to store this data in Redis. For the requirement of fast writes, Redis’ high write performance meets the need directly. For the requirement of “multiple query patterns”, that is, supporting single-point queries, range queries, and aggregation calculations, Redis offers two solutions: one based on the Hash and Sorted Set types, and one based on the RedisTimeSeries module.

Next, let’s learn about the first solution.

Saving Time Series Data with Hash and Sorted Set #

There is an obvious advantage to using a combination of Hash and Sorted Set: they are built-in data types in Redis with mature code and stable performance. Therefore, it is expected that the system would be stable when saving time series data based on these two data types.

However, in the scenarios we have learned so far, we have only used a single data type to store and retrieve data. So, why do we need to use these two types simultaneously when saving time series data? This is the first question we need to answer.

Regarding the Hash type, we all know that it supports fast lookup of a single key. This satisfies the requirement of querying time series data at a single point in time. We can use the timestamp as the key (field) within the Hash and the recorded device status value as the corresponding value.

Let’s take a look at the diagram below, which illustrates using a Hash set to record the temperature values of a device:

When we want to query the temperature data at a specific time point or multiple time points, we can directly use the HGET command or HMGET command to obtain the value(s) of a single key or multiple keys in the Hash set.

For example, we can use the HGET command to query the temperature at the time point 202008030905, and use the HMGET command to query the temperatures at the time points 202008030905, 202008030907, and 202008030908, as shown below:

HGET device:temperature 202008030905
"25.1"

HMGET device:temperature 202008030905 202008030907 202008030908
1) "25.1"
2) "25.9"
3) "24.9"

As you can see, it is easy to implement querying a single key using the Hash type. However, the Hash type has a limitation: it does not support range queries for data.

Although the time series data is inserted into the Hash set in ascending order of time, the underlying structure of the Hash type is a hash table without an ordered index for the data. Therefore, if we want to perform a range query on the Hash type, we need to scan all the data in the Hash set, retrieve these data to the client for sorting, and then obtain the data within the specified range on the client side. Clearly, the query efficiency is low.
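To make this cost concrete, here is a minimal Python sketch of the client-side work that a range query over a Hash requires. It assumes the whole Hash has already been fetched (for example with HGETALL) into a dict of field to value. The function name range_from_hash is hypothetical, and the readings are the three values from the HMGET example above.

```python
def range_from_hash(fields, start, end):
    """Return (timestamp, value) pairs with start <= timestamp <= end, oldest first.

    `fields` stands for the full contents of the Hash, e.g. the reply to HGETALL.
    Every entry is parsed and sorted on the client, so the cost grows with the
    size of the whole Hash, not with the size of the requested range.
    """
    rows = sorted((int(ts), float(value)) for ts, value in fields.items())
    return [(ts, value) for ts, value in rows if start <= ts <= end]


# Readings from the HMGET example (timestamps as yyyymmddHHMM), in arbitrary order.
readings = {
    "202008030908": "24.9",
    "202008030905": "25.1",
    "202008030907": "25.9",
}
print(range_from_hash(readings, 202008030907, 202008030910))
# [(202008030907, 25.9), (202008030908, 24.9)]
```

Even though only two readings fall inside the range, all fields had to be transferred and sorted first.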

To support range queries based on timestamps, we can use the Sorted Set to store time series data, as the Sorted Set can sort the elements based on their scores. We can use the timestamp as the score of the elements in the Sorted Set and use the recorded data on a specific time point as the elements themselves.

Let’s take the example of saving time series data for device temperatures to explain further. The diagram below shows the results stored in a Sorted Set:

After using a Sorted Set to store the data, we can use the ZRANGEBYSCORE command to query the temperature values within a specified time range based on the input maximum timestamp and minimum timestamp. For example, let’s query all the temperature values between 9:07 and 9:10 on August 3, 2020:

ZRANGEBYSCORE device:temperature 202008030907 202008030910
1) "25.9"
2) "24.9"
3) "25.3"
4) "25.2"

Now we know that by using Hash and Sorted Set together, we can meet the requirements of querying data for individual time points and time ranges. However, we will face a new problem, which is the second question we need to answer: how can we ensure that writing to Hash and Sorted Set is an atomic operation?

The term “atomic operation” refers to the situation where multiple write commands (e.g., using the HSET command and ZADD command to write data to the Hash and Sorted Set respectively) either all complete or none of them complete. Only by ensuring the atomicity of write operations can we ensure that the same time-series data is either saved in both Hash and Sorted Set or not saved in either. Otherwise, it may happen that the time-series data exists in the Hash set but not in the Sorted Set. In that case, it would not be possible to satisfy the query requirements when performing range queries.

So how does Redis ensure atomicity in operations? This is where the MULTI and EXEC commands come in, which Redis uses to implement simple transactions. When multiple commands and their parameters are correct, the MULTI and EXEC commands ensure the atomicity of executing these commands. I will introduce Redis’s transaction support and atomicity guarantee for exceptional situations in Lesson 30. For now, let’s just understand how to use the MULTI and EXEC commands.

The MULTI command indicates the beginning of a series of atomic operations. After receiving this command, Redis knows that the commands to follow need to be put into an internal queue and executed together to ensure atomicity.
The EXEC command indicates the end of a series of atomic operations. Once Redis receives this command, it means that all the commands that need to be executed atomically have been sent. At this point, Redis starts executing all the commands that were put into the internal queue.

You can refer to the diagram below. Commands 1 to N are sent after the MULTI command and before the EXEC command. They will be executed together, ensuring atomicity.

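Taking the requirement of saving device status information as an example, we execute the following commands to write the temperature of a device at 9:11 AM on August 3, 2020, using the HSET command for the Hash and the ZADD command for the Sorted Set. Note that a Redis key can hold only one data type, so a real deployment needs two different key names for the Hash and the Sorted Set; otherwise the second command fails with a WRONGTYPE error. The examples in this lesson reuse the name device:temperature for both only to keep them short.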

127.0.0.1:6379> MULTI
OK

127.0.0.1:6379> HSET device:temperature 202008030911 26.8
QUEUED

127.0.0.1:6379> ZADD device:temperature 202008030911 26.8
QUEUED

127.0.0.1:6379> EXEC
1) (integer) 1
2) (integer) 1

As you can see, Redis first receives the MULTI command from the client. Then, when the client sends the HSET and ZADD commands, Redis replies “QUEUED” to each, indicating that the command has been placed in the queue rather than executed. Only when the EXEC command arrives are HSET and ZADD actually executed, and each returns its result (the integer 1, meaning that one new field or member was added).
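From application code, the same transaction is usually issued through a client library. The following is a minimal sketch that assumes the redis-py client, whose pipeline(transaction=True) wraps the queued commands in MULTI and EXEC. The function name record_reading and its two key parameters are hypothetical; separate keys are used because a single Redis key can hold only one data type.

```python
def record_reading(client, hash_key, zset_key, timestamp, value):
    """Write one reading to a Hash and a Sorted Set in a single MULTI/EXEC block.

    `client` is assumed to be a redis-py `redis.Redis` instance (or any object
    with a compatible `pipeline` method). Returns the list of replies from EXEC.
    """
    pipe = client.pipeline(transaction=True)       # commands are queued, not sent one by one
    pipe.hset(hash_key, str(timestamp), value)     # Hash: timestamp -> value, for point queries
    pipe.zadd(zset_key, {str(value): timestamp})   # Sorted Set: score = timestamp, for range queries
    return pipe.execute()                          # sends MULTI ... EXEC and collects the replies
```

With a running server, a call could look like record_reading(redis.Redis(), "device:temperature:hash", "device:temperature:zset", 202008030911, 26.8), where both key names are placeholders.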

With this, we have solved the problem of single-point queries and range queries for time-series data, and used the MULTI and EXEC commands to ensure that Redis can atomically save data to the Hash and Sorted Set. Next, we need to solve the third problem: how to perform aggregation calculations on time-series data.

Aggregation calculations are generally used to periodically summarize the data within a time window. They are frequently performed in real-time monitoring and alerting scenarios.

Since Sorted Set only supports range queries and cannot perform aggregation calculations directly, we have to retrieve the data within the time range to the client and aggregate it there. This works, but it carries a risk: large amounts of data are frequently transferred between the Redis instance and the client, which competes with other commands for network resources and can slow them down.
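As a minimal sketch of this client-side approach, the function below fetches every reading in a time range and computes the maximum locally. It assumes a redis-py style client exposing zrangebyscore, and that the members of the Sorted Set hold the readings as text, as in the examples above. The function name max_in_range is hypothetical.

```python
def max_in_range(client, key, start, end):
    """Fetch every member scored in [start, end] and reduce it on the client.

    `client` is assumed to be a redis-py `redis.Redis` instance (or anything
    with a compatible `zrangebyscore` method). Returns None for an empty range.
    """
    members = client.zrangebyscore(key, start, end)   # every raw point crosses the network
    values = [float(m) for m in members]              # members are str or bytes readings
    return max(values) if values else None
```

The single number we actually need is computed only after all the raw points have been transferred, which is exactly the overhead described above.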

In our IoT project, we need to periodically calculate the temperature status of various devices every 3 minutes. Once the temperature of a device exceeds the set threshold, an alert needs to be triggered. This is a typical aggregation calculation scenario. Let’s take a look at the volume of data during this process.

Suppose we need to calculate the maximum value of every indicator for each device every 3 minutes, and each device records a value for each indicator every 15 seconds. That is 4 values per minute, or 12 values in 3 minutes. We monitor 33 indicators, so each device produces almost 400 data points (33 * 12 = 396) every 3 minutes. With 10,000 devices in total, nearly 4 million (396 * 10,000 = 3.96 million) data points have to be transmitted between the client and the Redis instance every 3 minutes.

To avoid frequent and large data transfers between the client and the Redis instance, we can use RedisTimeSeries to store time-series data.

RedisTimeSeries supports performing aggregation calculations directly on the Redis instance. Let’s continue with the example of calculating the maximum value every 3 minutes. By performing the aggregation directly on the Redis instance, the 12 data points recorded every 3 minutes for each indicator of a device are reduced to a single value. So, for each device, there are only 33 aggregated values to transmit every 3 minutes, and for 10,000 devices, only 330,000 data points. This is one-twelfth (roughly one-tenth) of the data volume that client-side aggregation has to transfer, which greatly reduces the impact of data transfer on the network performance of the Redis instance.
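The comparison can be checked with a few lines of Python. The figures are the ones used in this lesson; only the constant and variable names are made up for the sketch.

```python
SAMPLE_INTERVAL_S = 15      # one value per indicator every 15 seconds
WINDOW_S = 3 * 60           # aggregation window of 3 minutes
INDICATORS = 33
DEVICES = 10_000

samples_per_window = WINDOW_S // SAMPLE_INTERVAL_S        # 12 values per indicator per window
raw_points = samples_per_window * INDICATORS * DEVICES    # transferred with client-side aggregation
aggregated_points = INDICATORS * DEVICES                  # transferred with server-side aggregation

print(raw_points, aggregated_points, raw_points // aggregated_points)
# 3960000 330000 12
```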

Therefore, if we only need to perform single-point queries or queries within a certain time range, it is suitable to use the combination of Hash and Sorted Set. They are inherent data structures of Redis, with good performance and high stability. However, if we need to perform a large number of aggregation calculations and the network bandwidth conditions are not ideal, the combination of Hash and Sorted Set is not very suitable. In such cases, using RedisTimeSeries is more appropriate.

Okay, next, let’s learn about RedisTimeSeries in more detail.

Saving Time Series Data with RedisTimeSeries Module #

RedisTimeSeries is an extension module for Redis that provides specialized data types and access interfaces for time series data. It allows for aggregation calculations on data based on time ranges directly on a Redis instance.

Since RedisTimeSeries is not built into Redis, we need to compile its source code into a dynamic link library called redistimeseries.so and load it, either with the loadmodule directive in the Redis configuration file, as shown below, or at runtime with the MODULE LOAD command:

loadmodule redistimeseries.so

When using RedisTimeSeries for time series data storage and access, there are 5 main operations:

Use the TS.CREATE command to create a time series data collection.
Use the TS.ADD command to insert data.
Use the TS.GET command to read the latest data.
Use the TS.MGET command to filter and query data collections based on labels.
Use the TS.RANGE command to perform range queries with aggregation calculations.

Now, let’s go through how to use these 5 operations.

1. Create a Time Series Data Collection with the TS.CREATE command

In the TS.CREATE command, we set the key of the time series data collection and, with the RETENTION option, the retention period of the data in milliseconds. Additionally, we can set labels for the data collection to represent its properties.

For example, by executing the following command, we create a time series data collection with a key of “device:temperature”, a retention period of 600s (meaning that data more than 600s older than the newest record in the collection will be automatically deleted), and a label attribute of {device_id:1} indicating that the records in this collection belong to device ID 1.

TS.CREATE device:temperature RETENTION 600000 LABELS device_id 1
OK

2. Insert Data with the TS.ADD Command and Read the Latest Data with the TS.GET Command

We can use the TS.ADD command to insert data into a time series collection, including a timestamp and the corresponding value. The TS.GET command is used to read the latest data from the data collection.

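For example, by executing the following TS.ADD command, we can insert a data record into the “device:temperature” collection, recording the device temperature at 9:05am on August 3rd, 2020. The timestamp is given as a Unix timestamp in milliseconds, the same unit that the RETENTION setting uses. Then, by executing the TS.GET command, we can retrieve the newly inserted latest data, which is returned as a timestamp and value pair.

TS.ADD device:temperature 1596416700000 25.1
(integer) 1596416700000

TS.GET device:temperature
1) (integer) 1596416700000
2) "25.1"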
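The epoch value in the TS.ADD example can be derived from the wall-clock time. The sketch below assumes the readings are recorded in UTC+8, which is the offset under which 9:05am on August 3rd, 2020 maps to 1596416700 seconds (1596416700000 milliseconds). The helper to_epoch and the constant UTC_PLUS_8 are hypothetical names.

```python
from datetime import datetime, timedelta, timezone

UTC_PLUS_8 = timezone(timedelta(hours=8))   # assumption: device clocks report UTC+8


def to_epoch(year, month, day, hour, minute, tz=UTC_PLUS_8):
    """Return (seconds, milliseconds) since the Unix epoch for a wall-clock time."""
    seconds = int(datetime(year, month, day, hour, minute, tzinfo=tz).timestamp())
    return seconds, seconds * 1000


print(to_epoch(2020, 8, 3, 9, 5))
# (1596416700, 1596416700000)
```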

3. Filter and Query Data Collections Based on Labels with the TS.MGET Command

When storing time series data for multiple devices, it is common to save the data of different devices in separate collections. In this scenario, we can use the TS.MGET command to query the latest data from specific collections based on labels. When creating data collections with TS.CREATE, we can set labels for the collections. During queries, we can match the label attributes of the collections with the filter condition in order to return the latest data from the matched collections.

For example, let’s assume we have 4 collections for 4 devices with IDs 1, 2, 3, and 4 respectively. We set the device_id label for each collection during creation. Using the TS.MGET command with the FILTER setting (used to set the filter conditions for label attributes), we can query all the data collections except for the device with ID 2 and retrieve the latest data from each matched collection.

TS.MGET FILTER device_id!=2
1) 1) "device:temperature:1"
   2) (empty list or set)
   3) 1) (integer) 1596417000000
      2) "25.3"
2) 1) "device:temperature:3"
   2) (empty list or set)
   3) 1) (integer) 1596417000000
      2) "29.5"
3) 1) "device:temperature:4"
   2) (empty list or set)
   3) 1) (integer) 1596417000000
      2) "30.1"

4. Perform Range Queries with Aggregation Calculations using the TS.RANGE Command

Finally, when performing aggregation calculations on time series data, we can use the TS.RANGE command to specify the time range of the data to be queried. The AGGREGATION parameter is used to specify the type of aggregation calculation to be performed. RedisTimeSeries supports various types of aggregation calculations, such as average (avg), maximum/minimum value (max/min), and sum (sum).

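For example, by executing the following command, we can calculate the average value within each 180s time window for the data between 9:05am and 9:12am on August 3rd, 2020. The timestamps and the window size (180000) are both expressed in milliseconds, and ALIGN start anchors the windows at the start of the queried range instead of at timestamp 0.

TS.RANGE device:temperature 1596416700000 1596417120000 ALIGN start AGGREGATION avg 180000
1) 1) (integer) 1596416700000
   2) "25.6"
2) 1) (integer) 1596416880000
   2) "25.8"
3) 1) (integer) 1596417060000
   2) "26.1"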
Compared to using Hash and Sorted Set to store time series data, RedisTimeSeries is an extension module specifically designed for time series data storage and access. It allows for direct aggregation calculations and filtering of data collections based on label attributes on a Redis instance, making it advantageous when frequent aggregation calculations and filtering of specific device or user data collections from a large number of collections are required.

Summary #

In this lesson, we learned how to use Redis to store time series data. Writing time series data calls for fast insertion, while querying it involves three patterns:

Point query, querying data for a specific timestamp;
Range query, querying data within a range of start and end timestamps;
Aggregation calculation, performing calculations on all data within a range of start and end timestamps, such as finding the maximum/minimum value, calculating the average, etc.

Regarding the requirement for fast writing, Redis’s high-performance writing capabilities are sufficient. As for the diverse querying needs, Redis provides two solutions.

The first solution is to use the combination of Redis’s built-in Hash and Sorted Set types, saving the data in both a Hash collection and a Sorted Set collection. This solution can utilize the Hash type to achieve fast querying of a single key and also leverage the Sorted Set to efficiently support range queries, satisfying the two major querying needs of time series data at once.

However, the first solution has two drawbacks. First, when performing aggregation calculations, the data must be read into the client before it can be aggregated, so the transfer overhead is significant when there is a large amount of data. Second, all the data is saved in both data types, which costs a significant amount of memory. We can mitigate the second drawback by setting appropriate data expiration times to release memory and reduce memory pressure.

The second solution we learned is to use the RedisTimeSeries module. This is an extension module specifically designed for storing and accessing time series data. Compared to the first solution, RedisTimeSeries supports various data aggregation calculations directly on the Redis instance, avoiding the need for extensive data transfer between the instance and the client. However, RedisTimeSeries uses a linked list as the underlying data structure, and its complexity for range queries is O(N). Additionally, its TS.GET query can only return the latest data; unlike the Hash type in the first solution, it cannot return the data for an arbitrary timestamp.

Therefore, using a combination of Hash and Sorted Set, or using RedisTimeSeries, each has its pros and cons in supporting the storage and retrieval of time series data. My suggestion for you is:

If you have a high network bandwidth and large Redis instance memory in your deployment environment, you can prioritize the first solution;
If you have limited network and memory resources in your deployment environment, and you have a large amount of data with frequent aggregation calculations and the need to query by data set properties, you can prioritize the second solution.

One Question per Lesson #

As usual, I have a small question for you.

In this lesson, I mentioned that we can use Sorted Sets to store time series data, using timestamps as scores and the actual data as members. Do you think there are any potential risks in storing data this way? Additionally, if you were a developer and maintainer of Redis, would you design aggregate calculations as an inherent feature of Sorted Sets?

Alright, that’s it for this lesson. If you feel like you’ve gained something, feel free to share today’s content with your friends or colleagues. See you in the next lesson.