13 What Is GEO? Can We Define New Data Types? #

In Lesson 2, we learned about the 5 basic data types in Redis: String, List, Hash, Set, and Sorted Set. These types satisfy most data storage needs. However, when used for statistics over massive amounts of data, they can consume a lot of memory, and there are some special scenarios they cannot support at all. Therefore, Redis provides three extended data types: Bitmap, HyperLogLog, and GEO. I covered the first two in the previous lesson; today, I will explain GEO in more detail.

Furthermore, I will introduce the basic steps for developing custom data types. By mastering the development method of custom data types, you can overcome the limitations of basic data types and directly add customized data types in Redis to meet your special requirements.

Next, let’s first understand the implementation principle and usage of the extension data type GEO.

GEO data type for LBS applications #

In our daily lives, we increasingly rely on location-based information, such as searching for “restaurants nearby” or requesting a ride in a taxi app. These applications are built on Location-Based Services (LBS). The data an LBS application accesses is a set of longitude and latitude coordinates associated with people or objects, and the application must be able to query coordinates within an adjacent range. The GEO data type is ideal for LBS services. Let’s take a look at its underlying structure.

Underlying structure of GEO #

Generally, when designing the underlying structure of a data type, we first need to understand the access characteristics of the data. So, we need to first understand how location information is stored and accessed.

I will use a ride-hailing service as an example to analyze the access characteristics of latitude and longitude in LBS applications.

  • Each ride-hailing vehicle has a number (e.g., 33). The vehicle needs to send its longitude information (e.g., 116.034579) and latitude information (e.g., 39.000452) to the ride-hailing app.
  • When a user requests a ride, the ride-hailing app searches for nearby vehicles based on the user’s position (e.g., longitude 116.054579, latitude 39.030452) and matches one to the user.
  • Once a nearby vehicle has been matched, the ride-hailing app retrieves the vehicle’s information by its number and returns it to the user.

As we can see, one vehicle (or one user) corresponds to a set of latitude and longitude, and as the vehicle (or user) moves, the corresponding latitude and longitude will also change.

This data recording pattern maps one key (e.g., a vehicle ID) to one value (a pair of longitude and latitude). When many vehicles need to be recorded, we need a collection type that can store a series of such keys and values. The Hash type can quickly store and retrieve keys and values, which makes it convenient to record the correspondence between vehicle IDs and their latitude and longitude. The collection is depicted in the figure below:

Hash collection

At the same time, the HSET command of the Hash type can set the corresponding value based on the key. Therefore, it can be used to quickly update the latitude and longitude information of changing vehicles.

So far, the Hash type seems like a good choice. However, for an LBS application, in addition to recording latitude and longitude information, it is also necessary to perform range queries in the vehicle Hash collection based on the user’s latitude and longitude information. Once range queries are involved, it means that the elements in the collection need to be ordered. However, the elements of the Hash type are unordered, which obviously cannot meet our requirements.

Now let’s see if using the Sorted Set type is appropriate.

The Sorted Set type also supports the record pattern of one key corresponding to one value. In this type, the key is an element in the Sorted Set, and the value is the weight score of the element. More importantly, the elements in the Sorted Set can be sorted based on the weight score, supporting range queries. This can meet the requirements of searching for adjacent locations in LBS services.
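
As a quick illustration (the key, members, and scores here are made up for the example), a Sorted Set keeps its members ordered by score, so a score range query returns every member whose score falls within the requested interval:

ZADD test:weights 100 "member:a" 105 "member:b" 300 "member:c"
ZRANGEBYSCORE test:weights 90 200

The second command returns member:a and member:b. This ordered range lookup is exactly the property GEO relies on: if nearby locations map to nearby scores, a range query over the scores returns nearby vehicles.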

In fact, the underlying data structure of the GEO type is implemented using the Sorted Set type. Let’s deepen our understanding using the example of a ride-hailing app.

When using a Sorted Set to store the latitude and longitude information of vehicles, the elements of the Sorted Set are vehicle IDs, and the weight score of the elements is the latitude and longitude information, as shown in the figure below:

The problem now is that the weight score of a Sorted Set element is a single floating-point number, while a location consists of two values, longitude and latitude, which cannot be stored directly in one score. So how should we save them?

This is where the GeoHash encoding in the GEO type comes into play.

Encoding Method of GeoHash #

In order to compare latitude and longitude efficiently, Redis adopts the GeoHash encoding method widely used in the industry, which follows the idea of “bisect the interval, then encode the interval”.

When we need to encode a set of latitude and longitude with GeoHash, we first encode longitude and latitude separately, and then combine the encoded values of longitude and latitude into a final encoding.

First, let’s take a look at the separate encoding process for longitude and latitude.

For a geographic location, the longitude range is [-180, 180]. GeoHash encoding turns a longitude value into an N-bit binary value by performing N rounds of bisection on the range [-180, 180], where N can be chosen as needed.

In the first round of bisection, the longitude range [-180, 180] is split into two sub-intervals: [-180, 0) and [0, 180] (call them the left and right intervals). We then check which interval the longitude value to be encoded falls into: the left interval is represented by 0, and the right interval by 1. In this way, each round of bisection yields one bit of the encoding.

Next, we bisect the sub-interval that the longitude value fell into, check again whether it lands in the left or right half, and produce another bit according to the same rule. After N rounds of bisection, the longitude value is represented by an N-bit number.

For example, suppose we want to encode the longitude value 116.37 with a 5-bit encoding (i.e., N=5, five rounds of bisection).

In the first round, the longitude range [-180, 180] is split into the left interval [-180, 0) and the right interval [0, 180]. The longitude value 116.37 falls into the right interval [0, 180], so the first bit is 1.

Next, we bisect [0, 180], the interval that 116.37 belongs to, into [0, 90) and [90, 180]. The value 116.37 again falls into the right interval [90, 180], so the second bit is also 1. In the third round, [90, 180] is split into [90, 135) and [135, 180]; 116.37 falls into the left interval [90, 135), so the third bit is 0.

Continuing this way, after 5 rounds of bisection we locate the longitude value 116.37 in the interval [112.5, 123.75] and obtain its 5-bit encoding, 11010. The encoding process is shown in the following table:

The encoding method for latitude is the same as that for longitude. The only difference is that the latitude range is [-90, 90]. The following table shows the encoding process for the latitude value 39.86.
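
To make the bisection procedure concrete, here is a minimal C sketch of the idea (our own illustration, not the actual Redis implementation). Running it on the two values above reproduces the 5-bit codes 11010 (decimal 26) for the longitude and 10111 (decimal 23) for the latitude:

#include <stdio.h>
#include <stdint.h>

/* Encode a coordinate into nbits by repeatedly bisecting [min, max]:
 * each round emits 0 if the value falls into the left half and 1 if it
 * falls into the right half. */
uint32_t geohash_encode_range(double value, double min, double max, int nbits) {
    uint32_t bits = 0;
    for (int i = 0; i < nbits; i++) {
        double mid = (min + max) / 2;
        bits <<= 1;
        if (value >= mid) {   /* right interval -> 1 */
            bits |= 1;
            min = mid;
        } else {              /* left interval -> 0 */
            max = mid;
        }
    }
    return bits;
}

int main(void) {
    printf("%u\n", geohash_encode_range(116.37, -180, 180, 5)); /* 26, i.e., 11010 */
    printf("%u\n", geohash_encode_range(39.86, -90, 90, 5));    /* 23, i.e., 10111 */
    return 0;
}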

Once the longitude and latitude have each been encoded, we combine their bits into a single value according to the following rule: the even bit positions of the final encoding hold the longitude bits, and the odd positions hold the latitude bits. The even positions start at 0, and the odd positions start at 1.

For example, the encodings we just computed for the longitude and latitude pair (116.37, 39.86) are 11010 and 10111. After combining them, position 0 of the result is bit 0 of the longitude (1), position 1 is bit 0 of the latitude (1), position 2 is bit 1 of the longitude (1), position 3 is bit 1 of the latitude (0), and so on. This gives the final encoded value 1110011101, as shown in the following image:

With GeoHash encoding, the coordinate pair (116.37, 39.86), which cannot be represented by a single weight score, can be represented by the single value 1110011101 and saved as the weight score of a Sorted Set element.
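
The combination step can also be sketched in C (again an illustration of the rule above, not the code Redis itself uses); feeding it the two 5-bit codes produces 1110011101:

#include <stdint.h>

/* Interleave two n-bit codes: even positions of the combined value take the
 * longitude bits and odd positions take the latitude bits, with position 0
 * being the most significant bit of the result. */
uint64_t geohash_interleave(uint32_t lon_bits, uint32_t lat_bits, int nbits) {
    uint64_t combined = 0;
    for (int i = nbits - 1; i >= 0; i--) {  /* highest bit first */
        combined = (combined << 1) | ((lon_bits >> i) & 1);
        combined = (combined << 1) | ((lat_bits >> i) & 1);
    }
    return combined;
}

/* geohash_interleave(26, 23, 5) == 925, which is 1110011101 in binary. */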

Of course, by using GeoHash encoding, we essentially divide the entire geographic space into grids, each corresponding to a partition in GeoHash.

For example, if we bisect the longitude interval [-180, 180] and the latitude interval [-90, 90] once each, their combinations give us 4 partitions. Let’s look at each partition’s longitude and latitude ranges and the corresponding combined GeoHash encoding.

  • Partition 1: [-180, 0) and [-90, 0), encoding 00;
  • Partition 2: [-180, 0) and [0, 90], encoding 01;
  • Partition 3: [0, 180] and [-90, 0), encoding 10;
  • Partition 4: [0, 180] and [0, 90], encoding 11.

These 4 partitions correspond to 4 grid cells, each covering a certain range of longitude and latitude. The more partitions there are, the smaller the geographic area each grid covers, and the more precise the encoding. When we map the encoded values of all grids onto a one-dimensional space, the GeoHash values of neighboring grids are also close to each other, as shown in the following image:

Therefore, encoded values that are adjacent in a Sorted Set range query correspond to grids that are also adjacent in real geographic space, which lets us implement the “search for nearby people or objects” function in LBS applications.

However, I want to remind you that some encoded values that are numerically close correspond to grids that are actually far apart. For example, if we use 4-bit GeoHash encoding, dividing the longitude interval [-180, 180] and the latitude interval [-90, 90] into 4 partitions each, we get 16 grids in total. The two grids with the encoded values 0111 and 1000 are quite far apart, as shown in the following image:

Therefore, to avoid inaccurate queries, we can query the 4 or 8 grids around the given latitude and longitude.

So far, we have learned that the GEO type uses the GeoHash encoding of a longitude and latitude interval as the weight score of a Sorted Set element, and stores the vehicle ID associated with that location as the member itself. Querying adjacent locations then becomes a range query over the encoded scores. Next, let’s talk about how to operate the GEO type.

How to operate the GEO type? #

When using the GEO type, we often use two commands, GEOADD and GEORADIUS.

  • The GEOADD command is used to add a group of latitude and longitude information and their corresponding ID to a GEO type set.
  • The GEORADIUS command will find other elements within a certain range around the input latitude and longitude position. Of course, we can define this range ourselves.

I will still use the scenario of matching vehicles in a ride-hailing application as an example to explain how to use these two commands specifically.

Assuming the vehicle ID is 33 and the latitude and longitude position is (116.034579, 39.030452), we can use a GEO set to save the latitude and longitude of all vehicles. The set key is “cars:locations”. By executing the following command, we can store the current latitude and longitude position of the vehicle with ID 33 in the GEO set:

GEOADD cars:locations 116.034579 39.030452 33
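
Because the GEO type is built directly on a Sorted Set, the same key can also be inspected with ordinary Sorted Set commands. For example, ZSCORE returns the integer weight score that GEOADD derived from the coordinates through GeoHash encoding (the exact value depends on the encoding):

ZSCORE cars:locations 33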

When users want to find nearby ride-hailing cars, the LBS application can use the GEORADIUS command.

For example, when the LBS application executes the following command, Redis will find the vehicle information within a 5 km radius of the input latitude and longitude (116.054579, 39.030452) and return it to the LBS application. Of course, you can modify the “5” parameter to return vehicle information within a larger or smaller range.

GEORADIUS cars:locations 116.054579 39.030452 5 km ASC COUNT 10

In addition, we can further limit the returned vehicle information.

For example, we can use the ASC option to sort the returned vehicle information by distance from the central position, from nearest to farthest, to facilitate selecting the nearest vehicle. We can also use the COUNT option to specify the number of returned vehicle information. After all, there may be many vehicles within a 5 km range, and returning all the information will consume a lot of data bandwidth. This option can help control the amount of data returned and save bandwidth.

As you can see, using the GEO data type makes it very easy to operate on latitude and longitude information.

Although we have 5 basic data types and 3 extended data types, some scenarios still have special requirements. For example, we may need a data type that supports fast single-key lookups like Hash and range queries like Sorted Set. None of the types we have learned so far can satisfy that. Next, I will introduce the ultimate option: custom data types. With them, you can define data types that fit your own requirements, so you will not be left without a suitable type no matter how your application scenario changes.

How to customize data types? #

To implement a custom data type in Redis, firstly, we need to understand the basic object structure of Redis, the RedisObject, because every value in a Redis key-value pair is stored using a RedisObject.

As mentioned in [Lesson 11], the RedisObject consists of metadata and a pointer. One of the functions of metadata is to differentiate different data types, while the pointer is used to point to the value of the specific data type. Therefore, to develop a new data type, let’s first understand the metadata and pointer of RedisObject.

The basic object structure of Redis #

The internal structure of RedisObject consists of four metadata fields, type, encoding, lru, and refcount, plus a *ptr pointer.

  • type: represents the type of the value, including the five basic types we have learned before;
  • encoding: is the encoding method of the value, representing the underlying data structure used to implement various basic types in Redis, such as SDS, compressed list, hash table, skip list, etc.;
  • lru: records the last access time of the object, used when evicting key-value pairs under memory pressure;
  • refcount: records the reference count of the object;
  • *ptr: is a pointer to the data.

RedisObject can use *ptr to point to different data types. For example, if *ptr points to an SDS or a skip list, it means that the value in the key-value pair is of type String or Sorted Set. Therefore, after defining a new data type, we only need to set the type and encoding of the new type in RedisObject, and use *ptr to point to the implementation of the new type.
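
For reference, the RedisObject structure defined in server.h looks roughly like the following simplified sketch (the field widths follow the Redis source; the comments are ours):

#define LRU_BITS 24

typedef struct redisObject {
    unsigned type:4;        /* data type, e.g., OBJ_STRING, OBJ_ZSET */
    unsigned encoding:4;    /* underlying encoding, e.g., SDS, skip list */
    unsigned lru:LRU_BITS;  /* access time/frequency info, used for eviction */
    int refcount;           /* reference count */
    void *ptr;              /* pointer to the actual data */
} robj;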

Developing a new data type #

After understanding the RedisObject structure, defining a new data type is not difficult. First, we define the underlying structure, the type, and the encoding attribute of the new data type; then we implement its creation function, release function, and basic command operations.

Next, I will use an example of developing a new data type named NewTypeObject to explain the specific four steps.

Step 1: Define the underlying structure of the new data type

We use a file called newtype.h to store the definition of this new type. The specific definition is as follows:

typedef struct NewTypeObject {
    struct NewTypeNode *head;
    size_t len;
} NewTypeObject;

where NewTypeNode is the underlying structure of our custom type. We give it two members: a long value that stores the actual data, and a *next pointer that points to the next NewTypeNode.

struct NewTypeNode {
    long value;
    struct NewTypeNode *next;
};

From the code, we can see that the underlying structure of NewTypeObject is simply a singly linked list of long integers. Of course, you can design the underlying structure of NewTypeObject differently according to your needs. For example, if you want NewTypeObject lookups to be faster than a linked list allows, you can implement its underlying structure as a B+ tree instead.

Step 2: Add the definition of this new type to the type attribute of RedisObject

This definition is in the server.h file of Redis. For example, we add a macro definition called OBJ_NEWTYPE to represent the NewTypeObject in the code.

#define OBJ_STRING 0    /* String object. */
#define OBJ_LIST 1      /* List object. */
#define OBJ_SET 2       /* Set object. */
#define OBJ_ZSET 3      /* Sorted set object. */

#define OBJ_NEWTYPE 7

Step 3: Develop the creation and release functions of the new type

Redis defines the creation and release functions of data types in the object.c file. So, we can add the createNewTypeObject function for the NewTypeObject in this file as follows:

robj *createNewTypeObject(void){
   NewTypeObject *h = newtypeNew();
   robj *o = createObject(OBJ_NEWTYPE, h);
   return o;
}

createNewTypeObject calls two functions: newtypeNew and createObject. Let me introduce them separately.

First, let’s talk about the newtypeNew function. It is used to initialize the memory structure of the new data type. This initialization process mainly allocates space for the underlying structure using zmalloc so that data can be written into it.

NewTypeObject *newtypeNew(void){
    NewTypeObject *n = zmalloc(sizeof(*n));
    n->head = NULL;
    n->len = 0;
    return n;
}

The newtypeNew function is related to the specific creation of the new data type. In Redis, each data type is usually defined in a separate file to implement the creation and command operations for this type. For example, t_string.c and t_list.c correspond to the String and List types, respectively. According to Redis conventions, we define the newtypeNew function in a file named t_newtype.c.

createObject is a Redis-provided function for creating RedisObjects. Its parameters are the type of the data type and a pointer to the implementation of the data type.

We pass two parameters to the createObject function: the type value OBJ_NEWTYPE of the new type and a pointer to an initialized NewTypeObject. This way, the created RedisObject can point to our custom data type.

robj *createObject(int type, void *ptr) {
    robj *o = zmalloc(sizeof(*o));
    o->type = type;
    o->ptr = ptr;
    ...
    return o;
}

As for the release function, it is the reverse of the creation function: it frees the memory of the new structure using zfree.
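
As a hypothetical sketch (the function name and the exact way it hooks into Redis are our own assumptions), the release function walks the linked list, frees every node, and finally frees the NewTypeObject header itself:

void freeNewTypeObject(robj *o) {
    NewTypeObject *nt = o->ptr;
    struct NewTypeNode *cur = nt->head, *next;
    while (cur != NULL) {
        next = cur->next;
        zfree(cur);       /* release each list node */
        cur = next;
    }
    zfree(nt);            /* release the NewTypeObject header */
}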

Step 4: Develop the command operations of the new type

In short, the process of adding corresponding command operations can be divided into three steps:

  1. Add the implementation of command operations in the t_newtype.c file. For example, we define the ntinsertCommand function, which implements the insertion operation on the NewTypeObject linked list:
void ntinsertCommand(client *c){
  // Insert elements into the NewTypeObject linked list based on the parameters passed by the client
}
  2. In the server.h file, declare the command we have implemented so that it can be referenced in server.c, for example:
void ntinsertCommand(client *c);
  3. In the server.c file, associate the newly added command with its implementation function in the redisCommandTable. For example, the newly added ntinsert command is implemented by the ntinsertCommand function, so we can use the ntinsert command to insert elements into the NewTypeObject data type.
struct redisCommand redisCommandTable[] = {
...
{"ntinsert",ntinsertCommand,2,"m",...}
}

At this point, we have completed a custom NewTypeObject data type that supports basic command operations. Of course, if you want the new data type to be persisted, you also need to add code to Redis’s RDB and AOF modules so the new type can be saved persistently. I will share this in a later lesson.

Summary #

In this lesson, we learned about the Redis extension data type GEO. GEO can record geographical location information in the form of latitude and longitude, and is widely used in LBS (Location-Based Service) applications. GEO itself does not have a new underlying data structure, but directly uses the Sorted Set collection type.

The GEO type uses the GeoHash encoding method to convert latitude and longitude to the weight score of elements in the Sorted Set. Two key mechanisms involved are dividing the two-dimensional map into intervals and encoding the intervals. After a group of latitude and longitude falls within a certain interval, the encoding value of the interval is used to represent it, and the encoding value is used as the weight score of the Sorted Set element. In this way, we can save latitude and longitude in the Sorted Set, and utilize the “ordered range lookup by weight” feature provided by Sorted Set to implement the frequent “search nearby” requirement in LBS services.

GEO belongs to the extension data types provided by Redis. There are two approaches to implementing extension data types: one is based on existing data types, using data encoding or implementing new operations, such as using Sorted Set and GeoHash encoding to implement GEO, or using String and bit operations to implement Bitmap; the other is to develop custom data types, which involves adding definitions for the new data type, implementing creation and release functions, and implementing command operations supported by the new data type. I encourage you to try applying the knowledge you learned today flexibly to your work scenarios.

One question per lesson #

Up to now, we have already learned about the 5 basic data types and 3 extended data types in Redis. I would like to ask you, have you used any other data types in your daily practice with Redis?

Feel free to share in the comments section the other data types you have used. Let’s exchange and learn together. If you have friends who are interested in developing new data types in Redis, I hope you can share today’s content with them. See you in the next lesson.