08 Distributed Indexing - How to Quickly Locate Data in a Cluster #

Indexes are a key technology for data retrieval. In a distributed database with large data volumes, however, it is impractical to fall back on full table scans the way a single-machine database sometimes can, so the key to a distributed storage engine is locating target data through index lookups.

Since the concept of an index varies significantly across databases, this lecture will first define the kind of index we are going to discuss. It will then describe the read path of a database, from which we can observe how the primary index is used, and it will focus on the index structures kept on disk and in memory. Finally, it will cover secondary indexes: what indexes other than the primary key mean and how they are commonly implemented.

So, let’s start with what a distributed index is.

What Are We Talking About When We Mention Distributed Indexes? #

First of all, I want to explain what we are actually talking about when we discuss distributed indexes. From the previous lecture you already know that a storage engine consists of data files and index files, and that index files usually take the form of index-organized tables. The data storage formats of most distributed databases today are designed around indexes.

Why is this the case?

Since data in a distributed database is spread across multiple nodes, when a query request reaches a server there is a high probability that the target data is not on that node, and one or even several remote calls are needed to retrieve it. For this reason, when designing a distributed database storage engine, we prefer indexed data tables in order to reduce query latency.

This also implies that distributed databases mostly serve query-heavy scenarios: the database sacrifices some write performance by building index structures as data is stored. The core job of a distributed database is therefore to provide data retrieval, and data writes should serve data queries. In this sense, distributed indexes are the primary form in which data is stored.

This lecture will use NewSQL and Cassandra as examples to introduce the main techniques used in typical NoSQL storage engines, aiming to help you understand the data retrieval path in these databases.

Read Path #

To master a distributed database storage engine, you generally need to understand both the write path and the read path. However, as discussed above, writes are designed around the needs of reads, so once the read path is clear, the rules governing writes follow naturally.

Therefore, in this part, let’s first clarify how the storage engine handles query requests. The general rules are as follows:

  1. Find shards and target nodes;
  2. Check if the data is in cache and buffers;
  3. Check if the data is in disk files;
  4. Merge the results.

The first step is to find which node in the distributed system holds the target data. Strictly speaking, this step is not part of the storage engine, but for clarity we include it in the read path. Since distributed databases use sharding to distribute data, if the query condition contains the shard key, the sharding algorithm can compute the shard and hence the target node. If the shard key is not included, we need a secondary index to find the shard key first, after which the lookup proceeds as before.
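
To make this first step concrete, here is a minimal Go sketch of hash-based shard routing; the `nodes` list and the `route` helper are hypothetical, and real systems usually use consistent hashing or range-based sharding rather than a simple modulo.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// nodes is a hypothetical cluster layout: shard i lives on nodes[i].
var nodes = []string{"node-a", "node-b", "node-c", "node-d"}

// route hashes the shard key and maps it to a shard, which in this
// simplified layout corresponds directly to one node.
func route(shardKey string) (int, string) {
	h := fnv.New32a()
	h.Write([]byte(shardKey))
	shard := int(h.Sum32() % uint32(len(nodes)))
	return shard, nodes[shard]
}

func main() {
	for _, key := range []string{"user:42", "user:43", "order:7"} {
		shard, node := route(key)
		fmt.Printf("key=%s -> shard=%d on %s\n", key, shard, node)
	}
}
```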

The second step, once the node is determined, is to let the storage engine handle the rest. We first search the cache, which includes the data cache or row cache: it holds actual rows for quick retrieval of frequently accessed data, and metadata and static configuration data are generally kept there as well. We then search the buffers, the memory areas reserved for batching writes; when a buffer fills up, its contents are flushed to disk, so some data may still be sitting there.

The third step is to check the disk when the data is not in memory, searching for it in the indexed data files. Based on our previous learning, each data file has a primary key index that allows direct lookup. However, for write performance, the storage engine splits the data into multiple data files, so we have to search a whole series of files, and even with the help of indexes the search speed is not satisfactory. This is where Bloom filters come in: they quickly narrow the search down to the target files and improve query efficiency.

The last step is result merging. Depending on the requirements of the execution layer, partial matches can be returned as soon as they are found, or all results can be returned at once. We have now outlined a complete read path for the storage engine, and we can see that several key technologies along this path ensure that data can be queried and read. Below, we introduce these key technologies.
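
Putting the storage-engine side of these steps together, the following sketch (with hypothetical `engine` and `sstable` types, not any real database's API) shows the order in which the row cache, the write buffer, and the on-disk files are consulted.

```go
package main

import "fmt"

// sstable stands in for an immutable on-disk file: a Bloom filter check
// plus its key-value contents (a map here, purely for illustration).
type sstable struct {
	mayContain func(key string) bool
	data       map[string]string
}

type engine struct {
	rowCache map[string]string // step 2: hot rows
	buffer   map[string]string // step 2: writes not yet flushed
	files    []sstable         // step 3: newest file first
}

// get follows the read path: cache, then buffer, then disk files.
func (e *engine) get(key string) (string, bool) {
	if v, ok := e.rowCache[key]; ok {
		return v, true
	}
	if v, ok := e.buffer[key]; ok {
		return v, true
	}
	for _, f := range e.files {
		if !f.mayContain(key) {
			continue // Bloom filter says "definitely absent": skip this file
		}
		if v, ok := f.data[key]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	e := &engine{
		rowCache: map[string]string{},
		buffer:   map[string]string{"k1": "in-buffer"},
		files: []sstable{{
			mayContain: func(string) bool { return true },
			data:       map[string]string{"k2": "on-disk"},
		}},
	}
	fmt.Println(e.get("k1"))
	fmt.Println(e.get("k2"))
	fmt.Println(e.get("k3"))
}
```

A real engine would merge multiple versions of a key (step 4) according to timestamps rather than simply returning the first copy it finds; the sketch only preserves the order of the lookups.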

Indexing Data Tables #

As I mentioned earlier, indexed data tables include index-organized tables and hash-organized tables. In practice, the most common form in distributed databases is the SSTable (Sorted String Table) described in Google’s BigTable paper.

The original description in the Google paper is: SSTable is used for internal data storage in BigTable. SSTable files are sorted, immutable, persistent key-value pair structures, where the key-value pairs can be arbitrary byte strings. They support searching for values using a specific key or iterating through all key-value pairs within a given key range. Each SSTable file contains a series of blocks. The block index in the SSTable file (which is usually stored in the file’s footer) is used to locate blocks, and this block index is loaded into memory when the SSTable file is opened. During a lookup, the block is first found by binary search in the in-memory index, and then the corresponding block can be read with a single disk seek. Another way is to load the entire SSTable file into memory, so that reading from disk is not required during lookup and scanning.

From the above description, we can see that these key-value pairs are sorted by key and are immutable once written. The storage engine supports point queries on a specific key as well as range scans. The index is a sparse index: it only locates a data block, and once the block is found, it is scanned sequentially to obtain the target data.
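
As an illustration of this lookup, the sketch below binary-searches a sparse in-memory block index and then scans the chosen block sequentially; the `block` layout and field names are assumptions made for the example, not the actual BigTable or RocksDB format.

```go
package main

import (
	"fmt"
	"sort"
)

type entry struct{ key, value string }

// block is a run of key-sorted entries; firstKey is what the sparse index records.
type block struct {
	firstKey string
	entries  []entry
}

// lookup binary-searches the block index for the last block whose first key
// is <= key, then scans that block sequentially.
func lookup(blocks []block, key string) (string, bool) {
	i := sort.Search(len(blocks), func(i int) bool { return blocks[i].firstKey > key })
	if i == 0 {
		return "", false // key sorts before the first block
	}
	for _, e := range blocks[i-1].entries {
		if e.key == key {
			return e.value, true
		}
	}
	return "", false
}

func main() {
	blocks := []block{
		{firstKey: "a", entries: []entry{{"a", "1"}, {"c", "2"}}},
		{firstKey: "m", entries: []entry{{"m", "3"}, {"p", "4"}}},
	}
	fmt.Println(lookup(blocks, "p"))
	fmt.Println(lookup(blocks, "z"))
}
```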

Below is the SSTable structure of RocksDB. We can see that the data is placed in the front, and the index is placed as metadata at the end of the file. Even the index of the metadata is placed at the end of the entire metadata structure.

    <beginning_of_file>
    [data block 1]
    [data block 2]
    ...
    [data block N]
    [meta block 1: filter block]
    [meta block 2: index block]
    [meta block 3: compression dictionary block]
    [meta block 4: range deletion block]
    [meta block 5: stats block]
    ...
    [meta block K: future extended block]
    [metaindex block]
    [Footer]
    <end_of_file>

Of course, the implementation of SSTable does not necessarily use a single file. Different storage engines will adopt different strategies to implement it. Some use a single file, as described in the BigTable paper, where data is placed at the beginning of the file and the index is placed at the end of the file. Alternatively, data and index can be separated and placed in different files.

The data is stored in key order, so regardless of how the index is implemented, the data file itself supports range scans; even if the index is an unordered hash table, the sorted data section still allows them.

It is important to note that an SSTable is immutable: once written, it cannot be changed, and modifications and deletions are generally expressed by writing new data. This requires a process called compaction, which merges the operations on the same data into a final result. The process is similar to what a database faces when recovering from a crash, where the basic idea is likewise to replay and merge logs. We will explain SSTable operations in detail when we introduce LSM trees.
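
To make the compaction idea concrete, here is a hedged sketch that merges an older and a newer set of entries, letting the newer versions win and dropping deletion markers (tombstones); real engines merge many sorted files with iterators, but the merge rule is the same in spirit.

```go
package main

import "fmt"

// tombstone marks a key as deleted; compaction can drop it from the output
// once no older file could still contain that key.
const tombstone = "<deleted>"

// compact merges an older and a newer key->value map: newer entries win,
// and keys whose final state is a tombstone are removed from the result.
func compact(older, newer map[string]string) map[string]string {
	out := make(map[string]string, len(older)+len(newer))
	for k, v := range older {
		out[k] = v
	}
	for k, v := range newer {
		out[k] = v // the newer version of the same key overwrites the older one
	}
	for k, v := range out {
		if v == tombstone {
			delete(out, k) // the delete has "won"; drop the key entirely
		}
	}
	return out
}

func main() {
	older := map[string]string{"a": "1", "b": "2", "c": "3"}
	newer := map[string]string{"b": "20", "c": tombstone}
	fmt.Println(compact(older, newer)) // map[a:1 b:20]
}
```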

Of course, there are other ways to implement indexed data tables besides the SSTable. Anyone familiar with database indexes knows that the B-tree family plays a crucial role in indexing: each B-tree node can hold multiple keys, which balances height against width and effectively reduces the number of disk seeks.

However, updating a B-tree is very costly. Distributed databases, in order to achieve efficient writing, adopt a series of optimization techniques to improve the efficiency of updating B-trees. Here, we will take MongoDB's WiredTiger storage engine as an example to introduce one of these optimization techniques.

This optimization technique caches recent operations on the index before persisting them to disk. WiredTiger stores data in a B-tree, and each in-memory B-tree page carries a modification buffer together with a reference to the original page on disk. During a read, the on-disk data is combined with the buffered modifications and returned to the user. The advantage of this approach is that flushing data and updating memory pages are handled by background threads, so reads and writes are not blocked.
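
The sketch below captures the general idea only (it is not WiredTiger's actual code or API): an in-memory page keeps a reference to its on-disk image plus a buffer of recent modifications, and a read merges the two.

```go
package main

import "fmt"

// page models an in-memory B-tree leaf: a reference to the clean on-disk
// image plus a buffer of modifications that have not been reconciled yet.
type page struct {
	diskImage map[string]string // what is currently persisted
	modBuffer map[string]string // recent writes, applied on top at read time
}

// read merges the buffered modification (if any) with the on-disk value,
// so foreground reads never wait for the background flush.
func (p *page) read(key string) (string, bool) {
	if v, ok := p.modBuffer[key]; ok {
		return v, true
	}
	v, ok := p.diskImage[key]
	return v, ok
}

// write only touches the in-memory buffer; a background thread would later
// reconcile the buffer into a new on-disk page image.
func (p *page) write(key, value string) {
	p.modBuffer[key] = value
}

func main() {
	p := &page{
		diskImage: map[string]string{"x": "old"},
		modBuffer: map[string]string{},
	}
	p.write("x", "new")
	fmt.Println(p.read("x")) // new true
}
```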

The two kinds of indexed data tables described above share the same underlying logic for improving write speed: either data is written sequentially from the start, or random writes are buffered and turned into sequential writes.

Both types of data tables mentioned above include buffer structures in memory to deal with the speed difference between memory and disk devices. The data structures used will be discussed in detail later in this lecture.

Now let's take a look at memory buffering.

Memory Buffering #

Currently, there are many different data structures that can be used to store ordered data in memory. In the storage engine of a distributed database, there is a structure that is widely used because of its simplicity - the skip list.

The advantage of the skip list is that it is barely more difficult to implement than a plain linked list, yet its time complexity approaches that of a balanced search tree.

The skip list avoids rotating or replacing nodes on insert or update; instead it relies on probabilistic balancing to keep the whole list balanced. A skip list is made up of a series of nodes of varying heights. By traversing the taller nodes first, we can skip over the shorter ones, somewhat like how Spider-Man uses tall buildings to move quickly through the city, hence the name "skip list". Now let's go through the details of the algorithm with an example; please refer to the image below.

![Drawing 0.png](../images/CioPOWAc-pCAW-h1AACAT7yvNXU780.png)

Let's use finding 15 as an example to illustrate the search order of the skip list.

   1. First, we start at the tallest node in the skip list, which the image shows is 10.
   2. The target 15 is greater than 10, so we try to move forward at the current (highest) level, but there is no next node there, so we drop down one level.
   3. At the lower level we find node 22, which is greater than 15, so we go back to node 10 and drop down another level.
   4. We are now at the lowest level, where we move forward and successfully find node 15.

When nodes are inserted, deleted, or modified, the list has to stay balanced: new nodes must be assigned heights, and the maximum height grows with the number of nodes. How do we decide these values? The answer is surprisingly simple: use random numbers. Although random numbers do not distribute the data perfectly evenly, they achieve a reasonably uniform distribution at very low cost, which is exactly why this algorithm is so widely used: good results at a small price.
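
Below is a compact skip-list sketch that uses random heights; the level cap of 8 and the 1/2 promotion probability are illustrative choices, not values taken from any particular storage engine.

```go
package main

import (
	"fmt"
	"math/rand"
)

const maxLevel = 8

type node struct {
	key  int
	next [maxLevel]*node // next[i] is the successor at level i
}

type skipList struct {
	head  node // sentinel node; its own key is never compared
	level int  // highest level currently in use
}

// randomLevel flips a fair coin: each node is promoted to the next level
// with probability 1/2, which keeps the list balanced on average.
func randomLevel() int {
	lvl := 1
	for lvl < maxLevel && rand.Intn(2) == 0 {
		lvl++
	}
	return lvl
}

// search walks from the top level down, moving right while the next key
// is still smaller than the target, then dropping one level.
func (s *skipList) search(key int) bool {
	cur := &s.head
	for i := s.level - 1; i >= 0; i-- {
		for cur.next[i] != nil && cur.next[i].key < key {
			cur = cur.next[i]
		}
	}
	cur = cur.next[0]
	return cur != nil && cur.key == key
}

// insert records, per level, the last node visited before dropping down,
// then splices the new node in at every level up to its random height.
func (s *skipList) insert(key int) {
	var update [maxLevel]*node
	cur := &s.head
	for i := maxLevel - 1; i >= 0; i-- {
		for cur.next[i] != nil && cur.next[i].key < key {
			cur = cur.next[i]
		}
		update[i] = cur
	}
	lvl := randomLevel()
	if lvl > s.level {
		s.level = lvl
	}
	n := &node{key: key}
	for i := 0; i < lvl; i++ {
		n.next[i] = update[i].next[i]
		update[i].next[i] = n
	}
}

func main() {
	s := &skipList{level: 1}
	for _, k := range []int{10, 22, 15, 3, 7} {
		s.insert(k)
	}
	fmt.Println(s.search(15), s.search(16)) // true false
}
```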

The above is a commonly used data structure for fast searching in memory. Now, how do we determine in which disk file the data resides? The answer is by using a Bloom filter.

Bloom Filters #

So far we have discussed how to search for data in data files and in memory buffers. On the query path we also mentioned that, instead of probing every data file (which leads to read amplification), we can use a Bloom filter to quickly locate the files that may contain the target data.

The principle of a Bloom filter is as follows: we take a very large bit array with every position initialized to 0, then run each key in the data through several hash functions and map each result to a position in the bit array, flipping those 0s to 1s. All keys in the data table are mapped in the same way.

During a search, we hash the query key in the same way and check the corresponding positions in the array. If they are all 1, the key may be in this data table.

As you can see, this is an approximate algorithm, so false positives are possible: all of the checked positions can be 1 even though the key is not in the data table, because those 1s were set by other keys.

In the scenario of locating data files, however, this flaw can be tolerated. A false positive merely wastes some time searching a file that does not contain the key; the query degrades into read amplification but never returns a wrong result.

The principle of a Bloom filter is simple and easy to understand. It is very useful for searching the large number of SSTables generated by an LSM tree storage engine and is an important means of optimizing queries.
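
Here is a minimal Bloom filter sketch, assuming a fixed bit-array size and deriving the probe positions from two FNV hashes (double hashing); real engines size the array and choose the hash count based on a target false-positive rate.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	mBits   = 1 << 16 // size of the bit array
	kHashes = 4       // number of probe positions per key
)

type bloom struct {
	bits [mBits]bool // one bool per bit keeps the sketch simple; real filters pack bits
}

// positions derives kHashes probe positions from two base hashes (double hashing).
func positions(key string) [kHashes]uint32 {
	h1 := fnv.New32a()
	h1.Write([]byte(key))
	a := h1.Sum32()
	h2 := fnv.New32()
	h2.Write([]byte(key))
	b := h2.Sum32() | 1 // make the step odd so probes spread out
	var p [kHashes]uint32
	for i := uint32(0); i < kHashes; i++ {
		p[i] = (a + i*b) % mBits
	}
	return p
}

func (f *bloom) add(key string) {
	for _, pos := range positions(key) {
		f.bits[pos] = true
	}
}

// mayContain returns false only when the key is definitely absent;
// a true result may be a false positive.
func (f *bloom) mayContain(key string) bool {
	for _, pos := range positions(key) {
		if !f.bits[pos] {
			return false
		}
	}
	return true
}

func main() {
	var f bloom
	f.add("user:42")
	fmt.Println(f.mayContain("user:42"), f.mayContain("user:999"))
}
```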

Secondary Index #

All the querying methods I mentioned above are based on primary key indexes, but in real scenarios, non-primary keys are often used as query conditions. This is where the concept of a secondary index comes in.

A secondary index is generally a sparse index, meaning the index is stored separately from the data. An index lookup usually yields the primary key, and the data is then retrieved by that primary key. In a distributed setting this causes obvious performance problems, because the node holding the index entry is very likely not the node holding the data.

One feasible solution to this problem is to scatter the index data according to the secondary-index result (i.e., the primary key), so that the secondary index is built together with the data table and lives alongside the data it points to. Apache Cassandra’s SASI is a good example: it is bound to the lifecycle of an SSTable and is built when the memory cache is flushed or when data files are merged. To a certain extent, this gives the sparse index affinity with its data.

If a key-value pair is used to implement a secondary index, the index result can be combined in the following ways:

  1. Eager mode: merge the index results into a single value so that one lookup returns all matching results.
  2. Normal mode: keep one key-value pair per matching result.
  3. Key combination mode: encode both the index entry and the result into the key, leaving the value empty.

Overall, the read performance of the three modes is similar, while the write performance of the eager mode is slightly lower. The picture may change with the underlying key-value implementation: for example, with the key-value separation design of WiscKey (introduced in Lecture 11), the key combination mode becomes the more sensible choice. And because the key combination mode is simple and well suited to prefix-scan algorithms, it is a common form of secondary index.
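
The sketch below illustrates the three layouts over a toy ordered key-value store; the `idx:city=...` key format and the `scanPrefix` helper are purely illustrative, not the scheme of any particular database.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// kv is a toy key-value store: a map plus prefix scans over sorted keys.
type kv map[string]string

// scanPrefix returns all keys that start with the given prefix, in key order.
func (s kv) scanPrefix(prefix string) []string {
	var keys []string
	for k := range s {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// Eager mode: one index key whose value aggregates every matching primary key.
	eager := kv{"idx:city=beijing": "pk1,pk2,pk3"}
	fmt.Println("eager:", strings.Split(eager["idx:city=beijing"], ","))

	// Normal mode: one key-value pair per match, the primary key stored in the value.
	normal := kv{
		"idx:city=beijing#0": "pk1",
		"idx:city=beijing#1": "pk2",
	}
	for _, k := range normal.scanPrefix("idx:city=beijing") {
		fmt.Println("normal:", normal[k])
	}

	// Key combination mode: the primary key is appended to the key and the value
	// is empty, so a prefix scan alone yields the results.
	combined := kv{
		"idx:city=beijing#pk1": "",
		"idx:city=beijing#pk2": "",
	}
	for _, k := range combined.scanPrefix("idx:city=beijing") {
		fmt.Println("combined:", strings.TrimPrefix(k, "idx:city=beijing#"))
	}
}
```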

Summary #

That’s all for this lecture. We first introduced the concept of distributed indexes, which in practice is a general term for the technologies that distributed database storage engines use to store and locate data. We then walked through the storage engine’s query path, to give you an overall picture of how queries are handled. Finally, we discussed several key technologies along that path and illustrated them with practical examples.