11 Storage Systems - Where is the Data Actually Stored #

Hello, I am Wu Lei.

Thank you for continuing to study during the National Day holiday. In today’s lesson, we will learn about the storage system. Just like the scheduling system, it is also an important infrastructure of Spark. However, you may wonder, “Do we need to understand such low-level knowledge to master Spark application development?” To be honest, yes we do. Why do I say that?

In the previous lessons, we learned about Shuffle management, RDD cache, and broadcast variables. These features and characteristics have a crucial impact on the performance of Spark jobs. And to achieve these functionalities, the underlying support system is the Spark storage system.

Learning about the storage system does more than round out your knowledge of Spark; it directly helps you make better use of RDD cache and broadcast variables. Down the road, it will also lay a solid foundation for optimizing Shuffle.

Since the storage system is so important, how can we efficiently and quickly master it? Following the principle of learning by applying, we need to first understand the service objects of the system, which means understanding what the storage system is used to store.

Service Objects #

In general, the Spark storage system is responsible for maintaining all data temporarily stored in memory and disk, including Shuffle intermediate files, RDD Cache, and broadcast variables.

We are familiar with these three types of data. Let’s first review what Shuffle intermediate files are. During the Shuffle computation process, the Map Task produces data and index files in the Shuffle Write phase. Next, based on the partition index provided by the index file, the Reduce Task in the Shuffle Read phase pulls its own partition data from different nodes. The Shuffle intermediate files refer to the data and index files relied upon to complete the data exchange in these two stages.

RDD Cache refers to materializing a distributed dataset in memory or on disk, which often helps improve computing efficiency. As we mentioned in the previous lecture, broadcast variables have the advantage of distributing shared variables at the granularity of Executors, greatly reducing the network and storage overhead introduced by data distribution.
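If you want to jog your memory with code, here is a minimal sketch of the two features. It assumes a SparkContext named sc is already in scope, and the data is made up purely for illustration.

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2)) // broadcast variable: one copy shipped per Executor
val words = sc.parallelize(Seq("a", "b", "c")).cache() // RDD cache: materialized on the first action
words.map(w => lookup.value.getOrElse(w, 0)).reduce(_ + _) // tasks read the Executor-local copy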

We have just briefly reviewed these three types of data. If you are not particularly clear about any part, you may want to go back and review the previous lectures. We provided detailed explanations of them in lectures 7, 8, and 10, respectively. Now that we understand the main components of the storage system service, let’s take a closer look at the important components of the Spark storage system and how they collaborate.

Composition of Storage System #

The study of theory is always dry and boring. In order to help you understand the core components of the storage system more easily, let’s use the analogy of Spark International Construction Group to explain the Spark storage system.

Compared with the complex organizational relationships in the scheduling system (the DAGScheduler, TaskScheduler, and SchedulerBackend), the personnel composition of the storage system is much simpler. In the discussion of memory management, we regarded a node's memory as the construction site and its disk as the temporary warehouse. Accordingly, the component that manages data storage can be regarded as the warehouse manager, namely the BlockManager.

The Brookes Family #

In the Spark Construction Group, the crucial role of the warehouse manager has always been held by the Brookes family.

The Brookes family has a pivotal position in the Spark group. Old Brookes (BlockManagerMaster) is in charge of the group’s headquarters (Driver), and his offspring, Little Brookes (BlockManager), are stationed at various branch offices (Executors).

Old Brookes is well aware of the overall situation of the company’s building materials and warehouses, thanks to his numerous offspring. The Little Brookes from each branch office compete to report the status of building materials and warehouse conditions to their father. I have compiled their parent-child relationship into the following diagram:

Image

From the above diagram, we can see that the information exchange between Little Brookes and Old Brookes is bidirectional. It is not difficult to see that the Brookes family has a typical “patriarchy” and “autocracy” style. If Little Brookes needs to obtain the status of other branch offices, he must get this information through Old Brookes.

In the previous discussions, we compared building materials to distributed datasets. Therefore, the information exchanged between the BlockManagerMaster and the BlockManagers is actually the status of the data on each Executor. Speaking of this, you may ask, “Since all of the BlockManagerMaster’s information comes from the BlockManagers, where does a BlockManager obtain this information in the first place?” To answer this question, we need to start from the responsibilities of the BlockManager.

As we mentioned earlier, the storage system serves three objects: Shuffle intermediate files, RDD cache, and broadcast variables. The responsibility of BlockManager is to manage the storage, reading, writing, and transmission of these three types of data in Executors. In terms of storage media, these three types of data consume different hardware resources.

Specifically, Shuffle intermediate files consume node disks, broadcast variables mainly occupy node memory space, and RDD cache “straddles two boats,” being able to consume both memory and disk resources.

Image
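In code, the “two boats” of RDD cache correspond to the storage level you pass to persist. The snippet below is just a sketch over an arbitrary RDD, again assuming a SparkContext sc; only one of the persist calls should be active at a time.

import org.apache.spark.storage.StorageLevel
val nums = sc.parallelize(1 to 100)
nums.persist(StorageLevel.MEMORY_ONLY) // cache blocks in Executor memory only
// nums.persist(StorageLevel.MEMORY_AND_DISK) // spill blocks to disk when memory runs short
// nums.persist(StorageLevel.DISK_ONLY) // keep the cached blocks on disk only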

Regardless of whether the data sits in memory or on disk, it is accessed and retrieved at the granularity of data blocks. A data block and an RDD data partition are essentially the same concept viewed from different contexts: when we discuss RDDs, we call a fine-grained piece of data a “data partition”; in the context of the storage system, we call it a “data block.”

With the concept of data blocks in place, we can further refine the responsibilities of the BlockManager. Its core responsibility is to manage the metadata of data blocks; this metadata records and maintains each block’s address, location, size, and status. To give you an intuitive sense of what the metadata looks like, I have put an example in the diagram below.

Image
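To make the metadata more tangible, here is a purely hypothetical sketch of what one such record could hold. The class and field names are my own, chosen for illustration, and do not match Spark’s internal classes.

// Hypothetical illustration only; Spark keeps this information in internal structures such as BlockId and BlockStatus
case class BlockMeta(
  blockName: String, // e.g. which RDD and which partition the block belongs to
  location: String, // which Executor (host) currently holds the block
  storageLevel: String, // memory, disk, or both
  sizeInBytes: Long // how much space the block occupies
)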

Only with the help of metadata can BlockManager efficiently complete the storage, retrieval, transmission, and reception of data. This answers the question I raised earlier—all the information related to the data status of BlockManager comes from the management of metadata. The next question is, based on this metadata, how does BlockManager complete the storage and retrieval of data?

Whether on a construction site or in a warehouse, these places are dusty and full of people. For tasks such as storing and retrieving building materials, Little Brookes, who prefers an elegant lifestyle, will naturally not do it himself. So, he recruited two assistants to help him with these dirty and tiring tasks.

These two assistants are no strangers: one is his cousin MemoryStore, and the other is his cousin DiskStore. As the names imply, MemoryStore is responsible for storing and retrieving data in memory, while DiskStore handles access to data on disk.

Alright, up to this point, all the important roles in the storage system have been introduced. I have organized them into the table below. Next, we will discuss how MemoryStore and DiskStore help Little Brookes manage data, using RDD cache and Shuffle intermediate files as examples.

Image

MemoryStore: Accessing Data in Memory #

Cousin MemoryStore is very capable and well organized. To live up to Little Brookes’ trust, MemoryStore carries a little booklet with detailed information about data blocks. This booklet is a special data structure, LinkedHashMap[BlockId, MemoryEntry]. As the name implies, LinkedHashMap is a kind of map whose key-value pairs consist of a BlockId and a MemoryEntry.

Image

BlockId is used to mark the identity of a block. It’s worth noting that BlockId is not just a string that records the ID, but a data structure that records the metadata of a block. The BlockId data structure records a wealth of information, including the block name, the corresponding RDD, the RDD data partition corresponding to the block, whether it is a broadcast variable, whether it is a Shuffle Block, and so on.

MemoryEntry is an object used to hold data entities, which can be a data partition of an RDD or a broadcast variable. The MemoryEntry stored in the LinkedHashMap is like an address to the data entity.
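Put together, the booklet can be sketched roughly as follows. This is a heavily simplified version of what MemoryStore keeps internally; the entry types and the String key stand in for Spark’s real BlockId and MemoryEntry classes.

import java.util.LinkedHashMap
sealed trait MemoryEntry // stands in for Spark's MemoryEntry: it holds the data entity
case class DeserializedEntry(values: Array[Any]) extends MemoryEntry // e.g. a cached RDD partition as objects
case class SerializedEntry(bytes: Array[Byte]) extends MemoryEntry // e.g. the same data in serialized form
// the third constructor argument `true` makes iteration follow access order, handy for LRU-style eviction
val entries = new LinkedHashMap[String, MemoryEntry](32, 0.75f, true)
entries.put("rdd_0_1", DeserializedEntry(Array[Any](1, 2, 3))) // the String key stands in for a real BlockId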

It is not difficult to see that BlockId and MemoryEntry together are like a household register, which records all the metadata necessary for accessing a data block, such as “resident name”, “affiliated police station”, and “home address”. Based on these metadata, we can access data blocks precisely and strategically, just like “checking household registration”.

import org.apache.spark.rdd.RDD
val rdd: RDD[Int] = sc.parallelize(1 to 100) // any RDD will do; assumes a SparkContext `sc`
rdd.cache() // mark the RDD for caching (MEMORY_ONLY by default)
rdd.count() // an action triggers the computation and materializes the cache

Taking RDD caching as an example, when we use the above code to cache an RDD, Spark will help us do the following three things in the background. I have summarized this process in the following diagram.

  1. Calculate the RDD execution result at the granularity of data partitions, generating corresponding data blocks;
  2. Wrap the data blocks in MemoryEntry and create the metadata BlockId for the data blocks;
  3. Add the key-value pair (BlockId, MemoryEntry) to the “booklet” LinkedHashMap.

Image
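Expressed against the simplified booklet sketched earlier, the three steps boil down to something like this. Again, this is pseudo-Scala for intuition, not Spark’s actual code path.

// Step 1: compute the RDD partition and turn the result into a data block (here just an in-memory array)
val partitionData = Array[Any](1, 2, 3)
// Step 2: wrap the block in a MemoryEntry and prepare its BlockId-like key
val entry = DeserializedEntry(partitionData)
val blockKey = "rdd_0_1" // stands in for the real BlockId metadata
// Step 3: register the (BlockId, MemoryEntry) pair in the booklet
entries.put(blockKey, entry)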

As the RDD caching process progresses, entries accumulate in the LinkedHashMap. Once the booklet maintained by MemoryStore holds the complete information, Spark can access and read each data block through the “household register” entries recorded in it.

DiskStore: Accessing Data on Disk #

Having covered MemoryStore, let’s talk about her cousin DiskStore. DiskStore’s main responsibility is to maintain the correspondence between data blocks and disk files, enabling access to data on disk. Compared with MemoryStore’s meticulous, hands-on approach, DiskStore is much cannier and, like Little Brookes, prefers to be a hands-off manager.

Seeing MemoryStore poring over her “booklet” day and night, DiskStore has no intention of toiling for Little Brookes in the same way, so he recruits an assistant, the DiskBlockManager, to maintain the metadata for him.

With the help of DiskBlockManager, DiskStore can hum a tune and sip coffee while sitting at the warehouse door, receiving the construction workers who come and go. Some workers drop off goods, others pick them up, but either way DiskStore simply refers them all to DiskBlockManager, who tells them exactly which shelf and which level the goods are on.

Image

The assistant DiskBlockManager is a class object. Its getFile method takes a BlockId as a parameter and returns the corresponding disk file. In other words, given a data block, to find out which disk file it is stored in, you call the getFile method to get the answer. With this mapping between data blocks and files, we can easily access the data on disk.

Taking Shuffle as an example: in the Shuffle Write phase, each Map Task generates its intermediate files, namely a data file with the “.data” suffix and an index file with the “.index” suffix. For each of these files, we can use the getFile method of DiskBlockManager to obtain the corresponding disk file, as shown in the following figure.

Image

As we can see, the process of getting the data file and the index file is exactly the same. They both use BlockId to call the getFile method to complete data access.
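Conceptually, both lookups are just two calls to getFile. The snippet below is only a sketch: DiskBlockManager is internal to Spark, the diskBlockManager handle is assumed to already be in scope, and the exact BlockId constructors differ slightly across Spark versions.

import org.apache.spark.storage.{ShuffleDataBlockId, ShuffleIndexBlockId}
val dataBlockId = ShuffleDataBlockId(0, 0, 0) // shuffleId, mapId, reduceId
val indexBlockId = ShuffleIndexBlockId(0, 0, 0)
val dataFile = diskBlockManager.getFile(dataBlockId) // e.g. .../shuffle_0_0_0.data
val indexFile = diskBlockManager.getFile(indexBlockId) // e.g. .../shuffle_0_0_0.index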

Key Review #

In today’s lecture, we focused on the Spark storage system. Regarding the storage system, you need to first know that RDD Cache, Shuffle intermediate files, and broadcast variables are the three main service objects of the storage system.

Next, we introduced the core components of the storage system: the BlockManagerMaster on the Driver side, and the BlockManager, MemoryStore, and DiskStore “residing” on the Executors. The BlockManagerMaster exchanges information with the BlockManagers through heartbeats, covering the address, location, size, and status of data blocks.

In the Executors, the BlockManager completes memory data storage and retrieval through the MemoryStore. The MemoryStore uses a special data structure, LinkedHashMap, to map BlockId to MemoryEntry. BlockId records the metadata of the data blocks, while MemoryEntry is used to encapsulate the data entities.

At the same time, the BlockManager relies on the DiskStore for disk data storage and access. The DiskStore does not maintain a metadata list itself; instead, it uses the DiskBlockManager object to map data blocks to disk files and thereby complete data access.

Image

Practice for Each Lesson #

LinkedHashMap is a very special data structure. In today’s lesson, we only introduced its functionality in the context of Map. You can try to summarize the characteristics and features of this data structure yourself.

I look forward to seeing your thoughts in the comments section. If this lesson has been helpful to you, I also recommend forwarding it to more colleagues and friends. See you in the next lesson!