
03 Why You Must Understand RDD - Resilient Distributed Datasets #

Hello, I am Wu Lei.

Starting today, we will enter the study of principles. I will focus on performance tuning and give you a detailed explanation of the core concepts RDD and DAG in Spark, as well as important components such as the scheduling system, storage system, and memory management. In this lesson, let’s talk about RDD.

RDD is arguably the most fundamental concept in Spark. Developers who use Spark are no strangers to RDD; you have probably heard the term so often that you are tired of it. However, as Spark’s development APIs have evolved, most beginners now start with the DataFrame or Dataset API. As a result, many of them think, “Since the RDD API is hardly used anymore, I don’t need to understand what RDD really is.”

Is it really the case? Of course not.

Why is RDD so important #

First of all, RDD is the cornerstone of Spark’s distributed data model abstraction and is essential for building the Spark distributed in-memory computing engine. Many core concepts and components of Spark, such as DAG and scheduling system, derive from RDD. Therefore, gaining a deep understanding of RDD is beneficial for you to learn the working principles of Spark more comprehensively and systematically.

Secondly, although direct use of the RDD API is declining and most people have grown accustomed to the DataFrame and Dataset APIs, no matter which API or programming language you adopt, your Spark application is ultimately translated into distributed computation over RDDs. In other words, a sufficient understanding of RDD is a prerequisite for locating your application’s performance bottlenecks at runtime. Remember? Locating performance bottlenecks is the first step of Spark performance tuning.
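If you want to see this for yourself, the sketch below (assuming a running SparkSession named spark, for example in spark-shell) shows that even a DataFrame built entirely with the high-level API still exposes the physical plan and the underlying RDD it is ultimately executed on.

// A minimal sketch, assuming a SparkSession named `spark` (e.g. in spark-shell)
val df = spark.range(0, 100).toDF("id").filter("id % 2 = 0")

// The physical plan Spark will actually run for this DataFrame query
df.explain()

// The DataFrame is backed by an RDD with its own data partitions
println(df.rdd.getClass)
println(df.rdd.getNumPartitions)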

Moreover, not fully understanding RDD can lead to potential performance pitfalls. Next, we will start with a counterexample to analyze this.

Do you remember the counterexample we discussed in Lesson 1 about data filtering and aggregation? Through this example, we understood the necessity of performance tuning. What is the relationship between this example and RDD?

Don’t worry, let’s first review the code implementation in this case to explore the underlying reasons why developers adopt this implementation approach.

// Implementation approach 1 -- counterexample
// (factDF, pairDF and rootPath are defined elsewhere; pairDF is assumed to be
//  a Dataset[(String, String)] of (startDate, endDate) pairs)
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, sum}

def createInstance(factDF: DataFrame, startDate: String, endDate: String): DataFrame = {
    val instanceDF = factDF
        .filter(col("eventDate") > lit(startDate) && col("eventDate") <= lit(endDate))
        .groupBy("dim1", "dim2", "dim3", "eventDate")
        .agg(sum("value").as("sum_value"))
    instanceDF
}

// Collect every (startDate, endDate) pair to the driver and, for each pair,
// filter and aggregate factDF all over again before writing the result to Parquet
pairDF.collect.foreach {
    case (startDate: String, endDate: String) =>
        val instance = createInstance(factDF, startDate, endDate)
        val outPath = s"${rootPath}/endDate=${endDate}/startDate=${startDate}"
        instance.write.parquet(outPath)
}

In this code snippet, the main logic of createInstance is to filter factDF by the given time range and return the aggregated business statistics. The driver then collects pairDF and, for each pair of start and end dates, calls createInstance to compute the aggregation and writes the result to disk. As we analyzed in Lesson 1, the main problem with this code is that factDF, a table with millions of rows, is fully scanned hundreds of times over, which ultimately drags down the overall execution performance.
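For reference, here is a minimal sketch of one alternative, not the definitive fix. It assumes the (startDate, endDate) pairs are few enough to broadcast, so that a single join against factDF replaces the repeated scans; the pairsDF name below is introduced purely for illustration, and partitionBy reproduces the same endDate=/startDate= directory layout as the loop above.

// A minimal sketch of one alternative, under the assumptions stated above
import org.apache.spark.sql.functions.{broadcast, col, sum}

// Give the (startDate, endDate) pairs named columns so they can be joined against factDF
val pairsDF = pairDF.toDF("startDate", "endDate")

val resultDF = factDF
    .join(
        broadcast(pairsDF),
        col("eventDate") > col("startDate") && col("eventDate") <= col("endDate")
    )
    .groupBy("dim1", "dim2", "dim3", "eventDate", "startDate", "endDate")
    .agg(sum("value").as("sum_value"))

// One distributed write, laid out the same way the original loop wrote its output
resultDF.write
    .partitionBy("endDate", "startDate")
    .parquet(rootPath)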

So we can’t help but ask: Why did the developer come up with such an inefficient way to implement the business logic? Or, what internal factors naturally led the developer to adopt this implementation approach?

Let’s step out of Spark and this column for a moment and put ourselves back in a classroom: the teacher is explaining “XX programming language” at the blackboard, and the classmate next to you is listening while flipping through the textbook on the desk. Does this scene feel familiar? Now think back: aren’t what the teacher taught, what the textbook says, and the code in our example remarkably similar?

That’s right! Our brains have grown accustomed to for loops, to using functions to process variables and encapsulate computation logic, and to the procedural programming paradigm. Before distributed computing came along, this was how we developed, how the teacher taught, and what the textbook said, so there was nothing wrong with it.

Therefore, I believe the developer chose the implementation above fundamentally because they treated factDF as an ordinary variable, as a function parameter of createInstance no different from startDate and endDate. They did not realize that factDF is actually a massive distributed dataset spanning all compute nodes, nor that the outer loop would cause this massive dataset to be fully scanned again and again in a distributed execution environment.

This lack of awareness of distributed computing can ultimately be traced back to an insufficient understanding of Spark’s core concept, RDD. So you see, it is still necessary to gain a deep understanding of RDD; otherwise, a merely superficial understanding of it can unintentionally leave potential performance pitfalls in your application code.

Understanding RDD in Depth #

Since RDD is so important, what exactly is it? In 2010, on a dark and stormy night, Matei and his colleagues published the paper “Spark: Cluster Computing with Working Sets” and introduced the concept of RDD for the first time. RDD stands for Resilient Distributed Dataset; it is essentially a data model abstraction used to encapsulate all distributed datasets, whether they live in memory or on disk.

If we simply start from theory and stick to the textbook, it would be too dull, boring, and uninteresting! How about I tell you a story first?

Understanding RDD through the Potato Chip Production Process #

A long time ago, there was a workshop that produced potato chips in barrels. The workshop was small in scale and the process was relatively primitive. In order to fully utilize every potato and reduce production costs, the workshop used three assembly lines to produce three different sizes of barrel-packed potato chips simultaneously. Each assembly line could process three potatoes at the same time. The process for each assembly line was the same: cleaning, slicing, baking, distributing, and packaging. Distributing was used to differentiate between small, medium, and large-sized chips, and each size of chip was sent to the first, second, and third assembly lines, respectively. The specific process is shown in the following figure.

As you can see, although the production process of this workshop is simple, it is still methodical. From beginning to end, except for the distribution step, the three assembly lines do not intersect. Before the distribution step, each assembly line works independently, loading the potato ingredients onto the assembly line, then cleaning, slicing, and baking them. After the distribution step, the three assembly lines each package the chips without interfering with each other. The assembly line operation provides strong fault tolerance. If there is an error in one of the processing steps, the workers only need to reload a new potato onto the faulty assembly line to resume production.

Alright, story time is over. If we compare each assembly line to a computing node in a distributed environment and use the potato chip production process to analogize Spark’s distributed computing, what interesting discoveries can we make?

Upon closer observation, we find that the different forms the ingredients take on the assembly line, such as freshly dug potatoes, cleaned potatoes, raw potato slices, and baked potato chips, are analogous to the different RDDs in Spark, each of which is an abstraction over a distributed dataset.

Along the vertical direction of the assembly line, that is, from left to right in the figure, each form of the ingredient is produced from the previous form by the corresponding processing step. Each form depends on the one before it, which is similar to the dependency relationship recorded in the dependencies attribute of an RDD, while the processing step of each stage corresponds to the RDD’s compute attribute.

Looking from the horizontal perspective, that is, from top to bottom in the figure, let’s re-examine the potato processing flow, focusing on the 3 muddy potatoes at the head of the assembly lines. These 3 potatoes, just dug out of the ground, are the raw ingredients waiting to be cleaned. As shown in the figure, let’s call this form of the ingredient potatosRDD. Each potato here represents a data partition of the RDD, and together the 3 potatoes correspond to the RDD’s partitions attribute.

After the muddy potatoes are cleaned, sliced, and baked, the chips are distributed to the downstream assembly lines according to their size. The RDD carried on these three downstream lines is called shuffledBakedChipsRDD. Clearly, the chips are deliberately divided into different data partitions by size, and a partitioning rule like this corresponds to the RDD’s partitioner attribute. In a distributed environment, the partitioner attribute defines how the distributed dataset encapsulated by the RDD is divided into data partitions.
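To make the analogy concrete, here is a rough, illustrative sketch in Spark’s RDD API. It assumes a SparkContext named sc (as in spark-shell), and the wash, slice, bake, and sizeOf functions are made-up stand-ins rather than real processing logic.

// An illustrative sketch of the potato workshop, assuming a SparkContext `sc`;
// the processing functions are fictional stand-ins
import org.apache.spark.HashPartitioner

val wash:   String => String      = p => s"washed($p)"
val slice:  String => Seq[String] = p => Seq(s"$p-chip1", s"$p-chip2")
val bake:   String => String      = c => s"baked($c)"
val sizeOf: String => Int         = c => c.length % 3   // fake size bucket

// potatosRDD: the muddy potatoes, one data partition per "assembly line"
val potatosRDD = sc.parallelize(Seq("potato1", "potato2", "potato3"), numSlices = 3)

// Each step records its parent in `dependencies` and its logic in `compute`
val cleanedPotatosRDD = potatosRDD.map(wash)
val slicedChipsRDD    = cleanedPotatosRDD.flatMap(slice)
val bakedChipsRDD     = slicedChipsRDD.map(bake)

// The distribution step: chips keyed by size are re-partitioned across lines,
// which is where the partitioner attribute comes into play
val shuffledBakedChipsRDD = bakedChipsRDD
    .map(chip => (sizeOf(chip), chip))
    .partitionBy(new HashPartitioner(3))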

In summary, we find that the potato chip production process corresponds exactly to Spark’s distributed computing, with the following six points:

  • Each assembly line in the potato workshop is like a computing node in a distributed environment.
  • Different forms of the ingredient, such as muddy potatoes, sliced potatoes, and baked potato chips, correspond to RDDs.
  • Each form of ingredient relies on the previous one, for example, baked potato chips rely on the previous step of sliced raw potatoes. This dependency relationship corresponds to the dependencies attribute of RDD.
  • The processing methods of different steps correspond to the compute attribute of RDD.
  • The physical items of the same form of ingredient on different assembly lines are the partitions attribute of RDD.
  • The rules for allocating ingredients to each assembly line correspond to the partitioner attribute of RDD.

How is that? With the analogies from the Potato Workshop, has the true face of RDD gradually come into focus? With that intuition in place, let’s now discuss RDD more formally.

Core Features and Attributes of RDD #

From the example we just discussed, we know that RDD has four major attributes: partitions, partitioner, dependencies, and compute. It is precisely because of these four attributes that RDD has its two most prominent features: it is distributed and fault-tolerant. To understand RDD in depth, let’s start from its core features and attributes.

Firstly, let’s talk about partitions and partitioner attributes.

In a distributed computing environment, the data encapsulated by an RDD is scattered across the memory or disks of different compute nodes. These scattered pieces of data are called “data partitions”. The RDD’s partitioning rule decides which data partitions should be placed on which nodes. The partitions attribute corresponds to all the data partitions of the distributed dataset the RDD encapsulates, while the partitioner attribute defines the rule for dividing the data into partitions, such as hash partitioning or range partitioning.

It is not hard to see that the partitions and partitioner attributes describe how an RDD extends horizontally across nodes, so we call them the “horizontal attributes” of RDD.
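The sketch below, assuming a SparkContext named sc, shows how these horizontal attributes surface in the API: you can read how many data partitions an RDD has and attach an explicit partitioning rule.

// A minimal sketch, assuming a SparkContext `sc` (e.g. in spark-shell)
import org.apache.spark.HashPartitioner

val pairRDD = sc.parallelize(1 to 100).map(n => (n % 10, n))

println(pairRDD.getNumPartitions)   // how many data partitions back this RDD
println(pairRDD.partitioner)        // None: no partitioning rule attached yet

val hashedRDD = pairRDD.partitionBy(new HashPartitioner(4))
println(hashedRDD.getNumPartitions) // 4
println(hashedRDD.partitioner)      // Some(...HashPartitioner...): the partitioning rule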

Then, let’s talk about dependencies and compute attributes.

In Spark, no RDD comes out of nowhere: each RDD is transformed from some kind of “data source” according to certain computation logic. The dependencies attribute of an RDD records the “data sources” required to generate it, called its parent dependencies (or parent RDDs), while the compute method encapsulates the computation logic for transforming those parent RDDs into the current RDD.

With the data sources and the transformation logic in hand, no matter what happens to the RDD itself (for example, partial loss of data partitions caused by node failures), the current RDD can always be rebuilt by applying the computation logic in the compute method to the parent RDDs recorded in the dependencies attribute, as shown in the following diagram.

The fault tolerance provided by the dependencies and compute attributes lays a solid foundation for the stability of Spark’s distributed in-memory computing, and it is also the reason RDD is named “Resilient”. Looking at the diagram, it is easy to see that different RDDs are chained together through their dependencies and compute attributes, extending step by step in the vertical direction and building an ever-deeper graph, which is the directed acyclic graph (DAG) we so often refer to.

From this, we can see that the dependencies and compute attributes are responsible for the vertical extension of RDD. Therefore, we can call these two attributes the “vertical attributes” of RDD.
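A small sketch, again assuming a SparkContext named sc, shows how these vertical attributes can be observed: every RDD records the parents it was derived from, and toDebugString prints the lineage that Spark would replay to rebuild lost partitions.

// A minimal sketch, assuming a SparkContext `sc`
val numsRDD    = sc.parallelize(1 to 10)
val squaredRDD = numsRDD.map(n => n * n)
val evensRDD   = squaredRDD.filter(_ % 2 == 0)

// The parent RDDs recorded in the dependencies attribute
println(evensRDD.dependencies.map(_.rdd))

// The full lineage (the DAG), which Spark replays to recover lost data partitions
println(evensRDD.toDebugString)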

In summary, the four major attributes of RDD fall into two categories: horizontal attributes and vertical attributes. The horizontal attributes anchor the data partitions and define how they are distributed across the cluster; the vertical attributes build the DAG in the vertical direction and guarantee the stability of in-memory computing by providing the fault tolerance needed to rebuild RDDs.

One more thing before we move on. In the example at the beginning of this lesson, we analyzed the deep-rooted reasons why developers use a foreach loop to iterate over a distributed dataset. I call this pattern, which reflexively falls back on procedural programming while ignoring or neglecting the distributed nature of the data being processed, the “single-machine thinking mode”.

Once you keep RDD’s horizontal partitions attribute and vertical dependencies attribute in mind, then before repeatedly calling or referencing an RDD you will naturally stop to think about the dataset it encapsulates, and about the strong possibility that this dataset will be scanned and computed over and over across the entire cluster. This subconscious check will drive you to look for better implementations, pulling you out of the single-machine thinking mode. Therefore, a deep understanding of RDD also helps you break free of single-machine thinking and avoid leaving potential performance issues in your application code.

Summary #

Today, I have introduced you to the importance of RDD, as well as its two major core features and four major attributes.

Firstly, understanding RDD is beneficial for developers for the following three reasons:

  • Many core concepts of Spark are derived from RDD. Understanding RDDs helps you to learn Spark comprehensively.
  • Remembering the key features and core attributes of RDDs helps you better locate performance bottlenecks at runtime, and bottleneck localization is the premise of performance tuning.
  • Understanding RDDs in depth helps you break out of a single-machine thinking mode and avoid leaving performance risks in your application code.

Regarding the features and core attributes of RDDs, as long as you keep the following two points in mind, I believe that you will naturally bypass many performance pitfalls without realizing it:

  • The horizontal attributes, partitions and partitioner, anchor the data partitions and determine how they are distributed across the cluster.
  • The vertical attributes, dependencies and compute, are used to build the DAG in the vertical direction and ensure the stability of in-memory computing by providing the fault tolerance needed to rebuild RDDs.

Daily Exercise #

  1. Have you ever encountered the “single-machine thinking mode” in your daily development work? What are some examples?

  2. In addition to the four major properties we discussed today, RDD also has an important property: preferredLocations. Based on your experience, in which situations do you think preferredLocations are important and can improve I/O efficiency? And in which environments do they not work? Why?

Looking forward to seeing your thoughts in the comments section. Feel free to share the “single-machine thinking mode” you’ve encountered in your work as well. See you in the next class!