02 RDD and Programming Model - What is Lazy Evaluation #

Hello, I’m Wu Lei.

In the previous lecture, we developed a Word Count application together and executed it in spark-shell. The calculation steps of Word Count are very simple: first, we read the data source, then we tokenize the words, and finally we group and count the words, printing the words with the highest frequency on the screen.

If you also practiced this example, you might have noticed that in the spark-shell REPL, all of the code returns immediately and appears to execute quickly. Only the last line of code takes a noticeably long time before it prints the words “the”, “Spark”, “a”, “and”, and “of” on the screen.

You might find this phenomenon strange: “Reading the data source and grouping and counting should be the most time-consuming steps, so why did they return instantly? Printing the words should be a quick task, so why was it the most time-consuming step instead?” To answer this question, we need to start with RDDs.

What is RDD #

Why start with RDD? First of all, RDD is the cornerstone of Spark’s distributed in-memory computing engine. Many of Spark’s core concepts and components, such as the DAG and the scheduling system, are derived from RDD. Therefore, a deep understanding of RDD will help you learn the working principles of Spark more comprehensively and systematically.

Secondly, although direct use of the RDD API is becoming less common and most people are already familiar with the DataFrame and Dataset APIs, regardless of which API or programming language you use, your application is eventually transformed within Spark into distributed computation over RDDs. In other words, if you want a better grasp of your Spark jobs, a solid understanding of RDD is necessary.

Since RDD is so important, what exactly is it? In short, RDD is Spark’s abstraction for distributed datasets; it encompasses distributed data entities both in memory and on disk.

In the previous lecture, we interpreted RDD as an array. Let’s continue this line of thought and compare RDD with an array.

I have prepared a table comparing RDDs with arrays; you can take a quick look:

In the table, arrays and RDDs are compared across several aspects. Now, let me explain the differences in detail.

Firstly, in terms of the concepts themselves, an array is a concrete entity: a data structure for storing elements of the same type. An RDD, by contrast, is an abstraction that represents a dataset in a distributed computing environment.

Therefore, the second difference between the two lies in scope. The “scope” of an array is narrow, limited to a single process on a single computing node, while the data represented by an RDD spans processes and nodes across a cluster; its “scope” is the whole cluster.

As for the third difference between arrays and RDDs, it lies in the basic unit that carries data. In an array, that unit is the element; in an RDD, it is the data partition. In a distributed computing environment, a complete dataset is divided into multiple data partitions according to certain rules, and these partitions are evenly distributed to different computing nodes and execution processes in the cluster, thereby enabling distributed parallel computing.
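
To make the idea of data partitions more concrete, here is a minimal spark-shell sketch. The numbers and the partition count are made up purely for illustration; it creates a small RDD split into 3 partitions and peeks at how the elements are spread across them.

// Create a tiny RDD and ask Spark to split it into 3 data partitions
val numbers = spark.sparkContext.parallelize(1 to 9, 3)
// How many partitions does this RDD have?
numbers.getNumPartitions
// glom() gathers the elements of each partition into an array,
// so we can see which elements landed in which partition
numbers.glom().collect()
// e.g. Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))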

From the comparisons above, it is not difficult to see that data partitioning is one of the important attributes of the RDD abstraction. Now that we have a preliminary understanding of RDD, let’s change perspective and understand it further through its important attributes. To fully comprehend RDD, we need to grasp its 4 major attributes:

  • partitions: data partitions
  • partitioner: partitioning rules
  • dependencies: RDD dependencies
  • compute: transformation function

If we only talk about these 4 major attributes in the abstract, it may be too dry and boring! So let’s start with a story about making potato chips to better understand the 4 major attributes of RDD.

Analyzing the 4 Properties of RDD through the Potato Chip Production Process #

A long time ago, there was a workshop that produced potato chips in bulk. The workshop was small in scale and its technology was relatively primitive. In order to make full use of each potato and reduce production costs, the workshop used 3 assembly lines to simultaneously produce 3 different sizes of bottled potato chips. The 3 assembly lines could process 3 potatoes at the same time, and the operating procedure of each assembly line was the same: cleaning, slicing, baking, distributing, and packaging. Among these, the distributing step was used to sort the small, medium, and large potato chips, which were sent to the 1st, 2nd, and 3rd assembly lines, respectively. The specific process is shown in the figure below.

Image

Alright, the story is over. So, what interesting discoveries can we make by comparing the potato chip production process with Spark’s distributed computing if we think of each assembly line as a computational node in a distributed operating environment?

Obviously, each form of food material here, such as “potato with soil”, “clean potato”, “potato chips”, etc., can be seen as individual RDDs. And the process of making potato chips is actually the transformation process of different forms of food materials.

At first, the workers loaded the “potatoes with soil” from the sacks onto the assembly line. After being washed, these potatoes became “clean potatoes”. Next, the slicing machine on the assembly line cut the “clean potatoes” into “potato chips”, which were then put into the oven. Finally, after baking, they became ready-to-eat potato chips.

By analyzing this, it is not difficult to find that the transformation process between different forms of food materials is similar to the transformation process between different RDDs in a Word Count operation.

Therefore, next, let’s combine the potato chip production process to understand the 4 properties of RDD.

Firstly, let’s observe the potato workshop’s production process from top to bottom, that is, vertically.

Image

We can see that for each form of food material, there are multiple physical entities on the assembly line corresponding to it. For example, “potatoes with soil” is a form of food material, and there are a total of 3 “dirty” potatoes belonging to this form on the assembly line.

If we consider “potatoes with soil” as an RDD, then the partitions property of the RDD includes all those dirty potatoes in the sack. Similarly, all the cleaned potatoes on the assembly line together constitute the partitions property of the “clean potatoes” RDD.

Let’s take a look at the partitioner property of RDD. This property defines the rule for cutting the original dataset into data partitions. In the potato workshop example, the rule for cutting the “potatoes with soil” RDD is to randomly select, that is, randomly take a dirty potato from the sack and put it on the assembly line. The subsequent forms of food materials, such as “clean potatoes”, “potato chips”, and “ready-to-eat potato chips”, adopt the same cutting rule as the “potatoes with soil” RDD. In other words, these subsequent RDDs inherit the partitioner property of the previous RDD.

What is different here is the “distributed ready-to-eat potato chips”. Obviously, the “distributed ready-to-eat potato chips” are obtained by distributing the “ready-to-eat potato chips” according to their size. In other words, for the “distributed ready-to-eat potato chips”, the partitioner property redefines the partitioning rule of this RDD: the previous RDD’s data partitions are shuffled and reorganized according to the size of the chips. From this example, we can see that how data partitions are distributed is determined by the partitioner of the RDD. Therefore, the partitions attribute of an RDD is strongly tied to its partitioner attribute.
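
To connect this back to Spark code: operators such as reduceByKey implicitly use a hash-based partitioner to decide which partition each Key goes to, and you can also set a partitioner explicitly. Here is a small illustrative sketch, assuming kvRDD is the (word, 1) pair RDD from the Word Count code shown later in this lecture:

import org.apache.spark.HashPartitioner
// Redistribute the (word, 1) pairs so that pairs with the same Key end up
// in the same partition -- like sorting chips by size, this redefinition of
// the partitioning rule triggers a shuffle
val repartitionedRDD = kvRDD.partitionBy(new HashPartitioner(3))
// The new RDD now carries an explicit partitioner
repartitionedRDD.partitioner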

When viewed horizontally, it looks like a ridge; when viewed vertically, it looks like a peak. Many things may look completely different from a different perspective. So next, let’s take a horizontal perspective, that is, from left to right, to observe the production process of the potato workshop.

Image

It is not difficult to find that each form of food on the assembly line is obtained by transforming the previous form of food through some operation. For example, the “potato chips” form depends on the “clean potatoes” form, and the operation used for the transformation is “slicing”. Looking back at the transformation relationships between RDDs in Word Count, we find a similar pattern.

Image

In the process of data transformation, each RDD uses the dependencies attribute to record the one or more RDDs it depends on, which are referred to as its “parent RDDs”. At the same time, the RDD uses the compute attribute to record the transformation operation from the parent RDDs to the current RDD.

Taking the wordRDD in Word Count as an example, its parent RDD is lineRDD, so its dependencies attribute records lineRDD. The transformation from lineRDD to wordRDD depends on the flatMap operation, so the compute attribute of wordRDD records the flatMap transformation function.
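
If you want to see this for yourself, you can inspect the dependencies directly in spark-shell. A quick sketch, using the wordRDD from the Word Count code shown later in this lecture:

// dependencies records the parent RDD(s) this RDD was derived from
wordRDD.dependencies
// toDebugString prints the full lineage -- the chain of dependencies --
// that Spark has recorded for this RDD
println(wordRDD.toDebugString)

The compute attribute, by contrast, is part of each RDD implementation’s internals and is not usually inspected directly.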

In summary, the processing flow of potato chips corresponds to the concept of RDD and its four major attributes:

  • Different forms of food, such as potatoes with soil, sliced potatoes, and ready-to-eat potato chips, correspond to the concept of RDD;
  • The concrete objects of the same form of food on different assembly lines are the partitions attribute of RDD;
  • The rule of how food is allocated to which assembly line corresponds to the partitioner attribute of RDD;
  • Each form of food depends on the previous form, and this dependency relationship corresponds to the dependencies attribute in RDD;
  • The processing methods at different stages correspond to the compute attribute of RDD.
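
These four attributes are not just a teaching device; they correspond roughly to members of Spark’s own RDD abstraction. Below is a heavily simplified, illustrative sketch of what such an abstraction looks like in Scala. It is not Spark’s actual source code, and the class name is made up, but its four members mirror the four attributes we just discussed.

import org.apache.spark.{Dependency, Partition, Partitioner, TaskContext}

// An illustrative, simplified view of the RDD abstraction (not Spark source code)
abstract class SimplifiedRDD[T] {
  // partitions: the data partitions that make up this distributed dataset
  def partitions: Array[Partition]
  // partitioner: the rule (if any) used to assign data to partitions
  def partitioner: Option[Partitioner]
  // dependencies: the parent RDD(s) this RDD is derived from
  def dependencies: Seq[Dependency[_]]
  // compute: how to produce the data of one partition from the parent RDDs
  def compute(split: Partition, context: TaskContext): Iterator[T]
}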

After understanding the four major attributes of RDD, you also need to further understand the programming model and lazy evaluation of RDD. The programming model guides us on how to implement the code, while lazy evaluation is the foundation of Spark’s distributed execution mechanism. Only by understanding the programming model and lazy evaluation can you smoothly develop applications on Spark, implement business logic, and avoid performance problems.

Programming Model and Lazy Evaluation #

Do you remember the question I left you with at the end of the last lecture: what do map, filter, flatMap, and reduceByKey have in common? Now let’s reveal the answer.

Firstly, all four operators operate on RDDs and are used for RDD transformations. For example, flatMap operates on lineRDD and transforms it into wordRDD.

Secondly, these operators are functions themselves, and their parameters are also functions. Functions that have function parameters or function return values are known as “higher-order functions.” In other words, these four operators are all higher-order functions.
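
To make “higher-order function” concrete, here is a tiny illustrative sketch. The helper function toPair is made up for the example, and cleanWordRDD is from the Word Count code shown below; the point is simply that map takes a function as its argument.

// A plain function that turns a word into a (word, 1) pair
def toPair(word: String): (String, Int) = (word, 1)
// map is a higher-order function: its argument is itself a function.
// Passing a named function or an anonymous one is equivalent here.
val kv1 = cleanWordRDD.map(toPair)
val kv2 = cleanWordRDD.map(word => (word, 1))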

We will explore the role, advantages, and disadvantages of higher-order functions later. Here, let’s focus on the first commonality of these RDD operators: RDD transformation.

RDD is Spark’s abstraction for distributed datasets, and each RDD represents one form of the distributed data. For example, lineRDD represents data that exists in the cluster in the form of lines, while wordRDD represents data distributed across the cluster in the form of words.

Now that we understand what an RDD is, what is an RDD transformation? Don’t worry, let me explain it with the Word Count implementation from the last lecture. Below is the code we used before:

import org.apache.spark.rdd.RDD
// Root directory of the data file; the underscore is a placeholder to be replaced with your own path
val rootPath: String = _
val file: String = s"${rootPath}/wikiOfSpark.txt"
// Read file content
val lineRDD: RDD[String] = spark.sparkContext.textFile(file)
// Tokenize by line
val wordRDD: RDD[String] = lineRDD.flatMap(line => line.split(" "))
// Filter out empty strings
val cleanWordRDD: RDD[String] = wordRDD.filter(word => !word.equals(""))
// Convert RDD elements to (Key, Value) format
val kvRDD: RDD[(String, Int)] = cleanWordRDD.map(word => (word, 1))
// Group and count by word
val wordCounts: RDD[(String, Int)] = kvRDD.reduceByKey((x, y) => x + y)
// Print the top 5 words by frequency
wordCounts.map{case (k, v) => (v, k)}.sortByKey(false).take(5)

Reviewing the Word Count example, we can see that the implementation process of Word Count is actually a transformation process between different RDDs. If we look carefully, there are a total of 4 RDD transformations in the Word Count example, and I will explain them in detail:

Initially, we generate lineRDD by calling the textFile API, and then use the flatMap operator to transform lineRDD into wordRDD. Next, the filter operator removes the empty strings from wordRDD, transforming it into cleanWordRDD. Then, to prepare for the subsequent aggregation, the map operator converts cleanWordRDD into kvRDD, an RDD of (Key, Value) pairs. Finally, we use the reduceByKey operator to group kvRDD by word and aggregate the Values, turning the initial 1s into per-word counts.

The process of these 4 transformations is shown in the following diagram:

image

As we just mentioned, RDD represents the form of distributed data, so the transformation between RDDs is essentially a transformation of data forms (Transformations).

In the RDD programming model, there are two types of operators: Transformations and Actions. Developers need to use Transformations operators to define and describe the process of data form transformation, and then use Actions operators to collect the calculation results or materialize them to disk.

In this programming model, Spark’s calculations at runtime are divided into two steps.

  1. Based on the transformation between different data forms, construct a computation graph (DAG, Directed Acyclic Graph).
  2. Through Action operators, trigger the execution of this computation graph in a backtracking manner.

In other words, the Transformations operators called by developers do not execute computations immediately; only when developers call an Action operator do the previously invoked transformation operators actually start executing. In the industry, there is a term specifically used to describe this computing mode: “Lazy Evaluation”.
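
You can observe lazy evaluation with a small experiment in spark-shell. This is just an illustrative sketch; the println inside map is there only to reveal when the computation actually runs.

val numbers = spark.sparkContext.parallelize(1 to 3)
// map is a Transformation: this line returns immediately and prints nothing,
// because no computation has happened yet
val noisy = numbers.map { n => println(s"computing $n"); n * 10 }
// count() is an Action: only now is the computation graph executed, and the
// "computing ..." messages appear (in the console in local mode, or in the
// executors' logs when running on a cluster)
noisy.count()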

Lazy Evaluation neatly explains the question at the beginning of this lecture: why does Word Count spend a long time only on the last line of code, while the previous lines of code return instantly?

The answer is Spark’s Lazy Evaluation. The flatMap, filter, map, and reduceByKey operators are only used to construct the computation graph. Therefore, when you type those lines in spark-shell, spark-shell returns immediately. Only when you enter the last line, which contains the take action, does Spark trigger the execution of the entire computation from start to finish. So, intuitively, the last line of code appears to be the most time-consuming.

The entire running process of the Spark program is shown in the following diagram:

image

You may ask, “Under the RDD development framework, which operators belong to Transformations operators? Which operators belong to Actions operators?”

We all know that Spark has many operators, and the official Spark website provides a complete list of RDD operators. However, the website mainly presents them as a flat list without classification, which makes them hard to read and digest. Therefore, I have classified the commonly used RDD operators and organized them into the table below for your reference.

image

Combining the classification, usage, and applicable scenarios of each operator, this table can help you choose the appropriate operators more quickly and efficiently to implement your business logic. For operators that you are not familiar with in the table, such as aggregateByKey, you can refer to the official website’s introduction and explanation or search for relevant information online to gain a deeper understanding. We will explain important operators in detail in future lessons.

Key Review #

In today’s lecture, we focused on the programming model and lazy evaluation of RDDs, and used the analogy of a potato chip workshop to explain what an RDD is. RDD is Spark’s abstraction for distributed datasets, encompassing all distributed data entities in memory and on disk. When it comes to RDDs, it is important to understand their four main properties, which are the foundation for our future learning:

  • partitions: data partitioning
  • partitioner: partitioning rules
  • dependencies: RDD dependencies
  • compute: transformation functions

After gaining a deep understanding of RDDs, you need to be familiar with the RDD programming model. In the RDD programming model, developers need to use Transformation operations to define and describe the process of transforming data, and then use Action operations to collect or materialize the computation results on disk.

Lazy evaluation means that the various Transformation operations called by developers are not executed immediately. They are only executed when developers call Action operations.

Practice for each lesson #

Although the Word Count computation flow diagram and the potato workshop assembly line may seem to have nothing to do with each other, take some time to think about the differences and connections between them.

I welcome you to share your answers in the comments section. I will be waiting for you there. You are also welcome to share this lesson with more friends and colleagues. See you in the next lesson!