04 DAG & Pipelining - What is In-Memory Computing #

Hello, I’m Wu Lei.

In my daily development work, I have noticed two common phenomena.

The first is the misuse of caching. Whether working with RDDs or DataFrames, developers habitually call cache on whatever datasets they produce, and application performance ends up dismal. They then ask in frustration, “Isn’t Spark an in-memory computing framework? Why does caching the data in memory actually make performance worse?”

The second phenomenon is related to Shuffle. We all know that Shuffle is a performance killer in Spark, and we should avoid it as much as possible when developing applications. However, from my observation, many beginners lack the motivation to refactor their code to avoid Shuffle. They often think, “As long as the business functionality is implemented, it’s good enough. Will there be a significant performance improvement even if I put in a lot of effort to rewrite the code and eliminate Shuffle?”

These two phenomena may not be of concern to most people, but often these details determine the performance of an application. In my opinion, the root cause of these two phenomena lies in the insufficient understanding of Spark’s in-memory computing by developers. Therefore, today, let’s talk about the various meanings of Spark’s in-memory computing.

First Meaning: Distributed Data Caching #

When it comes to the meaning of “in-memory computing” in Spark, your first reaction is likely to be: Spark allows developers to cache distributed datasets in the memory of compute nodes, enabling efficient data access. Yes, this is the first meaning of in-memory computing: the well-known distributed data caching.

RDD caching is indeed a highlight of the Spark distributed computing engine and one of the many tools for performance optimization in business applications. Many technical blogs and even the Spark official website tirelessly emphasize the importance of RDD caching for application performance.

Precisely because of this emphasis, many developers tend to abuse the cache mechanism in their code without a second thought, which is the first phenomenon mentioned above. What these developers overlook is an important detail: caching is only worthwhile for datasets that are accessed repeatedly. For a dataset that is accessed only once, caching not only fails to improve execution efficiency but also incurs extra performance overhead, producing the opposite of the intended effect.
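To make this concrete, here is a minimal sketch of when cache actually pays off: only when the same dataset is read more than once. The input path and column names below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cacheDemo").getOrCreate()

// Hypothetical input: a log dataset with `level`, `host`, and `service` columns.
val logs = spark.read.parquet("/tmp/logs")

// `errors` is read twice below, so caching it saves one full re-computation.
val errors = logs.filter("level = 'ERROR'").cache()

val byHost    = errors.groupBy("host").count()      // first access materializes the cache
val byService = errors.groupBy("service").count()   // second access is served from memory

// A dataset that is read exactly once gains nothing from cache();
// the extra memory and bookkeeping only add overhead.
```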

This important detail gets overlooked because developers’ understanding of in-memory computing stops at the caching level. So when a business application runs into performance problems, they can only grasp at caching as a last straw, which often makes things worse.

In Spark, in-memory computing has two meanings: the first is the well-known distributed data caching, and the second is the pipeline-style computing mode within a stage. I will explain the working principles of RDD caching in detail in subsequent lessons; today, let’s focus on the second meaning.

Second Meaning: Pipeline Computing Mode in Stages #

Clearly, in order to understand the second meaning of in-memory computing, we need to start with the division of stages in the DAG. But before that, let’s talk about what a DAG is.

What is a DAG? #

DAG stands for Directed Acyclic Graph. As the name implies, a DAG is a “graph”. We know that any graph consists of two basic elements: vertices and edges. Vertices are typically used to represent entities, while edges represent the relationships between entities. In Spark’s DAG, vertices are RDDs, and edges are the parent-child relationships between RDDs through the dependencies property.

Explaining the DAG from a theoretical perspective may be tedious, so I plan to use the example of the Potato Workshop from the previous lecture to help you understand the DAG intuitively. In the previous lecture, the Potato Workshop successfully produced three different sizes of canned “plain” potato chips. However, after the “plain” potato chips were on the market for some time, the sales plummeted. The workshop owner couldn’t help but be anxious and worried. At this point, the workshop foreman suggested, “Boss, why don’t we slightly modify the assembly line and introduce potato chips with different flavors to cater to the diversified market demand?” Then, the foreman handed the owner the effect diagram of the modification, and the owner was very satisfied after seeing it.

However, modifying the assembly line is a big project. For the workers carrying out the modification to collaborate efficiently, the foreman needs to abstract the modification plan into a construction process diagram. With this blueprint, the foreman can assign tasks to the responsible workers, and everyone can work together. In the previous lecture, we compared each form of the ingredients to an RDD, and the relationship between adjacent forms to the dependencies between RDDs. So obviously, the construction process diagram of the assembly line is the DAG.

Each vertex in the DAG is an RDD; in the diagram above, they correspond to RDDs such as potatosRDD (potatoes still covered in mud), cleanedPotatosRDD (washed potatoes), and flavoursRDD (seasoning powders). The edges of the DAG indicate the dependencies and transformations between different RDDs. Clearly, every edge in the diagram has a direction, and the graph as a whole contains no cycles.

So how is the DAG generated?

We all know that in the Spark development model, application development is essentially the process of flexibly combining operators to implement business logic. Developers call operators on distributed datasets such as RDDs, DataFrames, or Datasets to encapsulate their computation logic. Each such call produces a new child RDD; the child RDD records its parent RDD in the dependencies property and records the computation logic encapsulated by the operator in the compute property. As developers keep calling operators on the child RDDs and generating new RDDs, this process repeats itself, and the result is a DAG.

Therefore, from the perspective of developers, the construction of the DAG is accomplished by continuously calling operators on distributed datasets.
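As a minimal sketch, with trivial stand-in functions playing the role of the workshop’s wash, slice, and bake steps, chaining transformations like this is exactly what builds up the DAG:

```scala
import org.apache.spark.sql.SparkSession

val sc = SparkSession.builder().appName("dagDemo").getOrCreate().sparkContext

// Trivial stand-ins for the workshop's processing logic (illustrative only).
val wash  = (p: String) => p.trim
val slice = (p: String) => p.split("-").toSeq
val bake  = (s: String) => s.toUpperCase

val potatosRDD        = sc.parallelize(Seq(" potato-1 ", " potato-2 "))
val cleanedPotatosRDD = potatosRDD.map(wash)             // child records its parent in `dependencies`
val slicedPotatosRDD  = cleanedPotatosRDD.flatMap(slice) // and the operator's logic in `compute`
val bakedChipsRDD     = slicedPotatosRDD.map(bake)

// Each transformation returns a new child RDD whose `dependencies` point back
// to its parent, so the chain of calls above is itself the DAG.
println(bakedChipsRDD.dependencies)
```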

Division of Stages #

We now know what a DAG is and how it is constructed. However, a DAG is ultimately just a flowchart. Spark needs to convert this flowchart into distributed tasks in order to fully exploit the parallelism of a distributed cluster. This is just like the construction process diagram of the Potato Workshop: it is only a blueprint, and the foreman has to find a way to turn it into a real potato-processing assembly line that can keep turning out potato chips in different flavors, so as to meet the owner’s urgent need.

In simple terms, from the developer’s construction of the DAG to the execution of the distributed tasks converted from it in a distributed environment, the process involves the following four steps:

  • Backtracking the DAG and dividing it into stages
  • Creating distributed tasks within the stages
  • Distributing the distributed tasks
  • Executing the distributed tasks

We just mentioned that the second meaning of in-memory computing lies within Stages. Therefore, for this lecture it is enough to understand how the DAG is divided into Stages. The remaining three steps belong more to the scheduling system, so I will explain their ins and outs in the next lecture. If we summarize the process of going from DAG to Stages in one sentence, it is: starting from the Action operator, backtrack the DAG from back to front, and use Shuffle operations as the boundaries for dividing Stages.
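As a quick illustration (a word-count style job rather than the workshop pipeline), you can see this boundary directly: reduceByKey requires a Shuffle, so everything before it lands in one Stage and the aggregation in another, which RDD.toDebugString makes visible in the printed lineage.

```scala
import org.apache.spark.sql.SparkSession

val sc = SparkSession.builder().appName("stageDemo").getOrCreate().sparkContext

val lines  = sc.parallelize(Seq("spark spark flink", "spark hadoop"))
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)      // introduces a ShuffleDependency: the Stage boundary

// The indentation change in the lineage marks the ShuffledRDD, i.e. where the
// DAG is cut into two Stages.
println(counts.toDebugString)

counts.collect()           // the Action that triggers the back-to-front traversal
```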

Next, let’s take the Potato Workshop as an example to explain this process in detail. Since the DAG is divided into Stages based on Shuffles, let’s take a bird’s eye view of the data distribution operations that need to be performed in the DAG of the Potato Workshop design process. Of course, in the Potato Workshop, the data is various forms of potatoes and potato chips.

By carefully observing the design process diagram above, we can easily find two places where data needs to be distributed. The first is right after the potato chips come out of the oven: the ready-to-eat chips are distributed to the downstream assembly lines according to their size, with each line dedicated to processing chips of a fixed size. This is the edge from bakedChipsRDD to flavouredBakedChipsRDD in the diagram. Similarly, the different flavor powders also need to be distributed to the downstream assembly lines according to flavor, where they are mixed with the fixed-size ready-to-eat chips. This is the branch from flavoursRDD to flavouredBakedChipsRDD in the diagram.

At the same time, we can see that the DAG of the Potato Workshop should be divided into three Stages, as shown in the diagram. Stage 0 contains four RDDs, from the raw-potatoes RDD to the ready-to-eat potato chips RDD. Stage 1 is simpler: it has only one RDD, the flavoursRDD that encapsulates the seasoning powders. Stage 2 contains two RDDs: one is the RDD of ready-to-eat chips in different flavors, and the other is the bucketed-chips RDD, indicating that packaging is complete and the product is ready for sale.

You may ask, “What’s the point of turning a DAG into Stages after all this effort?” Well, it turns out that there is a second layer of meaning hidden in the Stages derived from the DAG. To understand the pipeline-style computation mode within each Stage, we need to start with the computing model of Hadoop MapReduce.

In-Memory Computing in Stages #

The in-memory computing model was not created out of thin air; it was carefully designed based on the lessons of a predecessor and the reflections of a successor. The predecessor is Hadoop MapReduce, and the successor is Spark.

MapReduce provides two computation abstractions, Map and Reduce: the Map abstraction lets developers define data processing logic by implementing the map interface, while the Reduce abstraction encapsulates data aggregation logic. The biggest problem with the MapReduce computing model is that all data exchange between computations goes through disk. For example, data passed between two Map jobs, or between Map and Reduce, is exchanged via local disk. It is not hard to imagine that such frequent disk I/O inevitably drags down the end-to-end execution performance of user applications.

So, what does this have to do with the pipeline-style computing mode inside a Stage? Let’s go back to the potato workshop and focus on Stage 0, the stage right before the ready-to-eat potato chips are distributed. This Stage consists of three processing operations: cleaning, slicing, and baking.

In general, such a pipeline is very efficient. After the muddy potatoes are washed, they travel along the assembly line to the slicer; the sliced raw chips then travel further along the line into the oven, and the whole process runs seamlessly. If we think of the assembly line as the memory of a compute node, then all three operations, cleaning, slicing, and baking, are performed in memory.

You may say, “In-memory computing is just about moving both data and computation into memory compared to MapReduce, right?” It may not be as simple as you imagine.

In the potato workshop example, each processing step in Stage 0 produces an intermediate form of the ingredients, such as cleaned potatoes, potato slices, and instant potato chips. Since we just compared the assembly line to memory, that would mean every operator’s intermediate result gets cached in memory for the next operator to consume. This is very similar to a developer overusing RDD cache in application code, and if you have ever done that, you can imagine that such a computing mode would not necessarily beat MapReduce, especially when a Stage contains many operators.

Since it’s not just a matter of moving data and computation to memory, what does the pipeline-style computing mode within a Stage actually look like? In Spark, the pipeline computing mode refers to: within the same Stage, all operators are merged into one function, and the output results of the Stage are generated by applying this function once to the input dataset. This is also the second layer of meaning of in-memory computing. Let’s use an image to explain this computing mode visually.

As the diagram shows, in the upper computation flow, if you treat the assembly line as memory, temporary data is produced after each operation, such as the output of “clean” and “slice”, and this temporary data is cached in memory. In the in-memory computing flow below, however, all of the steps “clean”, “slice”, and “bake” are combined into one function. This function is applied directly to the “muddy potatoes” and produces “instant potato chips” without materializing any intermediate data in memory.
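Here is a small, purely illustrative Scala sketch of that idea: the three per-record steps compose into a single function, and the lazy iterators Spark uses inside a Stage have the same effect, so no intermediate collection is materialized between operators.

```scala
// Stand-ins for the three processing steps (illustrative only).
val clean: String => String = _.trim
val slice: String => String = _.toLowerCase
val bake:  String => String = _ + " (baked)"

// "Fusing" the operators: one function that does clean -> slice -> bake.
val fused: String => String = clean.andThen(slice).andThen(bake)
println(fused("  Muddy-Potato-1 "))   // one function, applied once per record

// Within a Stage, chained operators are evaluated over lazy iterators, which
// amounts to the same thing: each record flows through every step before the
// next record is read, with no intermediate dataset in between.
val potatoes     = Iterator("  Muddy-Potato-1 ", "  Muddy-Potato-2 ")
val instantChips = potatoes.map(clean).map(slice).map(bake)  // still lazy, nothing computed yet
instantChips.foreach(println)                                // records stream through all steps one by one
```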

So, you see, in-memory computing is not just about caching data in memory; more importantly, it is about fusing computations so that data is transformed in memory far more efficiently, which in turn improves the overall execution performance of the application.

At this point, we can answer the second question raised at the beginning: How much performance gain can we achieve by rewriting the code and eliminating Shuffle?

Since computation fusion only happens within Stages, and Shuffle marks the boundary between Stages, the fusion that in-memory computing relies on is interrupted as soon as a Shuffle occurs. Once we have a complete understanding of in-memory computing, we will no longer reach for cache as the only way to improve execution performance. Instead, we will actively look for ways to avoid Shuffle wherever possible, so that as much of the application’s computation as possible can be fused into a single function, improving computational efficiency.
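For example, here is a hedged sketch with hypothetical table and column names: when one side of a join is small, a broadcast join removes the Shuffle that a regular join would otherwise need, keeping the large side’s operators fused within a single Stage.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().appName("avoidShuffleDemo").getOrCreate()

// Hypothetical datasets: a large fact table and a small dimension table.
val orders = spark.read.parquet("/tmp/orders")
val items  = spark.read.parquet("/tmp/dim_items")

// Regular join: both sides may be shuffled by the join key,
// cutting the DAG into more Stages.
val withShuffle = orders.join(items, "itemId")

// Broadcast join: the small side is shipped to every Executor, the large side
// stays put, and its operators remain fused within one Stage.
val withoutShuffle = orders.join(broadcast(items), "itemId")
```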

Summary #

In this lecture, starting from two common phenomena, we discussed what in-memory computing means in Spark.

In Spark, in-memory computing has two layers of meaning: the first is the well-known distributed data caching, and the second is the pipeline-style computing mode within a stage.

For the second layer of meaning, we need to understand the DAG and how Stages are carved out of it. From the developer’s perspective, the DAG is constructed by continuously invoking operators on distributed datasets. Stages are then derived by starting from the Action operator, backtracking the DAG from back to front, and cutting it at Shuffle operations.

Finally, we summarize the more complete meaning of in-memory computing in the second layer: all operators within the same stage are fused into a single function, and the output of the stage is generated by applying this function once to the input dataset.

Daily Practice #

Today’s focus is on understanding, and I hope you can consolidate your knowledge by considering the following two questions:

  1. Today we talked about how the DAG is divided into Stages based on Shuffle. Do you know how Spark determines whether an operation will introduce a Shuffle?

  2. In Spark, all operators within the same Stage are merged into one function. Do you know how this step is achieved?

I look forward to seeing your thoughts and answers in the comments. If you still have many questions about in-memory computing, feel free to write them in the comments as well. See you in the next lecture!