04 Execution Models and Distributed Deployment - How Distributed Computing Works #

Hello, I am Wu Lei.

At the end of Lecture 2, we left a question for reflection: what are the differences and connections between the Word Count computational flowchart and the assembly line process in the Potato Workshop? If you don’t quite remember, take a look at the pictures below to refresh your memory.

Image

Image

Let’s start with the differences. First, the Word Count computational flowchart is an abstract flowchart, while the assembly line in the Potato Workshop is an operational, runnable, and specific set of steps. Furthermore, each element in the computational flowchart, such as lineRDD and wordRDD, is a “virtual” data set abstraction, while the different forms of ingredients on the assembly line, such as dirty potatoes, are “real” objects.

After clarifying the differences between the two, their connections naturally become apparent. If we consider the computational flowchart as a “design blueprint,” then the assembly line process is actually the “construction process.” The former is at the design level, providing overarching guidance, while the latter is at the execution level, following a step-by-step implementation. The former lays the foundation for the latter, and the latter materializes the former.

You may wonder, “Why do we need to understand the differences and connections between these two?” The reason is simple: the essence of distributed computing lies in transforming abstract computational flowcharts into tangible distributed computing tasks and then delivering those tasks to the cluster for parallel execution.

In today’s lecture, we will discuss how Spark achieves distributed computing. The implementation of distributed computing relies on two key factors: the process model and the deployment of the distributed environment. Next, we will explore Spark’s process model and introduce the various distributed deployment methods used by Spark.

Process Model #

In Spark application development, the entry point of any application is the main function that carries a SparkSession. The SparkSession is versatile: it provides the Spark runtime context (scheduling system, storage system, memory management, RPC communication, and so on), and it also exposes the development APIs for creating, transforming, and computing distributed datasets (such as RDDs).
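To make this concrete, here is a minimal sketch of such an entry point (the object name and file path are placeholders, not code from the course):

```scala
import org.apache.spark.sql.SparkSession

// A minimal Spark application entry point: the main function that carries a SparkSession.
object WordCountApp {
  def main(args: Array[String]): Unit = {
    // The SparkSession provides the runtime context and the development APIs.
    val spark = SparkSession.builder()
      .appName("WordCountApp")
      .getOrCreate()

    // RDD APIs are reached through the underlying SparkContext.
    val lineRDD = spark.sparkContext.textFile("/path/to/wikiOfSpark.txt")
    println(lineRDD.count())

    spark.stop()
  }
}
```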

However, in a Spark distributed computing environment, there is only one JVM process that runs this main function, and this special JVM process has a dedicated name in Spark: the “Driver”.

The core role of the Driver is to parse user code, construct the computational graph, convert that graph into distributed tasks, and deliver the tasks to the executing processes in the cluster. In other words, the Driver’s job is to break work down and assign it, while the actual “hard work” is done by the executing processes. In Spark’s distributed environment there can be one or more such executing processes, and they too have a dedicated name: “Executors”.

I have depicted the relationship between the Driver and the Executors in the picture below; take a look:

Image

The core of distributed computing lies in task scheduling, and the scheduling and execution of distributed tasks rely on the collaboration between the Driver and the Executors. In the following lectures, we will delve into how the Driver collaborates with numerous Executors to complete task scheduling. Before that, however, we need to clarify the relationship between the Driver and the Executors in order to lay a solid foundation for what comes later.

Driver and Executors: Contractor and Construction Workers #

Simply put, the relationship between the Driver and the Executors is like that between a contractor and construction workers on a building site. The contractor is responsible for “taking on the job”: after receiving the design blueprint, the contractor breaks it down into concrete tasks such as compacting the soil, laying the foundation, bricklaying, and pouring reinforced concrete, and then assigns those tasks to the workers. After claiming their tasks, the workers complete them independently, communicating and coordinating only when necessary.

In fact, different construction tasks often have dependencies. For example, bricklaying can only be done after the foundation is complete, and similarly, pouring reinforced concrete can only be done after the brick wall is laid. Therefore, one of the important responsibilities of the Driver is to break down and arrange construction tasks in a reasonable and orderly manner.

Furthermore, in order to ensure construction progress, the Driver not only distributes tasks but also needs to communicate regularly with each Executor, promptly obtain their progress, and coordinate the overall execution progress.

As the saying goes, a fence needs three stakes, and a hero needs three helpers. To fulfill the series of responsibilities described above, the Driver naturally needs some capable helpers. Inside the Spark Driver process, three objects, the DAGScheduler, the TaskScheduler, and the SchedulerBackend, cooperate to complete the three core steps of distributed task scheduling, namely:

  1. Construct the computational graph based on user code;
  2. Decompose the computational graph into distributed tasks;
  3. Distribute the distributed tasks to the Executors.
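These three steps happen inside the Driver and are not something we call directly, but you can get a rough feel for where the computational graph will be cut by printing an RDD’s lineage in spark-shell. Below is a small sketch (the file path is a placeholder); in the output of toDebugString, the indentation levels and the ShuffledRDD entry hint at the shuffle boundary along which the DAGScheduler splits the graph into stages:

```scala
// Build the Word Count lineage and print it; sc is the SparkContext that spark-shell creates.
val counts = sc
  .textFile("/path/to/wikiOfSpark.txt")   // placeholder path
  .flatMap(line => line.split(" "))
  .filter(word => !word.equals(""))
  .map(word => (word, 1))
  .reduceByKey((x, y) => x + y)

// The printed lineage shows the chain of RDDs and where the shuffle dependency sits.
println(counts.toDebugString)
```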

Upon receiving the tasks, Executors use their internal thread pools to run the task code concurrently against the data partitions assigned to them. For a complete RDD, each Executor is responsible for a subset of that RDD’s data partitions. It is like three workers, A, B, and C, each claiming one-third of all the bricks on the construction site and building the east, west, and north walls respectively.
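The idea that each Executor (or, in local mode, each task thread) works on its own slice of the data can be observed directly. Here is a small experiment you can run in spark-shell; the sample numbers are made up for illustration:

```scala
// Create an RDD with 3 partitions and report which partition handled which record.
val rdd = sc.parallelize(1 to 12, 3)

val report = rdd.mapPartitionsWithIndex { (partitionId, records) =>
  records.map(x => s"partition $partitionId handled record $x")
}

// Each of the 3 partitions is processed by its own task, independently of the others.
report.collect().foreach(println)
```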

Alright, so far we have a basic understanding of the Driver and the Executors, their respective responsibilities, and their relationship with each other. Although we still need to dig into some key objects, such as the aforementioned DAGScheduler and TaskScheduler, this does not prevent us from understanding the Spark process model from a higher-level perspective.

However, you may say, “Models always feel abstract. Could you make this more concrete with an example?” Next, we will continue to use the Word Count example from the previous two lessons to explore how spark-shell works behind the scenes.

Analysis of the spark-shell execution process #

In Lesson 1, we set up a local Spark runtime environment on our own machine and entered the interactive REPL by typing spark-shell in the terminal. Like many other system commands, spark-shell accepts a number of command-line arguments, of which the most important fall into two categories: one specifies the deployment mode (master), and the other specifies the computing resource capacity of the cluster. Running spark-shell without any parameters is actually equivalent to the following command:

spark-shell --master local[*]

This command carries two pieces of information. The first is the deployment mode: the keyword local indicates that Spark is deployed locally, i.e., on your own machine. The second is the deployment scale, the number inside the square brackets, which indicates how many Executors should be started in this local deployment. The asterisk (*) means that this number should match the number of CPUs available on the machine.

In other words, suppose your laptop has 4 CPUs. When you type spark-shell on the command line, Spark will start 1 Driver process and 3 Executor processes in the background.
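If you want to double-check these settings on your own machine, you can query them from inside spark-shell (sc is the SparkContext that the shell creates for you); the exact values will of course depend on your hardware:

```scala
sc.master               // e.g. "local[*]"
sc.defaultParallelism   // matches the number of available CPU cores, e.g. 4
```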

Now here comes the question: how do the Driver process and the three Executor processes cooperate to execute distributed tasks when we enter the Word Count example code into spark-shell?

To help you understand this process, I have drawn a diagram. You can take a look at the overall execution process:

Image

First, the driver uses the take action operator to trigger the execution of the previously constructed computation graph. Following the execution direction of the computation graph, which is top-down in the diagram, the driver creates and distributes distributed tasks with the shuffle as the boundary.

The word “shuffle” originally refers to shuffling a deck of playing cards. In the context of big data, shuffle refers to the inter-process and inter-node data exchange within a cluster. We will explain shuffle in detail in future articles; for now, let’s use the Word Count example to understand it briefly.

Before the reduceByKey operator, the same word, such as “spark,” may be distributed among different executor processes, such as Executor-0, Executor-1, and Executor-2 as shown in the diagram. In other words, these executors process data partitions that all contain the word “spark”.

To count the occurrences of “spark”, we need to distribute all instances of “spark” to the same executor for computation. This process of redistributing words scattered among different executors to the same executor is called shuffle.
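You can observe this routing yourself with a tiny self-contained experiment in spark-shell (the sample words are made up): reduceByKey hash-partitions its output by key, so every occurrence of a given word is sent to the same partition and therefore to the same executor.

```scala
// Three input partitions, with the word "spark" scattered across them.
val kvRDD  = sc.parallelize(Seq(("spark", 1), ("is", 1), ("spark", 1), ("fast", 1)), 3)
val counts = kvRDD.reduceByKey((x, y) => x + y)

// After reduceByKey, the result is hash-partitioned by key.
counts.partitioner.foreach { p =>
  println(s"the word spark is routed to partition ${p.getPartition("spark")}")
}
counts.collect().foreach(println)   // e.g. (spark,2), (is,1), (fast,1)
```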

After briefly understanding shuffle, let’s continue discussing how the driver creates distributed tasks. For all operations before reduceByKey, such as textFile, flatMap, filter, and map, the driver combines them into one task and distributes this task to each executor.

After receiving the tasks, the three executors first parse the tasks and break them down into four steps: textFile, flatMap, filter, and map. Then each executor independently processes the data partition it is responsible for.

To facilitate the explanation, let’s assume the degree of parallelism is 3, which means the original data file wikiOfSpark.txt is divided into 3 partitions, with each executor processing one partition. After processing the data, the content of each partition is transformed from RDD[String] lines into RDD[(String, Int)] key-value pairs, with the count of every word initialized to 1. At this point, the executors promptly report their progress to the driver, making it easier for the driver to coordinate everyone’s next steps.
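If you want to reproduce this three-partition setup in spark-shell, one way (among several) is to pass a minimum partition count when reading the file; note that minPartitions is only a lower bound, so the actual number can be higher depending on how the file is split:

```scala
// Ask for at least 3 input partitions so that, in this example, each of the three
// executors gets one partition to work on.
val lineRDD = sc.textFile("/path/to/wikiOfSpark.txt", minPartitions = 3)
lineRDD.getNumPartitions   // typically 3 for a small file
```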

At this point, in order to carry out the subsequent aggregation, that is, the counting operation, a shuffle must be performed. Only after the data exchange among the different executors has been completed for every word does the driver create and distribute the tasks of the next stage, namely grouping and counting by word.

After the data exchange, all the same words are distributed to the same executors. At this point, each executor receives the reduceByKey task and independently performs the counting operation. After completing the counting, the executors return the final calculation results to the driver.

In this way, spark-shell completes the computation of the Word Count user code. Having analyzed the Spark process model and the relationship between the driver and executors, I hope you now have a clearer understanding of how they work together.

However, so far, the explanation of the Word Count example and spark-shell has been based on a local deployment environment. We know that the real power of Spark lies in parallel computing in a distributed cluster. Only by fully utilizing the computing resources of each node in the cluster can we fully unleash the performance advantages of Spark. Therefore, it is necessary for us to learn and understand the distributed deployment of Spark.

Distributed environment deployment #

Spark supports multiple distributed deployment modes, such as Standalone, YARN, Mesos, and Kubernetes. Among them, Standalone is the built-in resource scheduler of Spark, while YARN, Mesos, and Kubernetes are independent third-party resource schedulers and service orchestration frameworks.

Since the latter three provide independent and complete resource scheduling capabilities, Spark is, from their perspective, just one of the many computing engines they support. Deploying Spark on these frameworks therefore involves relatively few steps and a fairly simple process: as developers, we only need to download and unzip the Spark installation package, adjust the Spark configuration files appropriately, and set the environment variables.

Therefore, we will not go into detail about the deployment modes of YARN, Mesos, and Kubernetes. I will leave them to you as homework to explore. In today’s lecture, we will focus only on the distributed deployment of Spark in Standalone mode.

To illustrate the deployment process in Standalone mode, I have created and started three EC2 compute nodes in the AWS environment, with the Linux/CentOS operating system.

It should be noted that Spark’s distributed computing environment imposes no requirements or restrictions on the node type or operating system. In an actual deployment, however, try to keep the operating system of every compute node consistent to avoid unnecessary trouble.

Next, I will guide you step by step to complete the distributed deployment in Standalone mode.

At the resource scheduling level, Standalone adopts a master-slave architecture in which the compute nodes take the roles of Master and Worker. There is only one Master, while there can be one or more Workers. All Worker nodes periodically report their available resources to the Master, and the Master is responsible for aggregating, updating, and managing the cluster’s available resources and for responding to resource requests from the Driver of a Spark application.

For convenience, we set the hostnames of the three EC2 instances to node0, node1, and node2, select node0 as the Master node, and use node1 and node2 as the Worker nodes.

First, to achieve seamless communication between the three machines, let’s configure passwordless SSH communication among the three nodes:

Image

Next, let’s prepare the Java environment on all nodes and install Spark (step 2 uses sudo wget to download the Spark package; the download URL can be found on the official Spark download page). The commands are shown in the following table:

Image

After installing Spark on all nodes, we can start the Master and Worker nodes one by one, as shown in the following table:

Image

After the cluster is started, we can use the built-in small example in Spark to verify if the Standalone distributed deployment is successful. First, open the command line terminal of the Master or Worker, then enter the following command:

MASTER=spark://node0:7077 $SPARK_HOME/bin/run-example org.apache.spark.examples.SparkPi

If the program successfully computes an approximate value of Pi (roughly 3.14), as shown in the figure below, it means that our Spark distributed computing cluster is ready. You can compare your output with the screenshot to verify that your environment works as well.

Image
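Beyond the built-in example, you can also point your own code at the cluster. Here is a minimal sketch that connects a SparkSession to the Standalone Master we just started (the application name is arbitrary; the master URL matches the node0 setup above):

```scala
import org.apache.spark.sql.SparkSession

// Connect to the Standalone cluster and run a trivial distributed computation.
val spark = SparkSession.builder()
  .appName("StandaloneSmokeTest")
  .master("spark://node0:7077")
  .getOrCreate()

// Should print 5050.0 if the cluster is healthy.
println(spark.sparkContext.parallelize(1 to 100).sum())

spark.stop()
```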

Key Review #

In today’s lecture, we mentioned that the essence of distributed computing lies in how to transform abstract computational flow graphs into real distributed computing tasks, and then deliver them for parallel execution. To truly understand distributed computing, you need to master the Spark process model.

The core of the process model is the Driver and the Executors, and we need to focus on understanding how they cooperate. The entry point of any Spark application is the main function that carries a SparkSession. In Spark’s distributed computing environment, there is only one JVM process running this main function, and it is called the “Driver”.

The most important role of the Driver is to parse user code, build the computational flow graph, transform the graph into distributed tasks, and distribute the tasks to the Executors in the cluster for execution. Upon receiving the tasks, the Executors use their internal thread pools to execute the task code concurrently against the data partitions assigned to them.

For a complete RDD, each Executor is responsible for processing a subset of that RDD’s data partitions. Whenever a task is completed, the Executors promptly report its status to the Driver. After obtaining the Executors’ progress, the Driver, following the task decomposition of the computational flow graph, distributes the tasks of the next stage to the Executors, and so on until the entire graph has been executed.

We then introduced the distributed deployment modes supported by Spark, including Standalone, YARN, Mesos, and Kubernetes. Among these deployment modes, you need to focus on mastering the operational steps for deploying in the Standalone environment.

Practice for Each Lesson #

Great! At the end of this lecture, I have two exercises for you to reinforce what you have learned today.

  1. Similar to the take operator, the collect operator is used to collect the computation results. Considering the Spark process model, can you point out some risks of using the collect operator compared to the take operator?

  2. If you are using YARN, Mesos, or Kubernetes in your production environment, can you explain the necessary steps to achieve distributed deployment of Spark under these independent frameworks?

This concludes today’s lecture. If you encounter any problems during the deployment process, feel free to ask questions in the comment section. If you found this lecture helpful, please share it with more friends and colleagues. See you in the next lecture.