
In-Depth Interpretation of Spark Job on Spark UI - Part 1 #

Hello, I am Wu Lei.

It has been a while since the main part of this column ended, but I have kept up with your comments throughout that time. Today, I am back with the content many of you have been asking for.

In the lecture on the essence of performance tuning (Lecture 2), we mentioned the performance tuning methodology.

One of the key methods is to identify performance bottlenecks based on expert experience or runtime diagnostics. Spark UI, as the built-in runtime monitoring interface of Spark, is an important tool that we must master. In addition, as the course progresses, many students have provided feedback and asked me to include more content on Spark UI.

Considering the above points, I will supplement the course content with an extra lesson on Spark UI, hoping that it will be helpful to you.

In our daily development work, we often encounter situations where Spark applications fail to run or the execution efficiency falls short of expectations. To find the root cause of such problems, we can use Spark UI to obtain the most direct and intuitive clues. By comprehensively examining the Spark application, we can quickly locate the problem.

If we consider failed or inefficient Spark applications as “patients”, then the numerous metrics about the application in Spark UI are the “health report” of these patients. With a variety of metrics, developers, as “doctors”, can quickly identify the problem areas based on their experience.

In today’s lecture, let’s take the example of the “multiplier calculation” application in the car license plate lottery (you can review the detailed content in Lecture 30) and use illustrations to gradually familiarize ourselves with Spark UI. We will explore the key metrics featured in Spark UI, understand their meanings, and discover what insights they can provide to developers.

It should be noted that the introduction and explanation of Spark UI involve a large number of illustrations, code snippets, and metric explanations. Therefore, to reduce your learning burden, I have divided the discussion of Spark UI into two parts: Part 1 and Part 2, based on the entry types (primary entry and secondary entry). The explanation of the primary entry is relatively simple and direct, and today’s lecture will focus on this part. The explanation for the secondary entry will be covered in the next lecture.

Preparation #

Before we officially introduce Spark UI, let’s briefly explain the environment, configuration, and code used in the illustrated case. You can refer to the details provided here to reproduce every page of the “multiplier calculation” case in Spark UI. Then, together with today’s explanation, you can get familiar with each page and metric in a more intuitive and in-depth way.

Of course, if you don’t have a suitable execution environment at hand, don’t worry. The special feature of this lecture is that there are many images. I have prepared a large number of pictures and tables to help you thoroughly understand Spark UI.

Since the volume of the car lottery data is not large, the resource requirements for the “multiplier calculation” case are modest. The resources used in the case are as follows:

Image

Next is the code. In the lecture on developing the car lottery application, we implemented the calculation logic of “multiplier calculation” step by step. Let’s review it here.

// Functions used below (col, count, lit, max); note that these must be
// imported explicitly even in spark-shell
import org.apache.spark.sql.functions.{col, count, lit, max}

// HDFS root path address
val rootPath: String = "hdfs://hostname:9000"
 
// Applicant data
val hdfs_path_apply = s"${rootPath}/2011-2019 Car Lottery Data/apply"
val applyNumbersDF = spark.read.parquet(hdfs_path_apply)
// Create Cache and trigger cache calculation
applyNumbersDF.cache.count()
 
// Winner data
val hdfs_path_lucky = s"${rootPath}/2011-2019 Car Lottery Data/lucky"
val luckyDogsDF = spark.read.parquet(hdfs_path_lucky)
// Create Cache and trigger cache calculation
luckyDogsDF.cache.count()
 
val result05_01 = applyNumbersDF
    // Join based on carNum
    .join(luckyDogsDF.filter(col("batchNum") >= "201601").select("carNum"), Seq("carNum"), "inner")
    .groupBy(col("batchNum"), col("carNum"))
    .agg(count(lit(1)).alias("multiplier"))
    .groupBy("carNum")
    // Get the maximum multiplier
    .agg(max("multiplier").alias("multiplier"))
    .groupBy("multiplier")
    // Group and count by multiplier
    .agg(count(lit(1)).alias("cnt"))
    // Sort by multiplier
    .orderBy("multiplier")
 
result05_01.write.mode("Overwrite").format("csv").save(s"${rootPath}/results/result05_01")

Today, we made one modification on this basis: to populate the Storage tab, we “forcibly” added a Cache to both the applyNumbersDF and luckyDogsDF DataFrames. As we know, caching a dataset that is referenced only once is normally unnecessary, so please keep this in mind.
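Incidentally, if you reproduce this case and no longer need those caches after inspecting the Storage page, they can be released with the standard unpersist API. A minimal sketch:

// Release the caches once they are no longer needed
applyNumbersDF.unpersist()              // lazy release
luckyDogsDF.unpersist(blocking = true)  // optionally block until all cached blocks are removed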

After reviewing the code, let’s take a look at the configuration settings. In order for Spark UI to display running and completed applications, we need to set the following configuration settings and start the History Server.

Image

// SPARK_HOME represents the Spark installation directory
${SPARK_HOME}/sbin/start-history-server.sh
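For reference, here is a minimal sketch of the application-side settings that make a finished application visible to the History Server. The keys spark.eventLog.enabled and spark.eventLog.dir are standard Spark properties, but the log directory below is only a placeholder, and in the spark-shell scenario of this lecture you would typically supply them via spark-defaults.conf or --conf rather than building the session in code:

// Sketch for a standalone application (not spark-shell): enable event logging
// so the History Server can replay the application after it finishes
import org.apache.spark.sql.SparkSession

val sparkWithEventLog = SparkSession.builder()
  .appName("Multiplier Calculation")
  .config("spark.eventLog.enabled", "true")
  // Placeholder path; it must match the directory the History Server reads from
  .config("spark.eventLog.dir", "hdfs://hostname:9000/spark-logs")
  .getOrCreate()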

Alright, everything is ready. Next, let’s start spark-shell, submit the code for “multiplier calculation”, and then turn our attention to port 8080 on Host1, that is, port 8080 of the node where the Driver is located.

Spark UI Primary Entries #

Today’s story starts with the primary entries of Spark UI. As mentioned earlier, port 8080 is the entrance, and we can access Spark UI from there.

Opening Spark UI, the first thing that catches the eye is the default Jobs page. The Jobs page records the Actions involved in the application, as well as operations related to data reading and movement. Each Action corresponds to a Job, and each Job corresponds to a complete piece of distributed computation. We will explore the Jobs page later; for now, let’s focus on the top navigation bar of Spark UI, which lists all the primary entries, as shown in the figure below.

Image

As you can see, the leftmost part of the navigation bar is the Spark Logo and version number, followed by 6 primary entries. I have summarized the function and purpose of each entry in the table below. Skim it for an overall picture first; we will go through each entry in detail later.

To put it simply, these 6 entries are like the different categories of examination items in a health report, such as internal medicine, surgery, blood routine, and so on. Next, let’s open each item of the health report one by one and see how the “multiplier calculation” application is doing.

However, following the principle of starting with the simplest, we will not go through the entries in the order they appear in Spark UI. Instead, we will read the “health report” in the order Executors > Environment > Storage > SQL > Jobs > Stages.

Among them, the first three entries are detail pages and do not have sub-entries; the last three entries are preview pages and require accessing sub-entries to obtain more detailed content. Obviously, detail pages are more direct compared to preview pages. So let’s start with Executors and learn about the computational load of the application.

Executors #

The Executors tab consists of two main parts, “Summary” and “Executors”, as shown below. Both contain the same set of metrics: “Executors” details them for each executor individually, while “Summary” is simply the sum of the corresponding metrics across all executors.

Image

Let’s take a look at what metrics Spark UI provides to quantify the workload of each executor. For convenience, I will use a table to explain the meaning and purpose of these metrics.

Image

It is not difficult to see that the Executors page clearly records the amount of data consumed by each executor, as well as its consumption of hardware resources such as CPU, memory, and disk. Based on this information, we can easily determine whether the load is balanced across executors and, in turn, whether the application suffers from data skew.

As for the specific values of the metrics on the Executors page, they are the aggregation of task-level execution metrics at the executor granularity. The detailed interpretation of these metrics will therefore be covered when we expand on the Stages secondary entry. For now, you can browse the values of the different metrics to get an intuitive feel for the workload of the “multiplier calculation” application.

In fact, except for two metrics, RDD Blocks and Complete Tasks, those specific values are not particularly significant. Taking a closer look at these two indicators, you will find that RDD Blocks is 51 (total count), while Complete Tasks (total count) is 862.

When we talked about RDD parallelism before, we mentioned that RDD parallelism is the number of partitions of an RDD, and each partition corresponds to a task. Therefore, RDD parallelism is consistent with the number of partitions and the number of distributed tasks. However, the 51 and 862 in the screenshot are obviously not in the same order of magnitude. What’s going on here?

I’ll leave it as a question for you to ponder. You may spend some time thinking about it. If you haven’t figured it out, it’s okay. We will continue discussing this issue in the comments section.

Environment #

Next, let’s talk about the Environment. As the name suggests, the Environment page records various environment variables and configuration information, as shown in the figure below.

Image

To keep the focus on the essentials, the figure above does not show all the information contained in the Environment page. Broadly speaking, it covers 5 categories of environment information; to facilitate discussion, I have listed them in the table below.

Image

Clearly, among these 5 categories of information, Spark Properties is the most important one, which records all the Spark configuration settings in effect at runtime. Through Spark Properties, we can confirm whether the runtime settings are consistent with our expectations, thus eliminating stability or performance issues caused by incorrect configuration settings.
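Besides reading the Environment page, you can also cross-check a few key settings directly in spark-shell. The sketch below simply uses the spark.conf API with ordinary Spark property names:

// Inspect the settings that are actually in effect for this session
spark.conf.get("spark.executor.memory", "not set explicitly")  // fall back to a default string if unset
spark.conf.get("spark.sql.shuffle.partitions")                 // 200 unless overridden
spark.conf.getAll.filter { case (k, _) => k.startsWith("spark.memory") }  // memory-related properties that were set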

Storage #

After discussing Executors and Environment, let’s look at the last detail page among the primary entries: Storage.

Image

The Storage detail page records the details of each distributed cache (RDD Cache, DataFrame Cache), including cache level, cached partition count, cache fraction, memory size, and disk size.

In Lecture 16, we introduced the different cache levels supported by Spark. A cache level is a combination of storage medium (memory, disk), storage format (object vs. serialized bytes), and replica count. For DataFrames, the default level is “Disk Memory Deserialized 1x Replicated”, as shown in the figure above: the storage media are memory plus disk, the format is deserialized objects, and there is a single replica.
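As a quick refresher on how those levels map to code: cache() on a DataFrame is shorthand for persisting at the MEMORY_AND_DISK level, and other combinations can be requested explicitly through StorageLevel. A brief sketch, reusing applyNumbersDF from the earlier code:

import org.apache.spark.storage.StorageLevel

// Equivalent to applyNumbersDF.cache(): memory plus disk, deserialized objects, one replica
applyNumbersDF.persist(StorageLevel.MEMORY_AND_DISK)

// A few other levels, combining medium, format, and replica count differently:
//   StorageLevel.MEMORY_ONLY        memory only, deserialized objects, one replica
//   StorageLevel.MEMORY_ONLY_SER_2  memory only, serialized bytes, two replicas
//   StorageLevel.DISK_ONLY          disk only, one replica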

Image

The Cached Partitions and Fraction Cached respectively record the number of partitions successfully cached in the dataset and the percentage of these cached partitions out of all partitions. When the Fraction Cached is less than 100%, it means the distributed dataset is not fully cached in memory (or on disk). In this case, we need to be cautious about the performance impact of cache eviction.

The Size in Memory and Size on Disk columns then show, more intuitively, how the cached dataset is distributed between memory and disk. From the figure above, we can see that due to memory constraints (3 GB per executor), almost all of the lottery data is cached on disk, with only 584 MB held in memory. Frankly, a cache like this brings little performance benefit for repeated access to the dataset.

Based on the detailed information provided by the Storage page, we can adjust the configuration settings related to memory, such as spark.executor.memory, spark.memory.fraction, and spark.memory.storageFraction, in order to make targeted adjustments to the Storage Memory.
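As an illustration only (the numbers below are placeholders, not recommendations), such adjustments are made when the application is launched, for example via spark-submit or spark-shell with --conf flags, or in the session builder for a standalone application. They cannot be changed inside a running spark-shell, and spark.executor.memory in particular must be set before the executors are requested:

// Illustrative values: larger executors, and a unified memory region tilted toward storage
import org.apache.spark.sql.SparkSession

val tunedSpark = SparkSession.builder()
  .config("spark.executor.memory", "6g")          // this case used 3 GB per executor
  .config("spark.memory.fraction", "0.7")         // heap share for execution plus storage (default 0.6)
  .config("spark.memory.storageFraction", "0.6")  // portion of that share protected for storage (default 0.5)
  .getOrCreate()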

SQL #

Next, let’s continue with the SQL page of the primary entries. When our application includes DataFrames, Datasets, or SQL, the SQL page of Spark UI displays the corresponding content, as shown in the following figure.

Image

Specifically, the primary-entry page records the Spark SQL execution plan for each Action. We need to click the hyperlink in the “Description” column to enter the secondary page and see the details of each execution plan. We will expand on this in the next lesson, which covers the secondary-entry pages.

Jobs #

Similarly, the Jobs page records the execution status of the application at the granularity of Actions. To understand the details of a job, we again need to follow the secondary-entry link in its “Description” column. For now, a preliminary impression is enough; we will delve into it in the next lesson.

Image

The SQL page contains only 3 Actions: save (saving the computation results), count (counting the applicant data), and count (counting the winners data). Yet if you look at the earlier overview screenshot, you will find that the Jobs page seems to contain many more “Actions”. The main reason is that, on the Jobs page, Spark UI also treats data reading, accessing, and movement as a kind of “Action”. For example, the Jobs with Job Ids 0, 1, 3, and 4 are actually reading the source data (both the metadata and the dataset itself).

As for the additional Job with Job Id 7 called “save”, you can think about it in relation to the last line of code. I’ll leave it as a cliffhanger for now and give you enough time to think about it. Let’s meet in the comments.

result05_01.write.mode("Overwrite").format("csv").save(s"${rootPath}/results/result05_01")

Stages #

As we know, each job consists of multiple stages. The Stages page of Spark UI lists all the stages involved in the application, and these stages belong to different jobs. To see which stages belong to which job, you need to go through the “Description” link of each job on the Jobs page.

Image

The Stages page is more of a preview, and in order to view the details of each stage, you also need to enter the stage details page through “Description” (we will go into more detail in the next lesson).

Alright, up to this point, we have discussed the different pages in the navigation bar to varying degrees. To summarize, Executors, Environment, and Storage are detail pages that developers can use to quickly understand the overall computational load, the runtime environment, and the details of dataset caching in the cluster. SQL, Jobs, and Stages are more of a listing display, and to delve into the details, it is necessary to enter the secondary entry.

As mentioned at the beginning, we will explore the details of the secondary entry in the next lesson, so stay tuned.

Key Review #

Alright, today’s lesson comes to an end here. Today’s content is quite extensive, covering various and complex metrics. Simply listening to my explanation once is far from enough. You need to combine it with your daily development work to explore and experience more. Keep it up!

In today’s lecture, starting from the simplest and most direct pages, we walked through the detail pages and preview pages of the primary entries in the order “Executors -> Environment -> Storage -> SQL -> Jobs -> Stages”. For the content of these pages, I have summarized the key points you need to master in the following table for your reference.

Image

Practice for Each Lesson #

Today’s reflection questions have already been discussed in the course. The first one is about why the number of RDD Blocks is not the same as the number of Complete Tasks on the Executors page. The second one is about why there is an extra save Action on the Jobs page.

Feel free to discuss and explore with me in the comments section. Also, feel free to recommend this lesson to friends and colleagues who may need it.