21 Spark UI - How to Efficiently Locate Performance Issues #

Hello, I’m Wu Lei.

So far, we have worked through two modules: the basics and Spark SQL. That means we have taken the first two of the “Three Steps” to getting started with Spark. Congratulations! Along the way, we have gradually come to see that Spark Core and Spark SQL, as Spark’s execution engine and optimization engine, carry all types of computational workloads, such as batch processing, stream processing, data analysis, and machine learning.

It is obvious that the stability and efficiency of Spark Core and Spark SQL determine the overall “health” of Spark jobs or applications. However, in our daily development work, we often encounter failures or inefficient execution of Spark applications. For such problems, in order to find the root cause, we often need to rely on Spark UI to obtain the most direct and intuitive clues.

If we consider failed or inefficient Spark applications as “patients,” then the various metrics about applications in Spark UI are like the “medical reports” of these patients. By combining various metrics, developers, acting as “doctors,” can quickly locate the “diseases” based on their experience.

In today’s class, let’s take the “Multiplier and Successful Draw Rate Analysis” application from the car license plate lottery case (you can review the details in [Lesson 13]) as an example. Through illustrations, we will get to know Spark UI step by step: what key metrics it offers, what those metrics mean, and what insights they can give developers.

It should be noted that explaining Spark UI involves a large number of illustrations, code snippets, and metric descriptions, which can be overwhelming. To lighten the load, I have split the coverage of Spark UI into two parts based on entry type: primary entries and secondary entries. The primary entries are relatively simple and straightforward, and they are the focus of today’s class; the secondary entries will be covered in the next class.

Preparations #

Before we formally introduce Spark UI, let’s briefly explain the environment, configuration, and code used in the visual examples. You can refer to the details provided here to reproduce each interface in the “Multiplier and Successful Draw Rate Analysis” case in Spark UI. Then, combined with today’s lecture, you can familiarize yourself with each page and metric in a more intuitive and in-depth way.

Of course, if you don’t have a suitable execution environment at hand, it’s okay. The characteristic of this lecture is that it contains a lot of images. I have prepared a large number of pictures and tables for you later on. These will help you thoroughly understand Spark UI.

Since the volume of the car lottery data is small, our resource requirements are not high. The resources used in the “Multiplier and Successful Draw Rate Analysis” case are as follows:

Image

Next is the code. In the lecture on “Development of Car Lottery Application,” we implemented the calculation logic step by step for the “Multiplier and Successful Draw Rate Analysis.” Let’s review it together.

import org.apache.spark.sql.DataFrame
// col, count, lit, and max are used below and need to be imported explicitly
import org.apache.spark.sql.functions.{col, count, lit, max}

val rootPath: String = _
// Applicant data
val hdfs_path_apply: String = s"${rootPath}/apply"
// spark is the default SparkSession instance in spark-shell
// Read the source file using the read API
val applyNumbersDF: DataFrame = spark.read.parquet(hdfs_path_apply)

// Successful draw data
val hdfs_path_lucky: String = s"${rootPath}/lucky"
// Read the source file using the read API
val luckyDogsDF: DataFrame = spark.read.parquet(hdfs_path_lucky)

// Keep only the winning records from batch 201601 onward and extract the "carNum" field
val filteredLuckyDogs: DataFrame = luckyDogsDF.filter(col("batchNum") >= "201601").select("carNum")

// Inner join the applicant data with the winning data, using "carNum" as the join key
val jointDF: DataFrame = applyNumbersDF.join(filteredLuckyDogs, Seq("carNum"), "inner")

// Group by batchNum and carNum, and calculate the multiplier factor
val multipliers: DataFrame = jointDF.groupBy(col("batchNum"),col("carNum"))
.agg(count(lit(1)).alias("multiplier"))

// Group by carNum and keep the maximum multiplier factor
val uniqueMultipliers: DataFrame = multipliers.groupBy("carNum")
.agg(max("multiplier").alias("multiplier"))

// Group by multiplier factor and calculate the number of people
val result: DataFrame = uniqueMultipliers.groupBy("multiplier")
.agg(count(lit(1)).alias("cnt"))
.orderBy("multiplier")

result.collect

Today, we make one small change on top of this code: to give the Storage tab something to show, we “force” both the applyNumbersDF and luckyDogsDF DataFrames to be cached. Strictly speaking, caching a dataset that is referenced only once is unnecessary, so please keep that in mind.
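If you want to reproduce the Storage tab yourself, a minimal sketch of that extra caching step looks like the following. The count calls are only there to trigger execution so that the caches actually materialize.

// Force both DataFrames into the cache so that they appear on the Storage tab.
// As noted above, caching datasets referenced only once brings no real benefit;
// we do it here purely to populate the Storage page.
applyNumbersDF.cache()
applyNumbersDF.count()

luckyDogsDF.cache()
luckyDogsDF.count()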

After reviewing the code, let’s take a look at the configuration. In order to allow Spark UI to display running and completed applications, we also need to set the following configuration items and start the History Server.

Image

// SPARK_HOME represents the Spark installation directory
${SPARK_HOME}/sbin/start-history-server.sh
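For reference, the configuration items in the table above are normally written to spark-defaults.conf before spark-shell is launched. Purely as a sketch of which keys are involved, here is how they would look if you were building the SparkSession yourself in a standalone application; the event-log directory is an assumption and should point to a location that the History Server can also read.

import org.apache.spark.sql.SparkSession

// Enable event logging so that finished applications can be replayed by the History Server.
// The HDFS path below is a placeholder; replace it with your own event-log directory.
val spark = SparkSession.builder()
  .appName("Multiplier and Successful Draw Rate Analysis")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.dir", "hdfs://hostname:9000/spark-logs")
  .getOrCreate()

// On the History Server side, spark.history.fs.logDirectory should point to the same
// location; that setting lives in the History Server's spark-defaults.conf, not here.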

Alright, everything is ready. Next, let’s start spark-shell, submit the code for “Multiplier and Successful Draw Rate Analysis,” and then turn our attention to port 8080 on Host1, the node where the driver runs.

Spark UI Main Entry #

Today’s story starts with the main entry of Spark UI. As mentioned earlier, port 8080 is the entry point of Spark UI, and that is where we access it.

When you open Spark UI, the first page you see is the default Jobs page. The Jobs page records the Actions involved in the application, together with jobs related to reading and moving data; each Action corresponds to one Job. We will explore the Jobs page later. For now, let’s focus on the top navigation bar of Spark UI, which lists all of its primary entry points, as shown in the image below.

Image

The leftmost part of the navigation bar shows the Spark logo and the version number. To its right are six primary entry points, each with its own function and purpose. I have summarized their functions and purposes in the table below. Take in the table as a whole first; we will discuss each entry point in detail later.

Image

In simple terms, these six entry points are like the different examination items in a medical report, such as internal medicine, surgery, and routine blood tests. Next, let’s go through each item of the “medical report” in the order Executors > Environment > Storage > SQL > Jobs > Stages.

Among them, the first three entry points are detail pages and do not have secondary entry points. On the other hand, the last three entry points are overview pages and require access to secondary entry points to get more detailed information. Clearly, detail pages are more direct than overview pages. Next, let’s start with Executors to understand the computational workload of the application.

Executors #

The Executors tab contains two main sections, “Summary” and “Executors”. The two sections share the same set of metrics: “Executors” breaks them down for each individual executor, while “Summary” simply sums those metrics across all executors.

Image

Let’s take a look at the metrics provided by Spark UI to quantify the workload of each executor. To make it easier to follow, I have listed these metrics and their meanings in the table below.

Image

It’s easy to see that the Executors page clearly records the amount of data consumed by each executor, as well as their consumption of CPU, memory, and disk resources. Based on this information, we can easily determine if there is any workload imbalance between different executors and identify the potential for data skew in the application.

As for the specific values of the metrics on the Executors page, they are simply executor-level aggregations of task-level execution metrics. The detailed interpretation of these metrics is therefore left to the Stages section, and we won’t go into it here. Take some time to browse the values of the different metrics in the context of the “Multiplier and Successful Draw Rate Analysis” application and get an intuitive feel for these numbers.

In fact, there is nothing particularly special about these specific values, except for two metrics: RDD Blocks and Complete Tasks. If you take a closer look at these two metrics, you will notice that RDD Blocks is 51 (total) and Complete Tasks (total) is 862.

Previously, when we talked about RDD parallelism, we mentioned that RDD parallelism is the number of partitions of the RDD, and each partition corresponds to a task. Therefore, RDD parallelism is consistent with the number of partitions and the number of distributed tasks. However, the values 51 and 862 in the screenshot are clearly not in the same order of magnitude. Why is that?

Let’s leave this as a thought exercise for now. Take some time to think about it. If you haven’t figured it out, it’s okay. We will continue discussing this question in the comments.
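One more practical note before we move on: the metrics shown on the Executors page are also exposed through Spark’s monitoring REST API, which is handy when you want to collect them programmatically. Below is a minimal sketch; the host and port are assumptions, so use your own driver’s UI address.

import scala.io.Source

// List the application(s) known to this UI instance, including their IDs.
val apps = Source.fromURL("http://localhost:4040/api/v1/applications").mkString
println(apps)

// With an application ID in hand, the per-executor metrics from the Executors page
// are available under /api/v1/applications/<app-id>/executors on the same host and port.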

Environment #

Next, let’s talk about the Environment. As the name suggests, the Environment page records various environment variables and configuration information, as shown in the image below.

Image

To keep it focused, I haven’t shown all the information included in the Environment page. In terms of categories, it includes five types of environment information, which I have listed in the table below for your convenience.

Image

Clearly, among these five types of information, Spark Properties is the most important: it records all the Spark configuration settings that take effect at runtime. Through Spark Properties, we can confirm whether the runtime settings match our expectations and thereby rule out stability or performance issues caused by misconfiguration.
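If you want to cross-check what the Environment page reports, a quick sketch from the spark-shell side looks like this; the configuration key below is just an example.

// Read the effective value of a single configuration item (example key).
println(spark.conf.get("spark.sql.shuffle.partitions"))

// Or dump everything explicitly set on the SparkContext's configuration,
// which should line up with the Spark Properties section of the Environment page.
spark.sparkContext.getConf.getAll.foreach { case (k, v) => println(s"$k = $v") }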

Storage #

After discussing Executors and Environment, let’s take a look at the last detail page of the primary entry: Storage.

Image

The Storage detail page records the details of each distributed cache (RDD cache, DataFrame cache), including the storage level, the number of cached partitions, the cache fraction, and the sizes in memory and on disk.

In [Lesson 8], we introduced the different storage levels Spark supports: each is a combination of storage medium (memory, disk), storage format (objects or serialized bytes), and replica count. For DataFrames, the default level is “Disk Memory Deserialized 1x Replicated”, as shown in the image above, meaning the storage medium is memory plus disk, the storage format is deserialized objects, and there is a single replica.
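For comparison, here is a minimal sketch of caching with an explicitly chosen storage level rather than the default; MEMORY_ONLY is used purely as an illustration.

import org.apache.spark.storage.StorageLevel

// cache() on a DataFrame is shorthand for persist(StorageLevel.MEMORY_AND_DISK).
// To pick a different trade-off, pass the storage level explicitly.
// If applyNumbersDF was already cached with another level, call unpersist() first.
val memoryOnlyApplicants = applyNumbersDF.persist(StorageLevel.MEMORY_ONLY)
memoryOnlyApplicants.count() // an action is needed to actually materialize the cache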

Image

Cached Partitions and Fraction Cached record the number of partitions that have been successfully cached and the proportion of these cached partitions to all partitions, respectively. When Fraction Cached is less than 100%, it means that the distributed dataset is not fully cached in memory (or on disk). In such cases, we need to be cautious about the potential performance impact of cache eviction.

The Size in Memory and Size in Disk provide a more intuitive distribution of the dataset’s cache in memory and on disk, respectively. From the image above, we can see that due to limited memory (3GB/Executor), almost all of the lottery data is cached on disk, with only 584MB of data cached in memory. To be honest, this kind of cache does not bring substantial performance benefits for repeated access to the dataset.

Based on the detailed information provided on the Storage page, we can make targeted adjustments to memory-related configuration items, such as spark.executor.memory, spark.memory.fraction, and spark.memory.storageFraction, so that Storage Memory is used more effectively.
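These settings are normally fixed at submit time (in spark-defaults.conf or via --conf) rather than changed on a running application. Purely as a sketch of where the knobs live, with placeholder values rather than recommendations:

import org.apache.spark.sql.SparkSession

// Placeholder values for illustration only; tune them against your own workload.
val tunedSpark = SparkSession.builder()
  .config("spark.executor.memory", "6g")         // total JVM heap per executor
  .config("spark.memory.fraction", "0.6")        // share of heap for execution + storage
  .config("spark.memory.storageFraction", "0.5") // share of the unified pool reserved for storage
  .getOrCreate()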

SQL #

Next, let’s continue with the SQL page of the primary entry. When our application includes DataFrame, Dataset, or SQL, the Spark UI’s SQL page will display the corresponding content, as shown in the image below.

Image

Specifically, the primary entry page records the Spark SQL execution plan of each Action, organized Action by Action. To see the details of an execution plan, we need to click the hyperlink in its “Description” column and enter the secondary page. We will expand on this in the next lesson, when we cover the secondary entry detail pages.
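As a side note, you can also inspect an execution plan without opening the UI at all. A quick sketch from spark-shell is shown below; explain() prints the physical plan that underlies the DAG visualized on the SQL page.

// Print the plan for the final aggregation directly in spark-shell.
// The "formatted" mode is available from Spark 3.0 onward; a bare explain() works everywhere.
result.explain("formatted")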

Jobs #

Similarly, on the Jobs page, Spark UI records the execution details of each job, again organized Action by Action. To understand a job’s details, we likewise have to follow the secondary entry link in its “Description” column. Just get a preliminary impression for now; we will dig into it in the next lesson.

Image

The SQL page contains only three Actions: save (saving the computation results), count (counting the application numbers), and count (counting the winning numbers). Compared with that, the screenshot of the overview page shows that the Jobs page seems to have gained quite a few extra “Actions” out of thin air.

The main reason is that, on the Jobs page, Spark UI also treats reading, accessing, and moving data as a kind of “Action”, for example the jobs with Job Id 0, 1, 3, and 4 in the image. These jobs are actually reading the source data (both the metadata and the dataset itself).

As for the extra job with Job Id 7, think about it in connection with the last line of code, shown below. I’ll keep the answer a secret for now and give you enough time to mull it over. See you in the comments.

result05_01.write.mode("Overwrite").format("csv").save(s"${rootPath}/results/result05_01")

Stages #

As we know, each job consists of multiple stages, and the Stages page of Spark UI lists every stage of the application, across all of its jobs. To see which stages belong to which job, you need to go through the secondary entry under the Jobs page’s “Description” column.

Image

The Stages page is more of a preview: to view the details of each stage, you again need to enter the stage detail page via its “Description” column (we will discuss this in detail in the next lesson).

Alright, that’s it; we have now walked through the different pages in the navigation bar. To summarize, Executors, Environment, and Storage are detail pages that let developers quickly understand the cluster’s overall compute load, the runtime environment, and the details of dataset caching, respectively. SQL, Jobs, and Stages are more of a list-style display; to understand their details, you need to go into the secondary entries.

As mentioned at the beginning, we will discuss the secondary entry in the next lesson. Stay tuned.

Key Points Review #

Alright, that brings us to the end of today’s lesson. Today’s content is quite extensive and involves a wide range of metrics. Listening to my explanation once is far from enough; you still need to explore these pages in your daily development work and build up experience. Keep it up!

In today’s lecture, we started with the simple and direct primary entries and introduced the detail pages and overview pages of each primary entry in the order “Executors -> Environment -> Storage -> SQL -> Jobs -> Stages”. For the content of these pages, I have summarized the key points you need to master in the table below, for your reference at any time.

Image

Practice for Each Lesson #

Both of today’s reflection questions were already raised during the lesson. First, on the Executors page, why is there such a discrepancy between the number of RDD Blocks and the number of Complete Tasks? Second, on the Jobs page, why is there an extra save Action at the end?

Feel free to discuss and exchange ideas with me in the comments section. Also, please feel free to recommend this lecture to friends or colleagues who may find it useful.