
02 The Essence of Performance Tuning: With So Many Tuning Techniques, Where Do We Start? #

Hello, I’m Wu Lei.

In the previous lesson, we discussed the necessity of performance tuning and concluded that although Spark itself runs efficiently, as developers, we still need to tune the performance of our applications.

So, how should we go about performance tuning? With hundreds or thousands of lines of application code and nearly a hundred Spark configuration options, where should we start? I believe that in order to understand how to approach performance tuning, we must first understand the essence of performance tuning.

Therefore, in today’s lesson, we will start with a counterexample that is easy to grasp, and together we will explore and summarize the essence of performance tuning. Ultimately, we will help you establish a systematic methodology for performance tuning.

A Case of “Preconceived Notions” #

In typical ETL scenarios, we often need to perform various transformations on data. Sometimes, because the business requirements are complex, we need to write User Defined Functions (UDFs) to implement specific transformation logic. However, both Databricks’ official blog and countless Spark technical articles online advise us not to implement business logic with custom UDFs, but to use Spark’s built-in SQL functions whenever possible.

In my daily work, I have noticed that this advice is applied almost reflexively in code reviews: whenever reviewers come across a custom UDF, they ask the developer to rewrite the business logic with built-in SQL functions, as if by conditioned reflex.

The developers themselves usually find the suggestion reasonable, so they spend a great deal of time refactoring the business code with SQL functions. Unfortunately, after all that work, the end-to-end execution performance of the ETL jobs shows no significant improvement. This is the classic dilemma of effort without proportional return: a lot of time invested in optimization, with little to show for it.

The main reason, I believe, is that the reviewers’ understanding of performance tuning remains superficial and has not yet formed a systematic methodology. To establish one, we must explore the essence of performance tuning. Otherwise, developers are like hamsters trapped in a maze: running hard, but never finding the way out.

Since the essence of performance optimization is so important, what exactly is it?

The Essence of Performance Tuning #

Exploring the essence of any phenomenon follows the same process: observe individual cases, attribute their causes, and then generalize.

Let’s first analyze the counterexample above: why did refactoring the UDFs with SQL functions not deliver the results the books promised?

Is the advice itself wrong? Definitely not. By comparing execution plans, we can clearly see the difference between UDFs and SQL functions. Spark SQL’s Catalyst optimizer understands every step a SQL function performs, which leaves ample room for optimization. The computation logic encapsulated in a UDF, by contrast, is a black box to Catalyst: apart from wrapping the UDF in a closure and shipping it to the executors, there is little optimization Catalyst can do.

So is it that UDFs carry no performance overhead compared with SQL functions? Not true either. We can run a simple performance unit test: pick any custom UDF from the ETL application, rewrite it with SQL functions, prepare unit-test data, and compare the running times of the two implementations in a single-machine environment. In most cases, the UDF implementation is slower, typically by around 3% to 5%. So, as you can see, UDFs do carry a performance overhead.

Given this, why doesn’t optimizing UDF with SQL functions significantly improve the overall performance of end-to-end ETL applications?

According to the barrel theory, the shortest plank determines how much water a barrel can hold; raising the other planks does nothing, because the shortest plank is the bottleneck of the barrel’s capacity. For the ETL application in question, rewriting UDFs as SQL functions had minimal impact on execution performance precisely because the UDFs were not the shortest plank. In other words, the bottleneck of the application’s end-to-end performance was not the developers’ custom UDFs.
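The barrel analogy can be put in rough numbers with an Amdahl-style estimate. Assuming, purely for illustration, that the UDF-bearing stage accounts for 10% of end-to-end runtime and the SQL-function rewrite makes that stage 5% faster (both figures hypothetical), the overall gain is well under 1%:

```python
# Hypothetical figures: the UDF stage is 10% of total runtime,
# and the rewrite speeds that stage up by 5%.
stage_share = 0.10    # fraction of end-to-end runtime spent in the UDF stage
stage_speedup = 1.05  # the stage runs 5% faster after the rewrite

# Amdahl-style estimate of the new total runtime (normalized to 1.0).
new_total = (1 - stage_share) + stage_share / stage_speedup
print(f"end-to-end speedup: {1 / new_total:.4f}x")  # ~1.0048x, i.e. under 0.5%
```

A week of refactoring for a fraction of a percent is exactly the “effort without proportional return” described above.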

Combining the analysis above, we can summarize the essence of performance tuning into four points:

  1. Performance tuning is not a one-time deal; filling in one short plank may result in other planks becoming new bottlenecks. Therefore, it is a dynamic and continuous process.
  2. Whether a tuning technique pays off depends on whether it targets the barrel’s long planks or its bottleneck. Working on the bottleneck achieves twice the result with half the effort; working on the long planks achieves half the result with twice the effort.
  3. Performance tuning methods and techniques do not have fixed rules and are not constant. They need to dynamically switch as the short plank of the barrel changes.
  4. The process of performance tuning converges to a state where all planks are equal and there are no bottlenecks.

Based on this understanding of the essence of performance tuning, we can explain some common phenomena in daily work. For example, the same tuning trick often works wonders for one person and does nothing for another. Or, when we come across a blog post in which a Spark expert summarizes optimization experience, we try every point meticulously and find the effect far less dramatic than claimed. This does not mean the expert’s best practices are empty talk; it means the points they summarized simply did not hit your bottleneck.

What are the approaches to identifying performance bottlenecks? #

You may ask, “Since the key to optimization lies in identifying bottlenecks, how can I identify performance bottlenecks?” In my opinion, there are at least two approaches: prior expert experience and posterior runtime diagnosis. Let’s look at each of them one by one.

Expert experience refers to being able to roughly determine potential performance bottlenecks based on past practical experience during code development or code review. Clearly, such experts are not easy to find. A developer needs to accumulate a lot of experience to become an expert. If you have such a person around you, be sure not to miss out on their expertise!

However, you might say, “If I had such an expert around me, I wouldn’t need to subscribe to this column.” That’s okay. We have a second approach: runtime diagnosis.

There are numerous means and methods for runtime diagnosis. For example, for task execution, Spark UI provides a rich visual panel that displays detailed runtime status data such as DAGs, stage divisions, execution plans, executor load balancing, GC time, and memory cache consumption. For hardware resource consumption, developers can use tools like Ganglia or system-level monitoring tools such as top, vmstat, iostat, iftop, etc. to monitor hardware resource utilization in real time. Particularly regarding GC overhead, developers can import GC logs into JVM visualization tools to get an overview of the frequency and magnitude of GC during task execution.

Regarding these two approaches to identifying performance bottlenecks, experts are like seasoned doctors of traditional Chinese medicine, who can quickly pinpoint the bottleneck through the four classic diagnostics of looking, listening, asking, and pulse-taking. Runtime diagnosis is more like Western medicine’s instruments, such as stethoscopes, X-rays, and CT scanners, which locate problems quickly through quantifiable metrics. Neither approach is superior; in fact, combining them is more efficient, just as a highly skilled, experienced doctor holding your blood test and ultrasound results can identify the lesion with far more certainty.

You might say, “Despite having these two approaches, I still don’t know where to start specifically in terms of bottleneck identification!” Based on past tuning experience, I believe that starting from the perspective of hardware resource consumption is often a good choice. We all know that, from a hardware perspective, computational loads can be divided into CPU-intensive, memory-intensive, and IO-intensive tasks. If we can determine which type our application belongs to, we can naturally narrow down the search range and make it easier to identify performance bottlenecks.
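As a toy, single-process illustration of this classification idea (not a Spark tool, just the underlying principle), comparing CPU time with wall-clock time already hints at whether a task is CPU-bound or spending its time waiting on IO. The workloads below are stand-ins; the `profile` helper is hypothetical and Unix-only:

```python
import resource
import time

def profile(task):
    """Return (cpu_seconds, wall_seconds) for a callable (Unix only)."""
    wall0 = time.perf_counter()
    usage0 = resource.getrusage(resource.RUSAGE_SELF)
    task()
    usage1 = resource.getrusage(resource.RUSAGE_SELF)
    cpu = (usage1.ru_utime + usage1.ru_stime) - (usage0.ru_utime + usage0.ru_stime)
    return cpu, time.perf_counter() - wall0

# CPU-bound stand-in: CPU time tracks wall-clock time closely.
cpu, wall = profile(lambda: sum(i * i for i in range(1_000_000)))
print(f"compute: cpu={cpu:.3f}s wall={wall:.3f}s")

# IO/wait-bound stand-in: wall-clock time far exceeds CPU time.
cpu, wall = profile(lambda: time.sleep(0.2))
print(f"waiting: cpu={cpu:.3f}s wall={wall:.3f}s")
```

In a real Spark job, the same intuition is applied per executor via the Spark UI, Ganglia, or the `top`/`vmstat`/`iostat` family mentioned above.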

However, in practical development, not every workload can be neatly classified as intensive in one resource. For example, typical scenarios in data analysis, such as shuffles and joins, place high demands on CPU, memory, disk, and network alike, and a bottleneck can arise in any of them. Therefore, when identifying performance bottlenecks, we need to examine typical scenarios in addition to hardware resources.

Methods and Means for Performance Tuning #

Suppose we have successfully identified the performance bottleneck through runtime diagnosis. How exactly should we optimize? The methods and techniques for tuning Spark performance are rich and varied. Enumerating them one by one would not only be tedious but would also do little to build a systematic approach, so we need to organize them, just as a doctor treating an illness often combines oral medicine, topical treatment, and even surgery to effect a cure.

Therefore, in my opinion, performance tuning in Spark can be approached from two levels: application code and Spark configuration.

The application code level refers to making performance-conscious choices during development. As we compared in the previous lesson, two pieces of code with identical functionality can differ significantly in performance. We therefore need to know the typical techniques and common pitfalls of the development phase, so that we avoid baking performance problems into the code.

You are probably familiar with Spark configuration options. The Spark official website lists nearly a hundred configuration options, which can be overwhelming. However, not all configuration options are directly related to performance tuning. Therefore, we need to identify and categorize them.
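For a sense of the categories involved, here is a hypothetical `spark-defaults.conf` fragment; the keys are real Spark configuration options governing CPU, memory, and shuffle behavior, while the values are purely illustrative, not recommendations:

```
# CPU: cores available to each executor
spark.executor.cores            4
# Memory: executor heap size, and the fraction shared by execution and storage
spark.executor.memory           8g
spark.memory.fraction           0.6
# Shuffle: number of partitions on the reduce side of Spark SQL shuffles
spark.sql.shuffle.partitions    200
# Serialization: Kryo is usually faster than Java serialization
spark.serializer                org.apache.spark.serializer.KryoSerializer
```

Options like these sit squarely in the performance-related subset; many of the remaining ones concern deployment, security, or logging and can be set aside when tuning.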

In general, in our day-to-day tuning work, we often start from the application code and Spark configuration levels. As they say: “The problem comes from the code, and the solution lies in the code.” Tuning at the application code level is actually a process of capturing and removing performance bugs. Spark configuration options give developers great flexibility, allowing them to adjust the utilization of resources on different hardware to adapt to the runtime workload of the application.

As for the specific tuning methods and techniques at the application code and Spark configuration levels, as well as their comprehensive application to typical scenarios and resource utilization, I will explain them one by one in the performance-tuning part of this column. Starting from different levels, scenarios, and perspectives, I will distill a set of tuning methods and insights to help you carry out performance tuning methodically.

When Does Performance Tuning End? #

The essence of performance tuning tells us that it is a dynamic, continuous process, during which the means of optimization must be switched as the bottleneck shifts. So here comes the question: when is performance tuning done? Even a continuous process must have a convergence criterion; it cannot go on endlessly.

In my opinion, the ultimate goal of performance optimization is to seek synergy and balance among all the hardware resources involved in computation, to achieve a balanced state without bottlenecks.

Let’s take the case of Qubole, a big data service company, as an example. Recently, they integrated the XGBoost machine learning framework into Spark for model training. They compared the execution performance under different configurations with the same hardware resources, data sources, and computing tasks.

From the table below, we can clearly see that the best performing training tasks are not those with a CPU utilization rate squeezed to 100% and the memory set to the maximum, but those with the most balanced hardware resource configuration.

Summary #

Only by understanding the essence of Spark performance tuning and forming a systematic methodology can we avoid the dilemma of investing time without proportional output.

I summarize the essence of performance tuning in 4 points:

  1. Performance tuning is not a one-time deal. Patching one shortcoming may lead to new shortcomings in other areas. Therefore, it is a dynamic and continuous process.
  2. The effectiveness of performance tuning techniques and methods depends on whether they target the long board or the bottleneck of the “bucket”. Focusing on the bottleneck will achieve twice the result with half the effort, while focusing on the long board will achieve half the result with twice the effort.
  3. There are no fixed rules or unchanging methods for performance tuning techniques and tricks. They need to be dynamically switched as the shortcoming of the “bucket” changes.
  4. The process of performance tuning converges to a state where all boards are level and there are no bottlenecks.

I summarize the systematic methodology of performance tuning in 4 principles:

  1. Identify performance bottlenecks through different approaches, such as expert experience or runtime diagnostics.
  2. Apply tuning techniques and methods from different levels (application code, Spark configuration) and different perspectives (typical scenarios, hardware resources).
  3. Dynamically and flexibly switch tuning methods between different levels as performance bottlenecks change.
  4. Let the process of performance tuning converge to a state where different hardware resources achieve a balance and there are no bottlenecks at runtime.

Daily practice #

  1. What other “mechanical tuning” techniques have you encountered?

  2. Do you think that the convergence state of performance tuning, the state where hardware resources are balanced and there are no bottlenecks, needs to be quantified? How can it be quantified?

What are your thoughts on performance optimization? Feel free to leave a comment in the comment section. See you next class!