00 Introduction to Spark Performance Tuning - Master These Pitfalls #

Hello, I’m Wu Lei, and I welcome you to explore the performance optimization of Spark applications with me.

In June 2020, Spark officially released a new version, jumping directly from 2.4 to 3.0. The highlight of this major upgrade lies in performance optimization, with the addition of new features such as Adaptive Query Execution (AQE), Dynamic Partition Pruning (DPP), and Extended Join Hints.
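
As a quick taste of what these features look like in practice, here is a minimal sketch (a sketch only, with hypothetical table and column names) of how AQE, DPP, and an extended Join Hint are typically switched on against the Spark 3.0 APIs.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of Spark 3.0's new switches: AQE and DPP are enabled via
// configuration, and the extended join hints are requested on the DataFrame.
// The table and column names (orders, users, user_id) are hypothetical.
val spark = SparkSession.builder()
  .appName("Spark3FeaturesSketch")
  .config("spark.sql.adaptive.enabled", "true")                          // Adaptive Query Execution
  .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true") // Dynamic Partition Pruning
  .getOrCreate()

val orders = spark.table("orders")
val users  = spark.table("users")

// Extended join hint: ask the optimizer for a shuffle hash join on the users side.
val joined = orders.join(users.hint("SHUFFLE_HASH"), Seq("user_id"))
joined.show(10)
```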

In fact, even before the release of version 3.0, Spark had already become the de facto standard for distributed data processing. In the Gartner Magic Quadrant for Data Science and Machine Learning, Databricks (the commercial, cloud-native platform built around Spark) was named a Leader for three consecutive years (2018-2020).

Naturally, Spark has also become a standard tool for major internet companies and plays an indispensable role in processing massive data. For example, ByteDance uses Spark to build data warehouses that serve almost all of its product lines, including TikTok, Toutiao, Xigua Video, and Huoshan Video. Similarly, Baidu has launched BigSQL based on Spark to provide nearly instantaneous ad hoc queries for a massive number of users.

It is foreseeable that the new features brought by this version upgrade will continue to make Spark dominate the big data ecosystem in the next 5 to 10 years.

Speaking of my connection with Spark, it can be traced back to 2014, when by chance I got involved in Spark research. After in-depth study, I became deeply fascinated by Spark’s efficient execution performance, and my career has been closely tied to Spark ever since. That is how I moved from data analysis and data mining into my current field of business intelligence and machine learning, constantly exploring the core value hidden in data.

Currently, I lead a team at FreeWheel responsible for the application and implementation of machine learning. All of our implemented and ongoing projects use Spark for data exploration, data processing, data analysis, feature engineering, and sample engineering. In addition, we also frequently train and deploy models on massive data using Spark.

If, like me, you are planning your career path in the field of data, mastering Spark will undoubtedly be one of your professional goals.

To master Spark, you need a versatile key called “Performance Tuning” #

Currently, Spark has five major application scenarios: massive batch processing, real-time streaming computation, graph computation, data analysis, and machine learning. Regardless of which direction you plan to specialize in, performance tuning is a crucial step in your career advancement.

Why is that? The reason is simple: for these five scenarios, improving execution performance is a must.

Graph computation and machine learning often require hundreds of iterations to converge; without performance guarantees, such computations simply cannot finish. Streaming computation and data analysis demand real-time responsiveness; without efficient execution, sub-second processing is out of reach.

Compared to the other scenarios, batch processing has the lowest requirements for execution efficiency. However, with data volumes growing by the day to terabytes or even petabytes, finishing massive data processing within hours without performance tuning is simply wishful thinking.

Therefore, I believe these five scenarios are like five doors, each leading to a different realm, and performance tuning is the “versatile key”: with it in hand, you can open any of these doors and explore a far broader world.

Why can’t performance tuning be done by copying? #

In fact, many developers around me are aware of this too, and they search for tutorials online. However, the materials currently available on Spark performance tuning are not very systematic, or they cover only a handful of common tricks. And when we simply copy the techniques shared by experts, the results often fall short of expectations. For example:

  • Why does performance get worse when I use RDD/DataFrame Cache, even though everything is computed in memory? (See the sketch after this list.)
  • Why don’t the tuning techniques that are touted online work for me?
  • Why is my CPU utilization still low even though I have set a sufficiently high degree of parallelism?
  • Why does my application still run out of memory even though I have allocated almost all the memory on the node to Spark?
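
To make the first of these questions concrete, here is a small sketch of the cache pitfall, assuming a hypothetical DataFrame read from a made-up Parquet path: caching data that is consumed only once adds materialization overhead and memory pressure instead of saving work.

```scala
import org.apache.spark.sql.SparkSession

// A sketch of the cache pitfall from the list above; the input path is hypothetical.
val spark = SparkSession.builder().appName("CachePitfallSketch").getOrCreate()
val df = spark.read.parquet("/data/events")

// Anti-pattern: cache() pays the cost of materializing the data in memory,
// but the cached copy is read by only one action, so nothing is saved.
val cachedOnce = df.cache()
cachedOnce.filter("event_type = 'click'").count()

// Caching pays off only when the same data is reused across multiple actions.
val reused = df.filter("event_type = 'click'").cache()
reused.count()                             // first action materializes the cache
reused.groupBy("user_id").count().show()   // later actions reuse it
reused.unpersist()
```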

These problems may seem simple, but they can’t be explained in just one or two sentences. They require us to delve into the core principles of Spark, constantly try out each API and operator, set different configuration parameters, and eventually find the best combination.

So, how can we do this? Next, I will share with you how I learned performance tuning.

When I first started using Spark, I found it remarkably efficient: business functionality that would take thousands of lines of MapReduce code could be implemented in just a few dozen lines!

Later, as customer requirements grew and, admittedly, out of my own “meticulousness,” I worked through practically all of the RDD APIs in order to make applications run faster: carefully studying the meaning and execution principles of each operator, summarizing the scenarios each one suits, and comparing the differences and trade-offs between functionally similar operators, such as map and mapPartitions, or groupByKey, reduceByKey, and aggregateByKey.
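
To give a flavor of this kind of comparison, here is a minimal sketch contrasting groupByKey and reduceByKey for a word count; the tiny in-memory dataset is made up purely for illustration.

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of one such operator comparison: groupByKey vs reduceByKey.
val spark = SparkSession.builder().appName("OperatorComparisonSketch").getOrCreate()
val sc = spark.sparkContext

val words = sc.parallelize(Seq("spark", "tuning", "spark", "shuffle", "spark"))
val pairs = words.map(word => (word, 1))

// groupByKey ships every (word, 1) pair across the shuffle before counting.
val countsViaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values on the map side first, so far less data is shuffled.
val countsViaReduce = pairs.reduceByKey(_ + _)

countsViaGroup.collect().foreach(println)
countsViaReduce.collect().foreach(println)
```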

In addition, I have consulted the Configuration page on the Spark official website countless times, compiled a list of performance-related configuration items, and run comparative experiments over and over to compare execution performance under different settings. Whenever a result contradicted my understanding, I would go back and digest the core principles of Spark again, from RDD and the scheduling system, to memory management and the storage system, to in-memory computing and Shuffle, repeating the process tirelessly.
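
As an illustration of what such a comparative experiment can look like, here is a hedged sketch that times the same aggregation under two different shuffle-partition settings; the input path and the values 200 and 1000 are arbitrary placeholders, not tuning recommendations.

```scala
import org.apache.spark.sql.SparkSession

// A sketch of one comparative experiment: the same aggregation timed under two
// different shuffle-partition settings. Path and values are placeholders.
val spark = SparkSession.builder().appName("ConfigExperimentSketch").getOrCreate()

def timeRun(shufflePartitions: Int): Long = {
  // spark.sql.shuffle.partitions is runtime-configurable, so it can change per run.
  spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions.toString)
  val start = System.nanoTime()
  spark.read.parquet("/data/events")     // hypothetical input
    .groupBy("user_id")
    .count()
    .foreach(_ => ())                    // action to force execution
  (System.nanoTime() - start) / 1000000  // elapsed milliseconds
}

println(s"200 shuffle partitions:  ${timeRun(200)} ms")
println(s"1000 shuffle partitions: ${timeRun(1000)} ms")
```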

Although there were many failures, the surprising performance improvements made the experience unforgettable. Later on, I shared my experience with colleagues around me, and consciously started organizing the cases I came across while helping them with tuning. From point to line, and from line to area, I gradually understood the framework of performance tuning, and eventually summarized a set of methodologies for performance tuning. This also led to the development of a performance-oriented development habit.

By following this set of methodologies, developers can systematically conduct performance tuning work, achieving better results with less effort. I hope to share it with you in this column.

Learn Fast and Learn Well #

Guided by this methodology, I have divided the column into three parts: Theory, Performance, and Practice.

Theory: Focus on the underlying principles of Spark to unravel the mysteries of performance optimization.

Spark has many principles, but I will concentrate on the core concepts that are closely related to performance optimization, including RDD, DAG, scheduling system, storage system, and memory management. I will strive to use the most appropriate stories and analogies with the least amount of content to help you grasp the core principles of these five concepts in the shortest possible time, laying a solid foundation for subsequent performance optimization.

Performance: Interpreting performance optimization techniques from multiple angles using real-world examples.

As mentioned earlier, Spark covers a wide range of use cases: massive batch processing, real-time streaming computation, graph computation, data analysis, and machine learning. Among all its sub-frameworks, however, Spark clearly invests the most in Spark SQL. Therefore, the performance part will consist of two sections.

One section covers general performance optimization techniques, including basic principles of application development, configuration settings, Shuffle optimization, and resource utilization. First, I will start from common examples and show you how to speed up execution without changing the logic of the code. Next, I will walk you through the configuration options that affect execution efficiency. Then we will analyze effective strategies for typical scenarios such as Shuffle and data joins. Finally, we will look at how to maximize resource utilization from the hardware perspective and improve the overall execution performance of Spark.

Although the development APIs and execution principles differ across application scenarios, the essence and methodology of performance optimization remain the same. These techniques are therefore not limited to any specific use case but apply to all Spark sub-frameworks.

The other section focuses on the field of data analysis, discussing optimization methods and techniques in Spark SQL with the help of built-in optimizations such as Tungsten and AQE, as well as typical scenarios such as data joins.

First, I will take you deep into Tungsten, the Catalyst optimizer, and the many new features released in Spark 3.0, so that you can fully exploit the optimization mechanisms Spark already provides and take performance to a higher level. Then, using typical data analysis scenarios such as data cleaning, data joins, and data transformation, I will guide you in formulating optimization strategies and methods, case by case.

It is worth mentioning that, as the development APIs of all sub-frameworks gradually converge on DataFrame, applications built on every sub-framework will benefit from the performance improvements in Spark SQL. In other words, although these optimization techniques revolve around data analysis, the thinking and methods behind them apply equally to the other sub-frameworks.

Practice: Building your own distributed application.

In the practice section, I will help you put our methodology and optimization techniques into practice, using the “Beijing Vehicle License Plate Lottery” data from 2011 to 2019 as an example. I will guide you step by step in building a distributed application and exploring the trends behind the license plate lottery from different perspectives. I believe that through this hands-on case, your understanding of performance optimization techniques and strategies will take a “quantum leap.”

In addition to that, I will also periodically address some hot topics: for example, the advantages of Spark compared to Flink and Presto; new features in Spark and new explorations of Spark in the industry. These will help us adapt to changes and seize opportunities.

Finally, I would like to say that I have always hoped to make learning an interesting and relaxed process. Therefore, in this column, I will use small stories and examples to help you understand the core principles of Spark, guide you to develop a performance-oriented mindset, and summarize performance optimization methods and techniques from different perspectives, allowing you to understand Spark as if you were reading a novel.

I also hope that you, like the protagonist in a novel, can use the “techniques and methodologies of optimization” as a secret martial arts manual, overcome various challenges, defeat the endless development problems in your professional career, and take your career to new heights.

Lastly, feel free to share your thoughts and raise your questions here; your encouragement is my motivation. Let’s pick up the versatile key of performance tuning together and embark on a new stage of our Spark careers!