00 Introductory Words - Entering the World of Spark in Three Steps #

Hello, I’m Wu Lei. Welcome, and thank you for joining me to learn Spark.

In the past seven years, my career has revolved around Spark. In 2014, Spark swept through the big data ecosystem with great momentum, and that is when I first came to know it. Driven by strong curiosity, I spent a week rewriting my company’s (IBM) ETL jobs in Spark.

To my surprise, the Spark version of those ETL jobs improved performance by an order of magnitude. I have been fascinated by Spark ever since, tirelessly learning and practicing everything related to it: official documentation and technical blogs, source code and best practices, hands-on experiments and large-scale applications. Along the way:

  • At IBM, I used Spark Streaming to build real-time analyses of user behavior for business users.
  • At Lenovo Research Institute, I used Spark SQL + Hive to build a company-level data warehouse, serving all business departments.
  • At Weibo, I built a Weibo machine learning framework based on Spark MLlib. The configuration-based development framework freed hundreds of algorithm engineers from the laborious tasks of data processing, feature engineering, and sample engineering, allowing them to focus their valuable time and energy on algorithm research and model optimization.
  • At FreeWheel, we used Spark for data exploration, data processing, feature engineering, sample engineering, and model training across all of our machine learning projects, delivering one project after another to the business.

To master Spark more thoroughly, I make a point of writing down, as part of my daily work, the knowledge I have learned, the skills I have acquired, the pitfalls I have hit, and the detours I have taken. Through this continuously iterating cycle of “learn, use, write,” I have gradually organized scattered development techniques and knowledge points into a structured knowledge system.

In March 2021, I collaborated with Geek Time on the column “Spark Performance Tuning in Practice,” sharing the tips, insights, and best practices I have accumulated in performance tuning with students who need them.

I am gratified that the column has been widely praised. Many students reported that the tuning techniques in the column improved the performance of their Spark jobs several-fold. At the same time, however, some students reported that, as beginners in big data, they found much of the content difficult to follow.

In fact, many people around me face the same challenge: some come from a machine learning or artificial intelligence background, some are preparing to move into big data development from backend development, DBA work, or even other industries, and some want to build enterprise-level data warehouses on open-source frameworks. They all need a way to get started with Spark quickly.

“Fast” and “Full”, Making Spark a Standard for Internet Companies #

However, you may wonder, “Is Spark still popular, or is it outdated?” In fact, after more than a decade of development, Spark has grown from a big data rookie into a cornerstone of the data application field. In the Magic Quadrant for Data Science and Machine Learning Platforms published by Gartner, the IT research and consulting firm, Databricks (the company behind the cloud-native, commercial version of Spark) was named a Leader for three consecutive years (2018–2020).

Moreover, with its many advantages, Spark has become the standard for the vast majority of internet companies. For example, ByteDance builds data warehouses based on Spark, which serve almost all of its product lines, including Douyin, Toutiao, Xigua Video, and Huoshan Video. Meituan introduced Spark as early as 2014 and gradually extended it to its core products such as the Meituan App, Meituan Takeout, and Meituan Dache. Netflix uses Spark to build end-to-end machine learning pipelines for its recommendation engine, serving over two hundred million subscribers.

In fact, no internet company can do without typical business scenarios such as recommendation, advertising, and search. Recommendation and search drive traffic, improve user experience, maintain user stickiness, and fuel user growth, while advertising is one of the most important ways to monetize that traffic. Behind all of these scenarios you will find Spark at work: it is used for ETL and stream processing, for building enterprise-level data analysis platforms, and for creating end-to-end machine learning pipelines.

So we cannot help but ask: in the rapidly developing field of data applications, with similar competitors constantly emerging and evolving, how has Spark stood out from the fierce competition and held an enduring lead? In my opinion, this comes down to Spark’s two major advantages: “fast” and “full”.

“Fast” has two aspects: fast development and fast execution. Spark supports multiple development languages, including Python, Java, Scala, R, and SQL, and provides a rich set of development operators across the RDD, DataFrame, and Dataset APIs. These features let developers assemble data applications like building blocks, quickly and with ease.
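To make the “building blocks” idea concrete, here is a minimal sketch in Scala (the language used for this course’s projects). The tiny in-memory dataset, the column names, and the local-mode settings are illustrative assumptions only; the sketch simply shows the same aggregation expressed once through chained DataFrame operators and once through SQL.

```scala
import org.apache.spark.sql.SparkSession

object BuildingBlocksSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; cluster deployments will differ.
    val spark = SparkSession.builder()
      .appName("BuildingBlocksSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A tiny in-memory dataset standing in for real business data.
    val orders = Seq(("book", 2), ("pen", 5), ("book", 1)).toDF("item", "qty")

    // The same question asked two ways: chained DataFrame operators ...
    val viaApi = orders.groupBy("item").sum("qty")

    // ... or plain SQL over the same data, whichever the developer prefers.
    orders.createOrReplaceTempView("orders")
    val viaSql = spark.sql("SELECT item, SUM(qty) AS total FROM orders GROUP BY item")

    viaApi.show()
    viaSql.show()
    spark.stop()
  }
}
```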

Around me there are many colleagues who had no big data background yet needed to start developing with Spark from scratch. At first, they often had to copy, paste, and refer to other people’s code to get their work done. After just three months of intensive practice, however, most of them could independently and proficiently implement all kinds of business requirements. This is naturally a credit to the very high development efficiency of the Spark framework itself.

Furthermore, thanks to Spark Core and Spark SQL, the two computation engines at the heart of the framework, the data applications we develop enjoy decent execution performance even without much adjustment or optimization. This is mainly due to the Spark community’s continuous polishing of the underlying engines, which lets developers focus on implementing business logic without worrying about framework-level design details.

After discussing Spark’s “fast”, let’s talk about its “full”. “Full” refers to Spark’s comprehensive support for various computation scenarios. We know that in the field of data applications, there are several computation scenarios, including batch processing, stream processing, data analysis, machine learning, and graph computing.

Batch processing, the foundation of big data, goes without saying. More than ever before, big data workloads demand low latency, so the basic concepts and working principles of stream processing are essential skills for every big data practitioner. And with artificial intelligence at its current peak, data analysis and machine learning are priorities we must focus on as well.

For these computation scenarios, Spark provides dedicated sub-frameworks: Structured Streaming for stream processing, Spark SQL for data analysis, Spark MLlib for machine learning, and so on. This comprehensive scenario coverage lets developers build different types of data applications within a single computational framework, “without leaving home,” instead of chasing a new technology or framework for every new type of application.

It is not hard to see that Spark has many advantages and a profound influence on the internet industry. For anyone who wants to make a mark in the field of data applications, Spark can be considered a required course.

Whether you are a big data engineer focused on application development and framework-level secondary development, or you work in one of the increasingly sought-after roles of data analyst, data scientist, or machine learning algorithm researcher, Spark is an indispensable skill you must master.

However, although Spark has many advantages, getting started with it is not easy. People around me often voice the following complaints:

  • There are plenty of learning materials online, but most of them are scattered knowledge points, which makes it hard to build a structured knowledge system.
  • There are quite a few books about Spark, but many of them merely recite the principles, which makes for dry, difficult reading.
  • To learn Spark, you have to learn Scala first, but Scala’s syntax is obscure and hard to understand, which is discouraging.
  • There are too many development operators to remember, and when a new business requirement arrives, it is hard to know where to start.

Since Spark is essential to the career development of data application developers, and yet getting started with it has real difficulties and pain points, how exactly should we go about it?

How to Get Started with Spark? #

If Spark were compared to a racing car, then every developer would be a racing driver preparing to get behind the wheel. To drive this car well, the first step is to familiarize ourselves with the basic operations of driving a vehicle, such as how to shift gears, where the accelerator, clutch, and brake pedals are located, and so on.

Furthermore, in order to harness the performance advantages of the racing car, we need to understand how it works, such as its drive system, brake system, and so on. Only by understanding its working principles can we flexibly manipulate the combination of throttle, clutch, and brake.

Finally, after grasping the basic operations and working principles of the racing car, we need to summarize general techniques for different driving scenarios, such as highways, mountain roads, deserts, and so on. By following this three-step process, we can transform from a racing novice to an experienced driver.

Learning Spark also requires such a “three-step process”. The first step is to become familiar with the commonly used development APIs and operators in Spark, just like becoming familiar with the basic operations of driving a car. After all, through these APIs and operators, we can start and drive Spark’s distributed computing engine.

Next, in order to make the Spark car run smoothly, we must have a deep understanding of how it works. Therefore, in the second step, I will explain the core principles of Spark to you.

The third step, much like adapting to different driving scenarios, is to understand and become familiar with Spark’s different computational frameworks (Spark SQL, Spark MLlib, and Structured Streaming) so that we can handle different data application scenarios, such as data analysis, machine learning, and stream processing.

Corresponding to this three-step process, I have designed the course as four modules. The first is the foundation module, which covers the first two steps: becoming familiar with the development APIs and mastering the core principles. The following three modules then cover, in turn, Spark’s computational frameworks for different data scenarios: Spark SQL, Spark MLlib, and Structured Streaming. Since the graph computing framework GraphFrames is less widely used in industry, the course does not cover it.

The relationship between these four modules and the “three-step process” is shown in the following image:

From the image, you can see that since Spark SQL plays the role of both a data analysis framework and a new-generation optimization engine, other computational frameworks can also share the “performance dividend” brought by Spark SQL. Therefore, when explaining Spark SQL, I will also cover some basic operations and principles from the first and second steps.

In these four modules, we start each time from a small project and work step by step through the operators, development APIs, working principles, and optimization techniques it involves. Although the project code is written in Scala, you don’t need to worry at all: I will annotate the code line by line and give it a “nanny-level” explanation. The first module covers the fundamentals.

In this module, we start with a small project called “Word Count”. Using Word Count as a guide, we will go through the meaning, usage, caveats, and applicable scenarios of the common RDD operators, so that you can master them in one go. I will also use light-hearted stories to explain Spark’s core principles, including the RDD programming model, Spark’s process model, the scheduling system, the storage system, shuffle management, and memory management, so that you can absorb them as easily as reading a novel.
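As a small taste of what the Word Count project looks like, here is a minimal RDD-based sketch in Scala. The input path "input.txt" and the local-mode settings are placeholder assumptions; the course version will walk through each operator in detail.

```scala
import org.apache.spark.sql.SparkSession

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; cluster settings will differ.
    val spark = SparkSession.builder()
      .appName("WordCountSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // "input.txt" is a hypothetical path; replace it with your own file.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))   // split each line into words
      .filter(_.nonEmpty)         // drop empty tokens
      .map(word => (word, 1))     // pair each word with a count of 1
      .reduceByKey(_ + _)         // aggregate counts per word

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```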

The second module is about Spark SQL. Here I start with a small project called “Car Number Plate Lottery” to familiarize you with the Spark SQL development API. Building on the same project, I will then explain Spark SQL’s core principles and optimization process. Finally, we will focus on the parts of Spark SQL most relevant to data analysis, such as data transformation, cleansing, joins, grouping, aggregation, and sorting.
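To preview the flavor of that analysis work, here is a minimal DataFrame sketch in Scala. The file path and the column names (batchNum, carNum) are hypothetical placeholders, not the course project’s actual dataset.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object LotterySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("LotterySketch")
      .master("local[*]")
      .getOrCreate()

    // "apply.parquet" and the columns batchNum / carNum are hypothetical;
    // the course project will define its own schema.
    val applicants = spark.read.parquet("apply.parquet")

    val perBatch = applicants
      .filter(col("carNum").isNotNull)                      // cleansing: drop bad records
      .groupBy(col("batchNum"))                             // grouping
      .agg(countDistinct(col("carNum")).as("applicants"))   // aggregation
      .orderBy(col("batchNum").asc)                         // sorting

    perBatch.show(5)
    spark.stop()
  }
}
```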

In the third module, we will learn Spark’s machine learning sub-framework: Spark MLlib.

In this module, we start with a small project called “House Price Prediction” to get a first look at regression models in machine learning and the basic usage of Spark MLlib. I will also introduce the typical scenarios in machine learning, explore Spark MLlib’s rich feature-processing functions with you, survey the models and algorithms Spark MLlib supports, and show how to build end-to-end machine learning pipelines. Finally, I will explain how combining Spark with XGBoost helps developers tackle most regression and classification problems.
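For orientation, here is a minimal sketch of an end-to-end Spark MLlib regression pipeline in Scala. The CSV path and the columns area, bedrooms, and price are hypothetical; the course project will use its own dataset, features, and models.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

object HousePriceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("HousePriceSketch")
      .master("local[*]")
      .getOrCreate()

    // "train.csv" and the columns area / bedrooms / price are hypothetical.
    val train = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("train.csv")

    // Assemble raw numeric columns into a single feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("area", "bedrooms"))
      .setOutputCol("features")

    // A plain linear regression as a stand-in for the course's models.
    val lr = new LinearRegression()
      .setFeaturesCol("features")
      .setLabelCol("price")

    // Chain the stages into an end-to-end pipeline and fit it.
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(train)
    model.transform(train).select("price", "prediction").show(5)
    spark.stop()
  }
}
```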

In the final module of the course, we will learn Spark’s stream processing framework: Structured Streaming.

In this module, we will focus on how Structured Streaming ensures both semantic consistency and data consistency, and on how to handle data association (joins) in stream processing. We will demonstrate typical streaming computation scenarios through the classic pairing of Kafka and Spark.
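As a small preview of that Kafka + Spark pairing, here is a minimal Structured Streaming sketch in Scala. The broker address, topic name, and console sink are illustrative assumptions, and running it requires the spark-sql-kafka connector on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object KafkaStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaStreamingSketch")
      .master("local[*]")
      .getOrCreate()

    // Broker address and topic name ("events") are hypothetical placeholders.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // Count messages per key as a stand-in for a real streaming computation.
    val counts = events
      .select(col("key").cast("string"))
      .groupBy(col("key"))
      .count()

    // Write the running counts to the console; real jobs would pick a durable sink.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```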

After working through the three steps of familiarizing yourself with the development APIs, mastering the core principles, and exploring the sub-frameworks, you will have built your own Spark knowledge system and truly stepped through the door of Spark application development.

For the vast majority of data application needs, I believe you will be able to handle them with ease and quickly deliver a distributed application that meets business requirements, runs stably, and performs well.

Finally, feel free to share your thoughts, ask questions, and leave me messages here. Your encouragement is my motivation. The three-step roadmap is set, so let’s join hands and complete the journey of getting started with Spark, easily and happily!

I firmly believe that mastering Spark will help you stand out in written tests, interviews, and daily work, adding luster to your career development!