01 Spark Begins with Big Data’s Hello World #

Hello, I am Wu Lei.

Starting with this lesson, we will work through the “Fundamental Knowledge” module of Spark to build a comprehensive understanding of its concepts and core principles. I won’t start by explaining basic concepts such as RDD and DAG. To be honest, these abstractions can be dry and boring, and they are difficult for someone new to Spark to grasp. So why not take a different approach and start with hands-on experience? Let’s use a small example to get an intuitive sense of what Spark can do.

It’s similar to learning a new programming language, where we often start with “Hello World.” I still remember that when I first learned programming, seeing “Hello World” printed on the screen excited me for the whole day. It gave me an inexplicable impulse of “I can change the world.”

In today’s lesson, we will start with the “Hello World” of Big Data and learn how to develop applications on top of Spark. However, the “Hello World” of Big Data is not as simple as printing a string on the screen. It involves counting the occurrences of words in a file and then printing the top 5 most frequent words. In the field, it’s called “Word Count”.

There are two main reasons why we choose Word Count as our first project in the world of Spark. First, the Word Count scenario is simple and easy to understand. Second, small as it is, Word Count covers all the essential elements: with this one little example we can explore many of Spark’s core principles and get up to speed quickly.

Alright, enough said. Let’s officially embark on the Word Count journey.

Preparation #

As the saying goes, even the cleverest cook cannot prepare a meal without ingredients. Before we can do Word Count, we need to prepare the source file.

The purpose of doing Word Count is to learn Spark, so the content of the source file is not important. Here, I extracted the introduction to Spark in Wikipedia as our source file. I saved it in the GitHub project associated with the course and named it “wikiOfSpark.txt”. You can download it from here.

In order to run the Word Count example, we also need to deploy the Spark runtime environment locally. Here, “locally” refers to any computing resources you have access to, such as servers, desktop computers, or laptops.

Deploying the Spark runtime environment locally is very simple, so don’t worry even if you have never worked with Spark before. Just follow these three steps, and we can complete the local deployment of Spark.

  1. Download the package: Download the package from the Spark official website, and choose the latest precompiled version.
  2. Unzip: Unzip the Spark package to any local directory.
  3. Configure: Add “${unzip_directory}/bin” to the PATH environment variable.

I have prepared a short video on the local deployment process; you can follow along with it directly.

Next, let’s confirm whether Spark is successfully deployed. Open a command line terminal and enter the “spark-shell --version” command. If the command successfully prints out the Spark version number, it means we have succeeded, just like this:

In the following practical exercises, we will demonstrate the process of executing Word Count using spark-shell. spark-shell is one of many ways to submit Spark jobs; we will introduce the others in detail in later lessons. For now, you can think of it as Spark’s equivalent of the Linux shell. spark-shell provides an interactive running environment (a REPL, or Read-Eval-Print Loop) that lets developers submit source code and obtain execution results immediately, in a “what you see is what you get” manner.
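
For example, once spark-shell starts, you can type a couple of one-liners at the prompt and see the results right away. A minimal sketch (the spark and sc variables are created for you automatically by spark-shell):

// Print the Spark version of the running session
spark.version

// Run a tiny distributed computation: sum the numbers 1 to 5
sc.parallelize(1 to 5).sum()   // returns 15.0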

However, it is important to note that spark-shell relies on the Java and Scala language environments during runtime. Therefore, to ensure the successful launch of spark-shell, you need to pre-install Java and Scala locally. The good news is that there is a wealth of information available online about installing Java and Scala. You can refer to those resources for installation. We will not go into detail about the installation steps for Java and Scala in this session.

Overview of Word Count Calculation Steps #

After the preparation, we can start writing code. However, before we start, let’s review the calculation steps of Word Count together, so that we have a clear understanding before we start coding.

As mentioned before, the purpose of Word Count is to count the occurrence of words in a file and print out the top 5 most frequent words. So the first step of Word Count is obviously to read the content of the file, otherwise what are we going to count?

The file we have prepared is wikiOfSpark.txt, which contains a simple introduction to Spark in plain text format. Here is a portion of its content:

We know that reading a file is usually done on a line-by-line basis. It is not difficult to notice that each line of wikiOfSpark.txt contains multiple words.

If we want to count at the “word” level, we need to tokenize the text on each line. After tokenization, each sentence in the file is broken down into individual words. This way, we can group and count the words. This is the process of Word Count, which mainly consists of the following 3 steps:

  1. Read the content: Call the Spark file reading API to load the content of the wikiOfSpark.txt file.
  2. Tokenization: Break down each line of text into words.
  3. Group and count: Group the words and count the occurrences.

After clarifying the calculation steps, we can now call the Spark development API to implement these steps in code and thus complete the development of the Word Count application.

As we all know, Spark supports a wide range of programming languages, such as Scala, Java, Python, and more. You can choose any of these languages based on your personal preference and development habits. Although the development APIs of different languages have slight differences in syntax, Spark’s support for each language is consistent in terms of functionality and performance. In other words, whether you implement Word Count in Scala or Python, the execution results of both versions are the same. Furthermore, under the same computing resources, the execution efficiency of both versions is also the same.

Therefore, for this Word Count example, the choice of programming language is not crucial. Let’s choose Scala. You may say, “I’m not familiar with Spark, let alone Scala. Will it be difficult to understand the Spark application code demonstrated in Scala right from the beginning?”

Actually, there is no need to worry. Scala syntax is concise, and the Scala implementation of Word Count will not exceed 10 lines of code. Moreover, for each line of Scala code in Word Count, I will explain and analyze it step by step. I believe that after going through the code with me, you will be able to quickly “translate” it into your familiar language, such as Java or Python. Additionally, the majority of Spark source code is implemented in Scala. Getting some understanding of Scala’s basic syntax will be helpful for you to read and study the Spark source code later.

Word Count Implementation #

Now that we have chosen a programming language, let’s take a look at how Word Count is implemented step by step: reading the content, tokenizing, and grouping and counting.

Step 1: Reading the Content #

First, we use the textFile method of SparkContext to read the source file, which in this case is wikiOfSpark.txt. The code is shown below:

import org.apache.spark.rdd.RDD

// The underscore "_" is a placeholder that represents the root directory of the data file
val rootPath: String = _
val file: String = s"${rootPath}/wikiOfSpark.txt"

// Read the file content
val lineRDD: RDD[String] = spark.sparkContext.textFile(file)

In this code, you may notice three new concepts: spark, sparkContext, and RDD.

Of these, spark and sparkContext are two different entry points for development:

  • spark is an instance of SparkSession, one of the entry points. In spark-shell, the SparkSession is created automatically by the system and exposed as the variable spark.
  • sparkContext is an instance of SparkContext, the other entry point.

As Spark evolved, SparkSession replaced SparkContext as the unified entry point starting from version 2.0. In other words, to develop a Spark application, you must first create a SparkSession. I will explain SparkSession and SparkContext in more detail in later lessons. For now, just remember that they are the required entry points.
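
To make this concrete: in spark-shell the entry points already exist, but in a standalone application you would create them yourself. A minimal sketch of what that looks like (the application name "WordCount" is just an illustrative choice):

import org.apache.spark.sql.SparkSession

// Create (or reuse) a SparkSession -- the unified entry point since Spark 2.0
// (when running outside spark-shell/spark-submit you may also need .master("local[*]"))
val spark = SparkSession.builder()
  .appName("WordCount")
  .getOrCreate()

// The underlying SparkContext is available from the session
val sc = spark.sparkContext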

Now let’s take a look at RDD. RDD stands for Resilient Distributed Dataset, which is a unified abstraction of distributed data in Spark. It defines a set of basic properties and processing methods for distributed data. We’ll delve into the definition, meaning, and purpose of RDD in the next lesson.

For now, you can think of RDD as an “array.” For example, in the code above, the variable lineRDD has a type of RDD[String]. You can temporarily treat it as an array with String elements, where each element represents a line in the file.
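
If you want to make the “array” analogy tangible in spark-shell, you can peek at lineRDD directly. A small sketch (it assumes lineRDD was created as above and the file path is valid):

// Count the number of lines ("elements") in the file
lineRDD.count()

// Print the first 3 lines
lineRDD.take(3).foreach(println)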

After obtaining the file content, the next step is tokenization.

Step 2: Tokenization #

“Tokenization” means breaking the elements of the “array” into words. To accomplish this, we can use the flatMap method of RDD. The flatMap operation can be logically divided into two steps: mapping and flattening.

What do these two steps mean? Let’s use the example of Word Count:

// Tokenization on a per-line basis
val wordRDD: RDD[String] = lineRDD.flatMap(line => line.split(" "))

To convert the line elements of lineRDD into words, we first need to split each line element using a delimiter (Split). Here, the delimiter is a space.

After splitting, each line element becomes an array of words, and the element type changes from String to Array[String]. This operation, which transforms each element, is called “mapping.”
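
For intuition, here is what splitting does to a single line in plain Scala; a sketch using a made-up line of text:

// Splitting one line on spaces yields an array of words
val line: String = "Apache Spark is a unified analytics engine"
val words: Array[String] = line.split(" ")
// words == Array("Apache", "Spark", "is", "a", "unified", "analytics", "engine")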

After mapping, the RDD type changes from RDD[String] to RDD[Array[String]]. If you consider RDD[String] an “array,” then RDD[Array[String]] is a “two-dimensional array”: each of its elements is itself an array of words.

Mapping RDD

To perform grouping on words, we still need to “flatten” this “two-dimensional array,” which means removing the nested structure and restoring the “two-dimensional array” to a “one-dimensional array,” as shown in the figure below.

Flattening RDD

So, with the help of the flatMap transformation, the lineRDD, which used to have lines as elements, is transformed into the wordRDD with words as elements.
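
To see the difference between plain mapping and flatMap, compare the two transformations side by side; a minimal sketch (the names nested and flat are only illustrative):

// map keeps the nested structure: one Array[String] per line
val nested: RDD[Array[String]] = lineRDD.map(line => line.split(" "))

// flatMap maps and then flattens: one String per word
val flat: RDD[String] = lineRDD.flatMap(line => line.split(" "))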

However, it’s worth noting that when splitting sentences using spaces, empty strings may be generated. Therefore, after the “mapping” and “flattening” operations, we need to filter out these empty strings. We can use the filter method of RDD to achieve this:

// Filter out empty strings
val cleanWordRDD: RDD[String] = wordRDD.filter(word => !word.equals(""))

In this way, during the tokenization phase, we obtain an RDD of words after filtering out empty strings. The type is RDD[String]. Now, we can prepare for grouping and counting.
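
If you are following along in spark-shell, you can quickly check the result of this phase. A small sketch (the exact output depends on the file’s content):

// Print the first 10 non-empty words produced by tokenization
cleanWordRDD.take(10).foreach(println)

// Equivalently, the filter could be written more idiomatically as:
// val cleanWordRDD: RDD[String] = wordRDD.filter(_.nonEmpty)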

Step 3: Grouping and Counting #

In the RDD development framework, aggregations such as counting, summing, and averaging depend on Key-Value-pair data elements, i.e., elements in the form of (Key, Value) tuples.

Therefore, before using an aggregation operator for grouping and counting, we need to convert RDD elements to (Key, Value) format, i.e., map RDD[String] to RDD[(String, Int)].

Here, we set all the Values to 1. As a result, for the same word, we can simply accumulate the Values in the subsequent counting operation.

The whole grouping-and-counting step can be written as a single chained expression:

// Grouping and counting the words
val wordCountRDD: RDD[(String, Int)] = cleanWordRDD.map(word => (word, 1)).reduceByKey(_ + _)

Here, the map transformation turns each word into a (Key, Value) pair, where the Key is the word itself and the Value is 1; this converts RDD[String] into RDD[(String, Int)]. The reduceByKey method then groups the pairs by Key and sums up the corresponding Values, producing an RDD[(String, Int)] in which the Key is the word and the Value is its count.

To see exactly what each part does, let’s break this expression into its two steps. First, we convert the RDD elements to (Key, Value) format:

// Convert RDD elements to (Key, Value) format
val kvRDD: RDD[(String, Int)] = cleanWordRDD.map(word => (word, 1)) 

In this way, the RDD is transformed from cleanWordRDD, which originally stored String elements, to kvRDD, which stores (String, Int) elements.

After the transformation, we can proceed to the actual group counting. Group counting consists of two steps: “grouping” and “counting”. Below, we use the aggregation operator reduceByKey to perform both at once.

For the key-value pair RDD kvRDD, reduceByKey first groups the elements by Key (i.e., by word). After grouping, each word has a corresponding list of Values. Then, using the user-defined aggregation function we provide, reduce aggregates the Values of each Key.

In this case, you can understand reduce as a computational step or method. When we provide the aggregation function, it will use the folding approach to transform a list containing multiple elements into a single element value, thus counting the occurrences of different elements.
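
A plain-Scala analogy may help here; a sketch of what reduce does to the Value list of one word that appears three times (this is local Scala, not Spark code):

// The Values collected for one word that occurs 3 times
val values: List[Int] = List(1, 1, 1)

// Folding the list with the aggregation function: 1 + 1 = 2, then 2 + 1 = 3
val count: Int = values.reduce((x, y) => x + y)   // count == 3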

In the Word Count example, the code for performing group counting using reduceByKey is as follows:

// Group and count by word
val wordCounts: RDD[(String, Int)] = kvRDD.reduceByKey((x, y) => x + y) 

As you can see, we pass the accumulation function (x, y) => x + y to the reduceByKey operator, which is an addition function. Therefore, after each word grouping, reduce will use the accumulation function to fold and compute all elements in the value list, ultimately converting the list into the frequency of the word. The computation process is the same for any word, as shown in the following diagram.

After the calculation, reduceByKey still returns an RDD of type RDD[(String, Int)]. However, unlike kvRDD, the value of each element in wordCounts records the frequency of each word. With this, we have completed the development and implementation of the main logic of Word Count.

At the end of the program, we need to sort wordCounts by frequency and print the top 5 most frequent words on the screen. The code is as follows:

// Print the top 5 words with the highest frequency
wordCounts.map{case (k, v) => (v, k)}.sortByKey(false).take(5)
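
A brief note on this line: sortByKey sorts by the Key, so we first swap each (word, count) pair into (count, word), sort in descending order (the false argument), and take the first 5. An equivalent sketch that avoids the swap, using sortBy instead:

// Sort directly by the count (the second field) in descending order, then take 5
wordCounts.sortBy(kv => kv._2, ascending = false).take(5)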

Code Execution #

After completing the application development, we can run the code in the local Spark deployment environment we have prepared. First, open a command line terminal and enter “spark-shell” to launch the interactive runtime environment, as shown in the following image.

Then, enter the code we developed into spark-shell one by one. For your convenience, I have organized the complete code implementation as follows:

import org.apache.spark.rdd.RDD

// Use an underscore "_" as a placeholder for the data file's root directory
val rootPath: String = _
val file: String = s"${rootPath}/wikiOfSpark.txt"

// Read the file content
val lineRDD: RDD[String] = spark.sparkContext.textFile(file)

// Tokenize by line
val wordRDD: RDD[String] = lineRDD.flatMap(line => line.split(" "))
val cleanWordRDD: RDD[String] = wordRDD.filter(word => !word.equals(""))

// Convert RDD elements to (Key, Value) format
val kvRDD: RDD[(String, Int)] = cleanWordRDD.map(word => (word, 1))
// Group and count by word
val wordCounts: RDD[(String, Int)] = kvRDD.reduceByKey((x, y) => x + y)

// Print the top 5 words with the highest frequency
wordCounts.map{case (k, v) => (v, k)}.sortByKey(false).take(5)

After entering the above code into spark-shell, spark-shell will finally print the top 5 words with the highest frequency on the screen:

In the Wikipedia text introducing Spark, the most frequent words are “the,” “Spark,” “a,” “and,” and “of.” Except for “Spark,” the other 4 words are common stop words, so it is not surprising that they are at the top of the list.

Congratulations! With Spark, we have completed the development and implementation of the “Hello World” of the big data field, and you have entered the realm of big data development!

Key Takeaways #

In today’s session, we focused on Word Count and explored and experienced Spark application development. The first thing you need to master is the local deployment of Spark, so that you can quickly familiarize yourself with Spark and get a “first impression” of it through spark-shell. To deploy Spark locally, you need to follow 3 steps:

  • Download the installation package from the Spark official website and choose the latest precompiled version;
  • Unpack the Spark installation package to any local directory;
  • Configure “${unpacked directory}/bin” in the PATH environment variable.

Then, we analyzed and implemented our first Spark application together: Word Count. In our example, the Word Count program first needs to count the occurrences of words in a file and then print the top 5 words with the highest frequency. The implementation process consists of 3 steps:

  • Reading content: Call the Spark file reading API to load the content of the wikiOfSpark.txt file;
  • Tokenization: Split sentences into individual words;
  • Grouping and counting: Group words and count the occurrences.

Maybe you are not familiar with the RDD API, or you have never worked with Scala, but that’s okay. By completing this “Hello World” development journey in big data, you have embarked on a new adventure. In the upcoming courses, let’s explore Spark step by step, like discovering a new continent. Keep up the good work!

Practice for Each Lesson #

In the implementation of the Word Count code, we used a variety of RDD operators, such as map, filter, flatMap, and reduceByKey. Besides these, do you know any other commonly used RDD operators? (Hint: you can refer to the official website for more information.)

Furthermore, can you talk about the common characteristics or common points of these operators mentioned above?

Feel free to share your answer in the comments section; I’ll be waiting for you there.

If this lesson has been helpful to you, feel free to share it with your friends. See you in the next lesson!