
23 Spark MLlib - Starting with Housing Price Prediction #

Hello, I am Wu Lei.

Starting from today’s lecture, we will enter the third module of the course: Spark MLlib machine learning. In the current hot field of data science, machine learning, and artificial intelligence, accumulating knowledge in machine learning is beneficial for expanding our horizons and providing new avenues for career development.

In this module, we will first start with a small project called “House Price Prediction” to get a preliminary understanding of machine learning and the basic usage of Spark MLlib. Next, we will focus on two key steps in machine learning: feature engineering and model tuning. While diving deeper into Spark MLlib, we will further optimize the model performance of “House Price Prediction” to make the price predictions more accurate.

After becoming familiar with the key steps, we will then discuss the general approach of efficiently building machine learning pipelines within the framework of Spark MLlib. Alright, enough talking. Let’s take a look at the “House Price Prediction” project together.

To ensure the authority and representativeness of the project, I have chosen the “House Prices - Advanced Regression Techniques” competition on Kaggle (a data science competition platform). The requirements of this project are to train a house price prediction model based on 79 attribute features of houses and historical house prices, and validate the model’s prediction performance on a test set.

Data Preparation #

Although the requirements of the project are quite clear, you may say, “I don’t have a background in machine learning. What are these features, models, test sets, and performance validations mentioned above? I have no idea what they mean. How am I supposed to follow the rest of the course?” Don’t worry; as the course progresses, I will explain these concepts step by step.

Next, let’s first get a visual understanding of the housing data in the project.

The housing data records housing transactions in Iowa, USA from 2006 to 2010, covering 79 house attributes and the corresponding transaction prices at the time. You can download it from the Data page of the competition project.

After downloading and decompressing the data, we will obtain four files: data_description.txt, train.csv, test.csv, and sample_submission.csv. These four files are small in size, with a total size of no more than 5MB, and their contents and meanings are shown in the following table.

Image

Among them, train.csv and test.csv have the same schema, both containing 79 housing attribute fields and a transaction price field. The description file details the meanings and value ranges of the 79 fields. The only difference between train.csv and test.csv is their purpose. train.csv is used for training the model, while test.csv is used for validating the model’s performance. The sample_submission.csv file is used to submit competition results, but since we don’t plan to participate in the competition for now, we can temporarily ignore this file.

Now that we have mentioned some terms related to machine learning, such as “training data,” “test data,” and “model performance,” let’s briefly introduce machine learning to accommodate students who lack a background in machine learning.

Introduction to Machine Learning #

Before formally introducing machine learning, though, let’s first consider how humans learn, and then look at the similarities between machines and humans when it comes to learning.

As we grow up, we constantly absorb experience and lessons, whether from books or from what we have lived through, distill them into general principles for dealing with people and the world, and apply those principles to the rest of our lives. This is how human learning and growth usually happen.

Image

In fact, the process of machine learning is similar. Based on historical data, machines use certain algorithms to try to discover and capture general patterns from the data. Then, they apply these patterns to newly generated data to make predictions and judgments.

Image

Now that we have a basic understanding of machine learning, let’s give it a formal definition to better understand it in a more rigorous way.

Machine learning refers to a computational process in which, given training data, a prior data distribution model is chosen, and then model parameters are continuously adjusted automatically with the help of optimization algorithms in order to make the model approximate the original distribution of the training data.

This continuous adjustment of model parameters is called “model training”. Model training relies on optimization algorithms that adjust the model parameters automatically in an iterative manner based on past computation errors (loss). As model training is an ongoing process, a convergence condition is naturally needed to end the training process. Once the convergence condition is triggered, the model training is declared complete.
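
To make the idea of automatically adjusting parameters based on the loss a little less abstract, here is a tiny, self-contained toy sketch in plain Scala (not Spark MLlib). It fits a one-weight linear model y ≈ w * x by gradient descent and stops as soon as either the loss threshold or the iteration cap is reached; all names and numbers are made up purely for illustration.

// Toy training loop: learn w in y ≈ w * x from three (x, y) samples
val data = Seq((1.0, 2.1), (2.0, 3.9), (3.0, 6.2))

var w = 0.0 // the model parameter (weight), starting from an arbitrary guess
val learningRate = 0.05
val maxIter = 100 // stopping criterion: iteration cap
val tolerance = 0.05 // stopping criterion: acceptable loss

var iter = 0
var loss = Double.MaxValue
while (loss > tolerance && iter < maxIter) {
  // gradient of the mean squared error with respect to w
  val gradient = data.map { case (x, y) => 2 * (w * x - y) * x }.sum / data.size
  w -= learningRate * gradient // one automatic parameter adjustment
  loss = data.map { case (x, y) => math.pow(w * x - y, 2) }.sum / data.size
  iter += 1
}
println(s"Learned w = $w after $iter iterations, final loss = $loss")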

After model training is completed, we usually use a new dataset (testing samples) to test the predictive capability of the model in order to verify the effectiveness of the training. This process is called “model testing”.

At this point, your brain may be overwhelmed by various machine learning terms. Don’t worry, let’s use the example of house price prediction to better understand these concepts.

In the house price prediction project, there are four data files, and “train.csv” is the training data that is used to train the machine learning model. Correspondingly, “test.csv” is the testing data used to verify the effectiveness of our model’s training.

More rigorously, testing data is used to examine the generalization ability of the model, which means we need to know whether the model’s predictive capability is consistent with its performance on the training data for data that the model has never seen before.

The schemas of “train.csv” and “test.csv” are identical, both containing 81 fields, including 79 house attributes, 1 transaction price, and an ID field. In the house price prediction project, our task is to select a data distribution model in advance, then train it on the training data (model training), and after the model parameters converge, use the trained model to evaluate its performance on the testing set.

House Price Prediction #

Theory is never as tangible as practice. Next, let’s use the Spark MLlib machine learning framework to implement the “House Price Prediction” project. As the project progresses, we will tie the concrete implementation back to the basic concepts and common terminology introduced above to deepen our understanding.

Model Selection #

So, which models are available, and which one should we pick for the house price prediction project? The process of picking the right model is what we collectively call “model selection”.

In the field of machine learning, there are many types of models and many ways to classify them. By fitting ability, models are divided into linear and non-linear models; by prediction target, into regression, classification, clustering, and mining; by complexity, into classical algorithms and deep learning; by structure, into generalized linear models, tree models, neural networks, and so on. The categories go on and on.

However, our focus is on introductory machine learning and introductory Spark MLlib, so we will leave the topic of models and algorithms in machine learning to Lesson 24. Here, you only need to know that there is such a thing as “model selection”.

In the “House Price Prediction” project, our prediction target (Label) is the house price, which is a continuous numerical field, so we need a regression model to fit the data. Moreover, among all models, linear models are the simplest. Therefore, following the principle of starting from the basics, in the first version of the implementation let’s choose a linear regression model to fit the linear relationship between house price and house attributes.
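
Concretely, a linear regression model assumes that the price can be approximated by a weighted sum of the house attributes plus an intercept:

$$\text{SalePrice} \approx w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b$$

Here each $x_i$ stands for one house attribute, while the weights $w_i$ and the intercept $b$ are the model parameters that training will determine.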

Data Exploration #

To accurately predict house prices, we need to determine which factors among the attributes related to houses have the greatest impact on house prices. During the model training process, we need to select the factors that have a significant impact and eliminate the interference from factors with a small impact.

Take house prices as an example: the floor area of a house is clearly an important factor, whereas the type of road surface on the street (concrete, asphalt, or brick) matters far less for the price.

In the field of machine learning, the attributes related to the prediction target are collectively referred to as “data features”, and the process of selecting effective features is called “feature selection”. Before doing feature selection, we naturally need to explore the data preliminarily in order to draw conclusions.

The specific exploration process is as follows. First, we use the read API of SparkSession to create a DataFrame from the train.csv file, and then call the show and printSchema functions to observe the sample composition and schema of the data.

Since there are many data fields, it is not convenient to stack the printed data sample and schema in the document. Therefore, I will leave this exploration step for you to try. You can enter the following code into the spark-shell to observe what the data looks like.

import org.apache.spark.sql.DataFrame

val rootPath: String = _ // fill in the directory where you unpacked the data
val filePath: String = s"${rootPath}/train.csv"

// Create a DataFrame from the CSV file; "header" makes the first row supply the column names.
// Since inferSchema is not enabled, every column is loaded as a string for now.
val trainDF: DataFrame = spark.read.format("csv").option("header", true).load(filePath)

trainDF.show
trainDF.printSchema

By observing the data, we will find that the attributes of the houses are very rich, including factors such as the floor area of the house, the number of rooms, the condition of the street surface, the type of house (apartment or villa), infrastructure (water, electricity, gas), the surrounding area (supermarket, hospital, school), the type of foundation (brick or steel), the basement area, the above-ground area, the type of kitchen (open or closed), the garage area and location, the most recent transaction time, and so on.

Data Extraction #

In theory, to select the features that have a major impact on house prices, we need to calculate the correlation between each feature and the house price. However, in the first version of the implementation, we focus on the basic usage of Spark MLlib and do not prioritize the model’s performance.
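
Just for reference, here is roughly what such a correlation check could look like with the DataFrame API. This is a minimal sketch that assumes the trainDF created above (candidateCols and numericDF are illustrative names); the columns need a numeric cast first because the CSV was loaded with all-string columns.

import org.apache.spark.sql.functions.col

// Cast the candidate columns (and the sale price) to double, then print the
// Pearson correlation of each attribute with the sale price
val candidateCols = Seq("LotArea", "GrLivArea", "TotalBsmtSF", "GarageArea")
val numericDF = (candidateCols :+ "SalePrice").foldLeft(trainDF) {
  (df, c) => df.withColumn(c, col(c).cast("double"))
}

candidateCols.foreach { c =>
  val r = numericDF.stat.corr(c, "SalePrice")
  println(f"corr($c, SalePrice) = $r%.3f")
}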

Therefore, let’s keep it simple and select only the numerical features (which are simple and direct, and therefore suitable for beginners): the lot area, the above-ground living area, the basement area, and the garage area, which correspond to the fields “LotArea”, “GrLivArea”, “TotalBsmtSF”, and “GarageArea”. We will leave rigorous feature selection to the next lesson on feature engineering.

The code below keeps these four fields plus the sale price, casts them from string to integer, and then uses a VectorAssembler to combine the four features into a single feature vector:

import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.functions.col
import org.apache.spark.ml.feature.VectorAssembler

// Keep the four numerical features and the target variable, and cast them from string to integer
val typedFields = trainDF
.select("LotArea", "GrLivArea", "TotalBsmtSF", "GarageArea", "SalePrice")
.withColumn("LotAreaInt", col("LotArea").cast(IntegerType)).drop("LotArea")
.withColumn("GrLivAreaInt", col("GrLivArea").cast(IntegerType)).drop("GrLivArea")
.withColumn("TotalBsmtSFInt", col("TotalBsmtSF").cast(IntegerType)).drop("TotalBsmtSF")
.withColumn("GarageAreaInt", col("GarageArea").cast(IntegerType)).drop("GarageArea")
.withColumn("SalePriceInt", col("SalePrice").cast(IntegerType)).drop("SalePrice")

// Assemble the four feature columns into a single feature vector column named "features"
val assembler = new VectorAssembler()
.setInputCols(Array("LotAreaInt", "GrLivAreaInt", "TotalBsmtSFInt", "GarageAreaInt"))
.setOutputCol("features")

val featuresAdded = assembler.transform(typedFields)
.drop("LotAreaInt", "GrLivAreaInt", "TotalBsmtSFInt", "GarageAreaInt")

featuresAdded.printSchema

/** Result
root
 |-- SalePriceInt: integer (nullable = true)
 |-- features: vector (nullable = true) // Note that the field type of 'features' is Vector
*/

With the feature vector assembled, we now have the training samples needed for model training. Each sample contains two parts: the feature vector “features” and the target variable “SalePriceInt”.

Next, we split the training samples proportionally into two parts: one part for model training and the remaining part for initial model validation.

// Split the training samples into a training set and a validation set
val Array(trainSet, testSet) = featuresAdded.randomSplit(Array(0.7, 0.3))

Model Training #

With the training samples prepared, we can now use Spark MLlib to build a linear regression model. In fact, building and training a model using Spark MLlib is very simple and straightforward, requiring only three steps.

The first step is to import the relevant model libraries. In Spark MLlib, the linear regression model is implemented by the LinearRegression class. The second step is to create an instance of the model and specify the necessary information for model training. The third step is to call the fit function of the model, providing the training dataset, to start training.

import org.apache.spark.ml.regression.LinearRegression

// Build a linear regression model and specify the feature vector, target variable, and number of iterations
val lr = new LinearRegression()
.setLabelCol("SalePriceInt")
.setFeaturesCol("features")
.setMaxIter(10)

// Train the linear regression model using the training set trainSet
val lrModel = lr.fit(trainSet)

In the second step, we first create an instance of LinearRegression, and then use the setLabelCol and setFeaturesCol functions to specify the target variable field and feature vector field, i.e., “SalePriceInt” and “features”, respectively. Next, we call the setMaxIter function to specify the number of iterations for model training.

Here, it is necessary to explain the concept of iterations. In the previous introduction to machine learning, we mentioned that model training is a continuous process, where the training process scans the same data repeatedly and updates the parameters (also known as weights) in the model. This iterative process continues until the model’s prediction performance meets certain criteria, and then the training can be stopped.

Regarding the criteria, there are two aspects to consider. One aspect is the requirement for prediction error. When the prediction error of the model is smaller than a pre-determined threshold, the model iteration can converge and training can be stopped. The other aspect is the requirement for the number of iterations. Regardless of the prediction error, as long as the pre-determined number of iterations is reached, the model training is considered complete.
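
In Spark MLlib’s LinearRegression, these two criteria map onto two model parameters: setMaxIter caps the number of iterations, while setTol sets a convergence tolerance that serves as the counterpart of the error-based criterion (iteration stops once further updates improve the solution by less than the tolerance). Training ends as soon as either condition is met. A minimal sketch, with lrWithTol as an illustrative name:

import org.apache.spark.ml.regression.LinearRegression

val lrWithTol = new LinearRegression()
.setLabelCol("SalePriceInt")
.setFeaturesCol("features")
.setMaxIter(10) // criterion two: at most 10 iterations
.setTol(1E-6) // criterion one: stop once further improvement falls below this tolerance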

At this point, you might be wondering: “There are some new concepts again, model iteration, model parameters… What exactly is the training process of a model?” To help you better understand model training, let me give you a real-life example.

In fact, model training in machine learning is no different from using a microwave oven in our daily life. Let’s say we have an old model microwave oven with only two knobs—one for controlling temperature and the other for setting the heating duration.

Now, suppose we want to bake a pie for dinner to satisfy our hunger. We only have one pie for dinner, which sounds a bit unfortunate. However, we still have high expectations for the taste—we want a pie that is crispy on the outside and tender on the inside.

Image

As the figure above shows, with zero cooking experience, the only way for us to get a perfectly baked pie is to prepare crust after crust, put each one into the microwave oven, and keep trying different combinations of temperature and duration until we finally bake a pie that is crispy on the outside and tender on the inside. Only then have we found the optimal combination of temperature and duration.

After determining the successful combination of temperature and duration, when we need to bake other similar foods (such as meat patties or pizza) again, we can simply put them in the microwave oven and press the start button.

Model training is similar. We repeatedly “feed” the model algorithm with training data, adjust the model parameters repeatedly, until the prediction error is reduced to a certain range or the model iteration reaches a certain number, indicating the end of training. When there is new data to predict, we feed it to the trained model and the model generates a prediction result.

However, unlike manually adjusting the “temperature” and “duration” knobs repeatedly, the adjustment of model weights often relies on an optimization algorithm called “Gradient Descent”. In each iteration of the model, the Gradient Descent algorithm automatically adjusts the model weights without human intervention. We will delve into this optimization algorithm in the 24th lecture on model training.

From the pie-baking example above, the analogy with model training is not hard to see: the pie crust is the training data, the microwave oven is the model algorithm, the temperature and duration are the model parameters, the gap between the actual taste and the expected taste is the prediction error, and the number of baking attempts is the number of iterations. I have summarized this comparison in the diagram below.

Image

After familiarizing yourself with the basic concepts of model training, let’s review the linear regression training code we just wrote. Beyond the three setXXX functions used here, you can refer to the development API on the official website for the full set of options available when defining the model. Once the model is defined, we complete the training process by calling the fit function.

import org.apache.spark.ml.regression.LinearRegression

// Build a linear regression model, specify feature vector, target variable, and number of iterations
val lr = new LinearRegression()
.setLabelCol("SalePriceInt")
.setFeaturesCol("features")
.setMaxIter(10)

// Train the linear regression model using the training set trainSet
val lrModel = lr.fit(trainSet)
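
Once fit returns, we can also peek at what was actually learned. The trained model exposes its fitted weights and intercept, which is a quick way to connect the code back to the notion of “model parameters”.

// Inspect the learned model parameters (one weight per feature) and the intercept
println(s"Coefficients: ${lrModel.coefficients}")
println(s"Intercept: ${lrModel.intercept}")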

Model Evaluation #

After training the model, we need to validate and evaluate its performance to determine whether the model is “good” or “bad”. It is just like when the pie is baked, we need to taste it ourselves to see if its taste matches our expected flavor.

First, let’s take a look at how the model performs on the training dataset. In the evaluation of linear regression models, we have many indicators to quantify the prediction error of the model. Among them, the most representative one is RMSE (Root Mean Squared Error), which calculates the square root of the average of the squared errors. We can obtain the evaluation indicators of the model on the training dataset by calling the summary function on the model, as shown below.
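
Written out, RMSE has the standard form below, where $\hat{y}_i$ is the predicted price of sample $i$, $y_i$ is its actual price, and $N$ is the number of training samples:

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2}$$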

val trainingSummary = lrModel.summary
println(s"RMSE: ${trainingSummary.rootMeanSquaredError}")

/** Result
RMSE: 45798.86
*/

In the training data, house prices range from 34,900 to 755,000, so a prediction error of 45,798.86 is quite large. This indicates that the model does not even fit the training data well; in other words, the trained model is in an “underfitting” state.
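
As a quick sanity check, we could also compute the same metric on the validation set testSet that we split off earlier; for an underfitting model, the error there will typically be at least as large. A minimal sketch using the trained model’s evaluate method:

// Score the held-out validation set with the trained model and report its RMSE
val validationSummary = lrModel.evaluate(testSet)
println(s"Validation RMSE: ${validationSummary.rootMeanSquaredError}")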

This is actually easy to understand. On the one hand, our model is too simple, and the fitting capacity of linear regression itself is limited.

On the other hand, in terms of data, we currently use only four fields (LotAreaInt, GrLivAreaInt, TotalBsmtSFInt, GarageAreaInt). Many factors influence house prices, and it is hard to predict them accurately from just four property attributes. Therefore, in the upcoming lectures, we will keep studying how feature engineering and model selection affect the model’s fitting capacity.

Faced with this underfitting situation, we naturally need to further debug and optimize this model. In the subsequent lectures, we will improve our “house price prediction” model step by step from the perspectives of feature engineering and model tuning. Let’s wait and see!

Key Review #

Today’s content is quite extensive, so let’s summarize together. In today’s lecture, we mainly focused on the “house price prediction” mini-project, introducing the basic concepts of machine learning and how to use the Spark MLlib framework to complete machine learning development.

Image

Firstly, you need to understand what kind of computational process machine learning is: given training data (training samples), a prior data distribution model (the model) is selected, and the model parameters (model weights/parameters) are then continuously and automatically adjusted by optimization algorithms (learning algorithms), so that the model approaches the original distribution of the training data ever more closely.

Next, under the Spark MLlib sub-framework, you need to grasp the basic process and key steps of machine learning development. I have organized these steps in the following table for you to review at any time.

Image

In today’s lecture, we used a combination of “fundamental knowledge of machine learning” and “Spark MLlib development process” to explain both machine learning itself and the Spark MLlib sub-framework. For students with a weaker background in machine learning, learning today’s content might be a bit challenging.

However, there is no need to worry. In the subsequent lectures, we will gradually fill in the gaps left in this lecture, striving to enable you to systematically master the development methods and standard practices of machine learning.

Daily Practice #

Please organize all the code from loading data to model training and evaluation in order as discussed in this lecture. Then, download the training data from the “House Prices - Advanced Regression Techniques” competition project on Kaggle (a data science competition platform), and complete the entire process from data loading to model training.

Feel free to interact with me in the comments section, and I also encourage you to share this lecture with more colleagues and friends. Let’s all try our hands at the entire process from data loading to model training!