
27 Model Training II - Detailed Explanation of Regression, Classification, and Clustering Algorithms #

Hello, I’m Wu Lei.

In the previous lecture, we learned about the decision tree algorithms, including decision trees, GBDT, and random forests. In today’s lecture, let’s see how to apply these algorithms to practical scenarios using the Spark MLlib framework.

Do you still remember the "panorama" of Spark MLlib model algorithms we introduced earlier? We will come back to it often: on the one hand it gives us a global view, and on the other hand it lets us map what we have already learned onto it and see our progress clearly.

Image

In today's lecture, we will use the housing prediction scenario to learn how to use typical regression, classification, and clustering algorithms in the Spark MLlib framework. Once you master these usages, you can flexibly and efficiently pick an algorithm from the corresponding family for the same type of machine learning problem (regression, classification, or clustering).

Housing Prediction Scenarios #

In this scenario, we have three tasks: house price prediction, housing classification, and housing clustering. House price prediction is something we are all familiar with; in our previous studies, we have been working to make its predictions more and more accurate.

Housing classification means assigning each house to one of a set of discrete labels (for example "OverallQual", which describes the overall quality of a house) based on its attribute features. The labels could just as well be coarser categories such as "good, medium, poor" based on house quality.

Housing clustering refers to grouping similar houses together based on the house feature vectors and the concept of “birds of a feather flock together,” without the use of any labels.

House Price Prediction #

In the two lectures on feature engineering, we kept trying to fit house prices with linear models, whose fitting power is quite limited. Tree-based models, by contrast, are nonlinear and have stronger fitting power. After the previous explanations, you should already be familiar with the "routine" of model training under the Spark MLlib framework, which can basically be divided into three stages:

  • Preparing training samples
  • Defining the model and fitting the training data
  • Validating the model’s effectiveness

Except for the model definition, the first and third stages are actually quite generic: regardless of which model we use, the training samples are more or less the same, and the evaluation metrics (whether RMSE for regression or AUC for classification) are also model-independent. Therefore, in today's lecture we will focus on the second stage. As for code, we will only show the code for this stage here; for the other stages you can refer back to the two lectures on feature engineering.

In the previous lecture, we learned about the decision tree series models and their derivative algorithms, namely random forest and GBDT. Both of these algorithms can be used to solve classification problems as well as regression problems. Since GBDT is good at fitting residuals, let’s use it to solve the house price prediction (regression) problem, and leave the random forest algorithm for the housing classification later on.

To use GBDT to fit house prices, we first need to prepare the training samples.

import org.apache.spark.ml.feature.VectorAssembler

// numericFields holds the numeric fields; indexFields holds the non-numeric fields already encoded by StringIndexer
val assembler = new VectorAssembler()
.setInputCols(numericFields ++ indexFields)
.setOutputCol("features")

// create the feature vector "features"
engineeringDF = assembler.transform(engineeringDF)

import org.apache.spark.ml.feature.VectorIndexer

// differentiate between discrete features and continuous features
val vectorIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
// set the differentiation threshold
.setMaxCategories(30)

// complete the data transformation
engineeringDF = vectorIndexer.fit(engineeringDF).transform(engineeringDF)

We have already learned about the usage of VectorAssembler, which combines multiple fields into a single feature vector. You may have noticed that after VectorAssembler, we further transformed engineeringDF with a new feature transformer called VectorIndexer. What is it for?

Simply put, it helps the decision tree family of algorithms (such as GBDT and random forest) tell discrete features apart from continuous features. Continuous features are numeric features whose values carry a natural order. Discrete features (such as street type), on the other hand, acquire an ordering that never existed among the original categories once StringIndexer turns them into numbers (you can refer back to Lecture 25 for details).

How do we fix this? For the discrete features already processed by StringIndexer, VectorIndexer re-encodes them into category indices and marks them as categorical in the output column's metadata, thereby clearly informing GBDT and the other tree algorithms that these features are discrete and that their numeric values are independent of one another, with no ordering between them.

The setMaxCategories method of the VectorIndexer object sets the threshold that separates discrete features from continuous features; here we set it to 30. What does this threshold do? Features with more than 30 distinct values will be treated as continuous by the downstream GBDT model, while features with at most 30 distinct values will be treated as discrete (categorical).

At this point you may ask, "Is it really so important to distinguish continuous from discrete features? Is all this trouble necessary?" Remember the concept of "purity" from the basic principles of decision trees? For the same data and the same feature, the purity gains obtained by treating its values as continuous versus discrete can differ dramatically. Letting each feature be handled according to its true type lays a solid foundation for constructing a reasonable decision tree.
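
If you want to see how this threshold plays out on your own data, the fitted VectorIndexerModel exposes a categoryMaps member listing exactly which feature positions it decided to treat as categorical. The snippet below is a minimal sketch that reuses the vectorIndexer and engineeringDF defined above (it fits the indexer once more purely for illustration); the variable names are illustrative.

// Fit the indexer to inspect its decisions (for illustration only)
val indexerModel = vectorIndexer.fit(engineeringDF)

// categoryMaps: feature position -> (original value -> category index);
// only features with at most 30 distinct values appear here, the rest are left as continuous
val categoricalPositions = indexerModel.categoryMaps.keys.toSeq.sorted
println(s"Feature positions treated as categorical: ${categoricalPositions.mkString(", ")}")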

Okay, after preparing the samples, next, we will define and fit the GBT model.

import org.apache.spark.ml.regression.GBTRegressor

// Define the GBT model
val gbt = new GBTRegressor()
.setLabelCol("SalePriceInt")
.setFeaturesCol("indexedFeatures")
// Set the maximum depth of each tree
.setMaxDepth(5)
// Set the maximum number of trees in the GBT model
.setMaxIter(30)

// Split the data into training set and test set
val Array(trainingData, testData) = engineeringDF.randomSplit(Array(0.7, 0.3))

// Fit the training data
val gbtModel = gbt.fit(trainingData)

As you can see, we use GBTRegressor to define the GBT model. The methods setLabelCol and setFeaturesCol are ones we have used many times before, so I won't repeat them. It's worth noting that setMaxDepth and setMaxIter help keep the GBT model from overfitting: the former limits the maximum depth of each tree, and the latter limits the total number of decision trees (boosting iterations) in the model. As usual, training is done by calling the model's fit method.

So far, we have introduced how to fit house prices using the defined GBT model. For the evaluation of the model, I encourage you to refer to the model validation section in Lesson 23 and try it yourself. Keep up the good work!
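
To get you started, here is a minimal sketch of that validation step, using Spark MLlib's RegressionEvaluator with RMSE as the metric. It reuses the gbtModel, testData, and the "SalePriceInt" label column from above; the variable names are just illustrative.

import org.apache.spark.ml.evaluation.RegressionEvaluator

// Make inferences on the test set
val gbtPredictions = gbtModel.transform(testData)

// Define the evaluation object for regression problems
val regEvaluator = new RegressionEvaluator()
.setLabelCol("SalePriceInt")
.setPredictionCol("prediction")
.setMetricName("rmse")

// Calculate RMSE on the test set inference results
val rmse = regEvaluator.evaluate(gbtPredictions)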

House Classification #

Next, let’s talk about house classification. As we know, in the “House Prices - Advanced Regression Techniques” competition, there are a total of 79 fields in the dataset. Previously, we have always treated the sale price (SalePrice) as the target label and used other fields to construct the feature vector.

Now, let’s switch perspectives and take the OverallQual field as the label, while using the sale price (SalePrice) as an ordinary field to participate in constructing the feature vector. In the house price prediction dataset, the house quality (OverallQual) is a discrete feature with 10 possible values, as shown in the figure below.

Image

In this way, we have transformed the previous regression problem into a classification problem. However, regardless of the machine learning problem, the model training cannot be separated from the following three steps:

  • Preparing training samples
  • Defining the model and fitting the training data
  • Validating the model’s performance

In terms of preparing training samples, in addition to replacing the target label with OverallQual, we can completely reuse the code we just used to predict house prices with GBT.

// Label field: "OverallQual"
val labelField: String = "OverallQual"

import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.functions.col
engineeringDF = engineeringDF
.withColumn("indexedOverallQual", col(labelField).cast(IntegerType))
.drop(labelField)

Next, we can define the random forest model and fit the training data. Except for the class name and the parameter that controls the number of trees (setNumTrees instead of setMaxIter), the usage of RandomForestClassifier is almost the same as that of GBTRegressor, as shown in the code snippet below.

import org.apache.spark.ml.classification.RandomForestClassifier

// Define the random forest model
val rf = new RandomForestClassifier()
// The label is no longer the house price, but the housing quality
.setLabelCol("indexedOverallQual")
.setFeaturesCol("indexedFeatures")
// Limit the maximum depth of each tree
.setMaxDepth(5)
// Limit the total number of trees in the forest
.setNumTrees(30)

// Differentiate training set and test set
val Array(trainingData, testData) = engineeringDF.randomSplit(Array(0.7, 0.3))

// Fit the training data
val rfModel = rf.fit(trainingData)

After training the model, in the third step, we will perform preliminary validation of the model.

It should be noted that regression and classification problems are measured with different sets of metrics. Regression problems predict continuous values, so we usually evaluate them with some form of error (such as RMSE, MAE, or MAPE). Classification problems predict discrete values, so we usually use metrics that assess how well samples are assigned to the correct class, such as accuracy, precision, and recall.

Image

Here, taking accuracy as an example, we evaluate the fitting effect of the random forest model. The code is shown below.

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Make inferences on the training set
val trainPredictions = rfModel.transform(trainingData)

// Define the evaluation object for classification problems
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedOverallQual")
.setPredictionCol("prediction")
.setMetricName("accuracy")

// Calculate the accuracy metric on the training set inference results
val accuracy = evaluator.evaluate(trainPredictions)
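
Accuracy on the training set only tells part of the story. As a minimal sketch, the snippet below reuses the rfModel, testData, and evaluator defined above to score the test set and to compute a few more metrics supported by MulticlassClassificationEvaluator simply by switching setMetricName; the variable names are illustrative.

// Make inferences on the test set
val testPredictions = rfModel.transform(testData)

// Reuse the same evaluator and switch the metric name
val testAccuracy = evaluator.setMetricName("accuracy").evaluate(testPredictions)
val f1 = evaluator.setMetricName("f1").evaluate(testPredictions)
val weightedPrecision = evaluator.setMetricName("weightedPrecision").evaluate(testPredictions)
val weightedRecall = evaluator.setMetricName("weightedRecall").evaluate(testPredictions)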

Okay, so far, we have used house price prediction and housing classification as examples to introduce how to deal with regression problems and classification problems in the Spark MLlib framework. Classification and regression are the two most typical types of model algorithms in supervised learning, and we must be familiar with and master them. Next, let’s take housing clustering as an example to talk about unsupervised learning.

Housing Clustering #

In contrast to supervised learning, unsupervised learning refers to machine learning problems in which there is no label in the data samples.

Taking house data as an example, the entire dataset contains 79 fields. If we remove the “SalePrice” and “OverallQual” fields, then the original dataset becomes data samples without labels. You may wonder, “What can we do with these unlabeled samples?”

In fact, there is plenty we can do. Based on the house data, we can follow the idea of "birds of a feather flock together" and use the K-means algorithm to cluster the houses. Moreover, in the next lecture on movie recommendation, we will also use frequent itemset mining to dig out co-occurrence frequencies and association rules between different movies, thereby powering recommendations.

Today, let's first talk about K-means. Based on the feature vectors of the data samples and the relative distances between those vectors, the K-means algorithm divides all samples into K groups; that is also where the "K" in its name comes from. For example, if each point in the figure below represents a vector, the clustering result of K-means will vary with different values of K.

Image

In the Spark MLlib development framework, it is easy to perform clustering on any vectors.

First, in the first stage of model training, we need to prepare the training samples. Note that this time we remove the “SalePrice” and “OverallQual” fields.

import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
// numericFields contain continuous features, oheFields are one-hot encoded discrete features
.setInputCols(numericFields ++ oheFields)
.setOutputCol("features")

Next, in the second stage, we define the K-means model and train the model using the prepared samples. As you can see, the model definition is very simple, just instantiate a KMeans object and specify the value of K using setK.

import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans().setK(20)

val Array(trainingSet, testSet) = engineeringDF
.select("features")
.randomSplit(Array(0.7, 0.3))

val model = kmeans.fit(trainingSet)

Here, we are dividing different houses into 20 different categories. After training, we also need to evaluate the model’s performance. Since the data samples do not have labels, the evaluation metrics used for regression and classification are not suitable for unsupervised learning algorithms like K-means.

The design philosophy of K-means is "like attracts like": vectors in the same cluster should be close to each other, while vectors in different clusters should be as far apart as possible. Therefore, we can use distance-based metrics to quantify how good a K-means model is. In Spark MLlib, this is what ClusteringEvaluator does: by default it computes the silhouette score, which is based on the squared Euclidean distance between vectors.

import org.apache.spark.ml.evaluation.ClusteringEvaluator

val predictions = model.transform(trainingSet)

// Define a clustering evaluator
val evaluator = new ClusteringEvaluator()

// Compute the silhouette score of the clustering (based on squared Euclidean distance by default)
val silhouette = evaluator.evaluate(predictions)
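
The value K = 20 above is just one choice; a common practice is to scan a few candidate values of K and compare their silhouette scores. The snippet below is a minimal sketch along those lines, reusing the trainingSet and the evaluator defined above; the candidate list is purely illustrative.

// Scan a few candidate values of K and compare their silhouette scores
val candidateKs = Seq(5, 10, 20, 50)

candidateKs.foreach { k =>
  val kmModel = new KMeans().setK(k).fit(trainingSet)
  val score = evaluator.evaluate(kmModel.transform(trainingSet))
  println(s"K = $k, silhouette = $score")
}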

Well, up to this point, we have used the unsupervised learning algorithm K-means to divide houses into different groups based on their feature vectors. However, note that the clusters produced this way carry no inherent real-world meaning; for example, they do not represent house quality or a house rating. If that is the case, why bother with K-means at all?

Although the K-means results have no direct real-world meaning, they quantitatively capture the similarities and differences between houses. You can think of it this way: with K-means we have generated new features for the houses, and compared with the raw house attributes, such generated features often correlate more strongly with the prediction target (such as house price or house category). By letting these generated features participate in supervised learning, we can hope to further improve the supervised model's performance.
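
To make that idea concrete, here is a minimal sketch of one way to do it, assuming the fitted KMeansModel (model) and the engineeringDF with its "features" column from this section. It attaches the cluster id as a new column and assembles it together with the original feature vector; the column names "clusterId" and "augmentedFeatures" are purely illustrative, and in practice you might prefer to encode the cluster id as a categorical feature rather than a raw number.

import org.apache.spark.ml.feature.VectorAssembler

// Assign each house to a cluster; the "prediction" column holds the cluster id
val clusteredDF = model.transform(engineeringDF)
.withColumnRenamed("prediction", "clusterId")

// Assemble the original feature vector and the generated cluster id into an augmented feature vector
val augmentedAssembler = new VectorAssembler()
.setInputCols(Array("features", "clusterId"))
.setOutputCol("augmentedFeatures")

val augmentedDF = augmentedAssembler.transform(clusteredDF)
// augmentedDF can now be fed to a supervised model, e.g. a GBTRegressor with .setFeaturesCol("augmentedFeatures")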

Image

Alright, up to this point, combining the examples of house price prediction, house classification, and house clustering, we have successfully covered the three types of model algorithms: regression, classification, and clustering. Congratulations! We are just one step away from completing the Spark MLlib model algorithm journey. In the next lesson, we will continue learning two interesting model algorithms: collaborative filtering and frequent itemsets, using the scenario of movie recommendations.

Key Review #

In today’s lecture, you need to first understand the basic principles of the K-means algorithm. The design concept of clustering is “birds of a feather flock together”, and given any set of vectors, K-means can divide it into K subsets to complete the clustering.

The computation of K-means relies mainly on the relative distances between vectors. On the one hand, its results can be used directly to segment a population into groups; on the other hand, they can serve as generated features that participate in supervised learning.

In addition, you need to understand the general usage of GBTRegressor and RandomForestClassifier. In both, setLabelCol and setFeaturesCol specify the model's target label and feature vector, while setMaxDepth limits the maximum depth of each tree. The number of trees is controlled by setMaxIter (the number of boosting iterations) for GBTRegressor and by setNumTrees for RandomForestClassifier; both hyperparameters help keep the models from overfitting.

Practice After Each Lesson #

For the two scenarios of house price prediction and house classification, do you think there is a necessity and possibility of reusing code (especially the code in the feature engineering part) between them?

Feel free to interact with me in the comments section, and I also recommend that you share the content of this lesson with more colleagues and friends.