25 Feature Engineering II - What Common Feature Processing Functions Are There (continued) #

Hello, I am Wu Lei.

In the previous lecture, we mentioned that typical feature engineering consists of several steps, including preprocessing, feature selection, normalization, discretization, embedding, and vector computation, as shown in the figure below.

[Figure: the steps of feature engineering: preprocessing, feature selection, normalization, discretization, Embedding, and vector computation]

In the previous lecture, we focused on the first three steps: preprocessing, feature selection, and normalization. In today’s lecture, we will continue with the remaining three: discretization, Embedding, and vector computation.

Feature engineering is of the utmost importance in machine learning, and the patience you invest in it will pay off. At the end of this lecture, I will also compare the performance of models trained with six successive feature engineering optimizations, to help you better understand the role and effect of each step.

Feature Engineering #

[Figure: the feature engineering steps laid out as game “levels”]

In the previous lecture, we completed “Level 3”: normalization. Therefore, next, let’s start with “Level 4”: discretization.

Discretization: Bucketizer #

Similar to normalization, discretization also operates on numerical fields. It breaks originally continuous values into discrete intervals, thereby reducing the diversity (cardinality) of the original data. Take the field “BedroomAbvGr”, which records the number of bedrooms: in the dataset train.csv, it takes integer values from 1 to 8.

Now, based on the number of bedrooms, we divide the houses into small, medium, and large sizes.

[Figure: bedroom counts mapped to small, medium, and large house types]

It is not difficult to see that after discretizing “BedroomAbvGr”, the diversity of the data is reduced from the original 8 to 3. Now, the question is, why do we need to discretize the original continuous data? The main motivation for discretization is to enhance the distinguishability and coherence of the feature data, thus producing a stronger association with the prediction target.

Taking “BedroomAbvGr” as an example, we believe that a one-bedroom house and a two-bedroom house do not differ much in their impact on price, and likewise for three-bedroom and four-bedroom houses.

However, there is often a jump in price between small and medium-sized houses, and again between medium and large ones. In other words, compared with the raw bedroom count, the house-size category has a stronger and more distinguishable effect on price. That is exactly why we discretize “BedroomAbvGr”: to strengthen its association with the prediction target.

Now, under the Spark MLlib framework, how should we perform discretization? Similar to other steps, Spark MLlib provides multiple discretization functions, such as Binarizer, Bucketizer, and QuantileDiscretizer. Let’s take Bucketizer as an example and demonstrate the specific usage of discretization with the field “BedroomAbvGr”. As usual, let’s start with the code.

// Original field
val fieldBedroom: String = "BedroomAbvGrInt"
// Target field containing discretized data
val fieldBedroomDiscrete: String = "BedroomDiscrete"
// Specify the split points; they define the intervals (-Inf, 3), [3, 5), and [5, +Inf), i.e., 1-2, 3-4, and 5 or more bedrooms
val splits: Array[Double] = Array(Double.NegativeInfinity, 3, 5, Double.PositiveInfinity)

import org.apache.spark.ml.feature.Bucketizer

// Define and initialize the Bucketizer
val bucketizer = new Bucketizer()
// Specify the input column
.setInputCol(fieldBedroom)
// Specify the output column
.setOutputCol(fieldBedroomDiscrete)
// Specify the discretization intervals
.setSplits(splits)

// Call transform to complete the discretization transformation
engineeringData = bucketizer.transform(engineeringData)

As you can see, the feature processing functions provided by Spark MLlib all follow a similar usage pattern. First, we create a Bucketizer instance, pass the numerical field BedroomAbvGrInt to setInputCol, and use setOutputCol to specify the new field BedroomDiscrete that will hold the discretized data.

Discretization breaks continuous values into discrete ones, but the bucket boundaries have to be specified by us through setSplits. They are given by the floating-point array splits, which carves the range from negative infinity to positive infinity into the intervals (-Inf, 3), [3, 5), and [5, +Inf), corresponding to 1-2, 3-4, and 5 or more bedrooms. Finally, we call Bucketizer’s transform function to discretize engineeringData.

The comparison of data before and after discretization is shown in the following figure.

[Figure: the BedroomAbvGrInt column before and after discretization]

Well, that’s it. Using Bucketizer as an example, we have learned how to discretize data in Spark MLlib and cleared the fourth level of feature engineering.
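
By the way, earlier we also mentioned QuantileDiscretizer. If you would rather let Spark derive the split points from the data distribution instead of hand-picking them, it does exactly that. Here is a minimal sketch under the same assumptions as the code above; numBuckets = 3 and the output column name BedroomQuantile are just illustrative choices.

import org.apache.spark.ml.feature.QuantileDiscretizer

// Let Spark compute the bucket boundaries from approximate quantiles of the data
val quantileDiscretizer = new QuantileDiscretizer()
  .setInputCol(fieldBedroom)
  .setOutputCol("BedroomQuantile")
  .setNumBuckets(3)

// QuantileDiscretizer is an Estimator: fit scans the data, then transform applies the buckets
val bedroomQuantileDemo = quantileDiscretizer.fit(engineeringData).transform(engineeringData)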

[Figure: feature engineering level progress after clearing discretization]

Embedding #

In fact, Embedding is a very large topic. With the development of machine learning and artificial intelligence, the methods of Embedding are constantly changing and evolving. From the basic one-hot encoding to PCA dimensionality reduction, from Word2Vec to Item2Vec, from matrix factorization to deep learning-based collaborative filtering, various methods and models have emerged. Some scholars even proposed, “Everything can be Embedding”. So, what exactly is Embedding?

Embedding is an English term; if we had to pick a Chinese equivalent, I think “向量化” (vectorization) fits best. Embedding maps a data set into a vector space and thereby vectorizes the data. If that sounds a bit abstract, here is a plainer way to put it: the goal of Embedding is to find a set of suitable vectors to describe an existing data set.

Take the GarageType field as an example: it has 6 distinct values, which means there are 6 garage types in total. So how can we represent these 6 strings numerically? After all, models can only consume numbers, not strings.

[Figure: the six distinct values of the GarageType field]

One approach is to use StringIndexer in the preprocessing stage to convert the strings into consecutive integers and let the model consume those integers. In principle there is nothing wrong with that, but from the model’s point of view, representing categories as plain integers is unreasonable. Why do I say that?

We know that there is a comparison relationship between consecutive integers, such as 1 < 3, 6 > 5, and so on. But there is no size relationship between the original strings, such as “Attchd” and “Detchd”. If we forcibly use 0 to represent “Attchd” and 1 to represent “Detchd”, there will be a logical contradiction of “Attchd” < “Detchd”.

Therefore, the StringIndexer in the preprocessing stage only converts strings into numbers, and the resulting numbers cannot be directly fed to the model for training. We need to further vectorize these numbers so that they can be consumed by the model. So, how should we vectorize the numbers output by the StringIndexer? This is where Embedding comes into play.

As this is an introductory lesson, let’s start with the simplest Embedding method, one-hot encoding, and master its basic usage. Rather than defining the concept abstractly, we will continue with the GarageType example, which makes it much easier to understand.

[Figure: one-hot encoding of the six GarageType values]

First, through StringIndexer, we map the 6 values of GarageType to 6 numbers ranging from 0 to 5. Next, using One Hot Encoding, we convert each number into a vector.

The dimension of the vector is 6, which is consistent with the diversity (cardinality) of the original field (GarageType). In other words, the dimension of the one-hot encoded vector is equal to the number of values in the original field.

By carefully observing the six vectors in the above figure, we can see that only one dimension has a value of 1, and all other dimensions are 0. The dimension with a value of 1 is consistent with the index output by the StringIndexer. For example, the string “Attchd” is mapped to 0 by the StringIndexer, and the corresponding one-hot vector is [1, 0, 0, 0, 0, 0]. The dimension with index 0 in the vector has a value of 1, while all other dimensions have a value of 0.
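To make this mapping concrete, here is a minimal sketch that encodes only the GarageType column. It assumes Spark 3.x (where OneHotEncoder is an Estimator) and a DataFrame engineeringData that already contains the GarageTypeIndex column produced by StringIndexer in the previous lecture; the output column name GarageTypeVec is made up for this demo. Note that OneHotEncoder drops the last category by default, so we call setDropLast(false) here to obtain the full 6-dimensional vectors shown in the figure, stored in Spark’s sparse vector format.

import org.apache.spark.ml.feature.OneHotEncoder

// Encode the single index column GarageTypeIndex into a demo vector column
val garageEncoder = new OneHotEncoder()
  .setInputCol("GarageTypeIndex")
  .setOutputCol("GarageTypeVec")
  // Keep all 6 dimensions instead of dropping the last category
  .setDropLast(false)

// OneHotEncoder is an Estimator in Spark 3.x: fit first, then transform
val garageDemo = garageEncoder.fit(engineeringData).transform(engineeringData)

// "Attchd" (index 0.0) shows up as the sparse vector (6,[0],[1.0])
garageDemo.select("GarageType", "GarageTypeIndex", "GarageTypeVec").show(5, truncate = false)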

It is not difficult to find that one-hot encoding is a simple and direct method of Embedding, and it can even be described as “simple and crude”. However, in everyday machine learning development, the “simple and crude” one-hot encoding is quite popular. Next, let’s talk about the specific usage of one-hot encoding, starting from the “house price prediction” project.

In the preprocessing phase, we have already used StringIndexer to convert non-numerical fields into index fields. Next, we will use OneHotEncoder to further convert the index fields into vector fields.

import org.apache.spark.ml.feature.OneHotEncoder

// The index fields for the non-numerical features, i.e., the output columns of StringIndexer
// (already defined in the previous lecture; shown here only as a reminder)
// val indexFields: Array[String] = categoricalFields.map(_ + "Index").toArray

// The target fields for one-hot encoding, which are the "output columns" required by OneHotEncoder
val oheFields: Array[String] = categoricalFields.map(_ + "OHE").toArray

// Loop through all the index fields and perform one-hot encoding on them
for ((indexField, oheField) <- indexFields.zip(oheFields)) {
  val oheEncoder = new OneHotEncoder()
    .setInputCol(indexField)
    .setOutputCol(oheField)
  // In Spark 3.x, OneHotEncoder is an Estimator, so fit it before calling transform
  engineeringData = oheEncoder.fit(engineeringData).transform(engineeringData)
}

As you can see, we loop through all the index fields and create an OneHotEncoder instance for each. During initialization, we pass the index field to setInputCol and the target one-hot field to setOutputCol. Finally, we fit the OneHotEncoder and call transform on the fitted model to apply the encoding to engineeringData.

So far, we have learned the usage of embedding in the Spark MLlib framework, represented by OneHotEncoder, and have successfully completed the fifth challenge of feature engineering.

Although there are many other Embedding methods left to explore, from an introductory perspective, OneHotEncoder is enough to get us started on most machine learning applications.
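
As one example beyond one-hot encoding, the PCA dimensionality reduction mentioned at the start of this section is also available in Spark MLlib. The sketch below is only illustrative: it assumes some DataFrame vectorDF with an existing vector column named "features", and k = 3 is an arbitrary choice.

import org.apache.spark.ml.feature.PCA

// Project an existing vector column onto its top 3 principal components
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)

// PCA is an Estimator: fit learns the principal components, transform applies the projection
// val pcaModel = pca.fit(vectorDF)
// val reduced = pcaModel.transform(vectorDF)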

[Figure: feature engineering level progress after clearing Embedding]

Vector Computation #

After completing the fifth challenge, there is one last level in the “game” of feature engineering: vector computation.

Vector computation, as the final step of feature engineering, is mainly used to construct feature vectors in training samples. In the Spark MLlib framework, a training sample consists of two parts. The first part is the prediction target (Label), which in the “house price prediction” project is the house price.

The second part is the feature vector, which can be seen as an array of Double elements. Looking back at the feature engineering process diagram, it is easy to see that feature vectors are drawn from a variety of sources: original numerical fields, normalized or discretized numerical fields, vectorized feature fields, and so on.

Spark MLlib provides rich support for vector computation, such as VectorAssembler for integrating feature vectors, VectorSlicer for slicing vectors, and ElementwiseProduct for element-wise multiplication. By flexibly using these functions, we can freely assemble feature vectors and prepare training samples required by models.
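
We will use VectorAssembler below; for the other two, here is a brief, hedged sketch. It assumes some DataFrame vectorDF with a 3-dimensional vector column named "features"; the column names and scaling weights are made up for illustration.

import org.apache.spark.ml.feature.{ElementwiseProduct, VectorSlicer}
import org.apache.spark.ml.linalg.Vectors

// VectorSlicer: extract a sub-vector by index, here the first two dimensions
val slicer = new VectorSlicer()
  .setInputCol("features")
  .setOutputCol("slicedFeatures")
  .setIndices(Array(0, 1))

// ElementwiseProduct: multiply each dimension by a fixed weight (Hadamard product);
// the scaling vector must have the same dimension as the input vectors
val weighter = new ElementwiseProduct()
  .setScalingVec(Vectors.dense(1.0, 0.5, 2.0))
  .setInputCol("features")
  .setOutputCol("weightedFeatures")

// Both are Transformers, so they can be applied directly
// val sliced = slicer.transform(vectorDF)
// val weighted = weighter.transform(vectorDF)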

In the previous steps (preprocessing, feature selection, normalization, discretization, embedding), we attempted various transformations on numerical and non-numerical features in order to explore potential factors that may have a greater impact on the prediction target.

Next, we use VectorAssembler to concatenate all these potential factors together and construct the feature vector, preparing training samples for subsequent model training.

import org.apache.spark.ml.feature.VectorAssembler

/**
Selected numerical features: selectedFeatures
Normalized numerical features: scaledFields
Discretized numerical features: fieldBedroomDiscrete
One-hot encoded non-numerical features: oheFields
*/

val assembler = new VectorAssembler()
  // fieldBedroomDiscrete is a single String, so wrap it in an Array before concatenating
  .setInputCols(selectedFeatures ++ scaledFields ++ Array(fieldBedroomDiscrete) ++ oheFields)
  .setOutputCol("features")

engineeringData = assembler.transform(engineeringData)

After the transformation, the engineeringData DataFrame now contains a new field named “features”, which represents the feature vector of each training sample. Next, we can specify the feature vector and the label using setFeaturesCol and setLabelCol, respectively, to define the linear regression model, as shown in the previous lesson.

import org.apache.spark.ml.regression.LinearRegression

// Define the linear regression model
val lr = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("SalePriceInt")
  .setMaxIter(100)

// Train the model
val lrModel = lr.fit(engineeringData)

// Get the training summary
val trainingSummary = lrModel.summary
// Get the root mean squared error on the training data
println(s"Root Mean Squared Error (RMSE) on train data: ${trainingSummary.rootMeanSquaredError}")

Congratulations! By completing all the levels of feature engineering, you have successfully gone through the whole process. Although there are still many functions and techniques that we haven’t covered, you can use your spare time to learn and master them by following the learning approach I provided in these two lessons. Keep up the good work!

[Figure: all feature engineering levels cleared]

Achievement Reward: Model Performance Comparison #

After learning about the usage of VectorAssembler, you will find that the output of any step in feature engineering can be used to construct feature vectors for model training. In the section introducing feature engineering, we have spent a lot of time explaining the roles and usage of different steps.

You may wonder, “Do these different steps in feature processing really help improve model performance? After all, after all the trouble, we still need to see the model’s effectiveness.”

That’s right, the ultimate goal of feature engineering is to optimize model performance. Next, by feeding the training samples outputted from different steps to the model, we will compare the model performance corresponding to different feature processing methods.

[Table: training RMSE of models built with successive feature engineering optimizations, with code addresses]

The code corresponding to each step can be found at the following addresses:

  1. Optimization Benchmark
  2. Feature Engineering - Optimization 1
  3. Feature Engineering - Optimization 2
  4. Feature Engineering - Optimization 3
  5. Feature Engineering - Optimization 4
  6. Feature Engineering - Optimization 5
  7. Feature Engineering - Optimization 6

As you can see, as feature engineering progresses, the prediction error of the model on the training set becomes smaller and smaller, indicating that the model’s fitting ability is becoming stronger. This also means that feature engineering does help improve model performance.

I have compiled the training code for the different feature engineering steps into the last column, “Code Addresses”. I strongly recommend that you run it yourself to compare the feature processing methods of the different steps and the model performance they produce.

Of course, when evaluating the model’s performance, we should not only focus on its fitting ability, but also pay attention to its generalization ability. Strong fitting ability can only indicate that the prediction error of the model on the training set is small enough, while generalization ability quantifies the prediction error of the model on the test set. In other words, the meaning of generalization ability is how the model performs on a dataset that it has never seen before.

In this lecture, our focus is on feature engineering, so we temporarily ignore the performance of the model on the test set. Starting from the next lecture on model training, we will pay attention to both aspects of the model’s ability: fitting and generalization.
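
As a small preview of what is coming, here is a minimal sketch of how the test-set error could be measured. It reuses engineeringData and the column names from earlier in this lecture; the 80/20 split ratio, the seed, and the variable names are arbitrary choices.

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.LinearRegression

// Hold out 20% of the samples that the model never sees during training
val Array(trainData, testData) = engineeringData.randomSplit(Array(0.8, 0.2), seed = 42L)

// Train on the training split only
val previewModel = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("SalePriceInt")
  .setMaxIter(100)
  .fit(trainData)

// Generalization: RMSE on the held-out test split
val evaluator = new RegressionEvaluator()
  .setLabelCol("SalePriceInt")
  .setPredictionCol("prediction")
  .setMetricName("rmse")

val testRmse = evaluator.evaluate(previewModel.transform(testData))
println(s"RMSE on test data: $testRmse")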

Key Recap #

Alright, that wraps up today’s content. Let’s summarize it together. In today’s lecture, we focused on discretization, Embedding, and vector computation in feature engineering. You need to master the most representative feature processing functions in each of these areas.

So far, we have covered all six categories of feature processing functions involved in Spark MLlib feature engineering. In order to give you an overall understanding of them and allow you to review the functions and their effects at any time, I have organized the characteristics of each category and the processing functions we have discussed into the following table for your reference.

[Table: the six categories of feature processing functions and their representative functions]

Today’s content is quite extensive and will take some time to digest. According to the 80/20 rule, feature engineering often consumes 80% of our time and effort in machine learning practice. Since feature engineering sets the upper limit for the model’s performance, even though its steps are numerous and cumbersome, we must not cut corners here and must take it seriously.

This is also why we split the discussion of feature engineering across two lectures, walking through each step from its purpose to the specific processing functions it involves. Data quality sets the ceiling for model performance, and feature engineering is a long and arduous journey. If we persevere, it will pay off. Let’s keep at it together!

Exercise for Each Lesson #

In conjunction with the previous lesson, can you talk about the differences and similarities among all the feature processing functions we have introduced, such as StringIndexer, ChiSqSelector, MinMaxScaler, Bucketizer, OneHotEncoder, and VectorAssembler?

Please feel free to record your gains and reflections in the comment section, and you are also welcome to share today’s content with more colleagues and friends. It may help them solve problems related to feature engineering.