
24 Feature Engineering I - What Common Feature Processing Functions Are There #

Hello, I’m Wu Lei.

In the previous lecture, we built a simple linear regression model to predict house prices in the state of Iowa, USA. Judging from its performance, the model's predictive ability is quite poor. There are reasons for this: on the one hand, the fitting capacity of linear regression is limited; on the other hand, we used very few features.

To improve performance in our "house price prediction" case, that is, to make more accurate predictions, we need to optimize step by step from two directions: the features and the model itself.

In the field of machine learning, there is a well-known "rule": garbage in, garbage out. It means that if we feed the model "garbage" data, the model's predictions will be "garbage" too. Of course, "garbage" is said in jest; in practice it refers to imperfect feature engineering.

There are many reasons for imperfect feature engineering, such as uneven data quality, low feature distinctiveness, inappropriate feature selection, and so on.

As beginners, we must remember one thing: feature engineering determines the upper limit of model performance, its "ceiling"; model tuning can only keep approaching that ceiling. Therefore, the first step toward better model performance is to do feature engineering well.

To lighten your learning burden, I have divided feature engineering into two parts. I will use the content of two lectures to introduce the complete feature engineering methods under the development framework of Spark MLlib. Generally speaking, we need to learn six categories of feature processing methods. In today’s lecture, we will first learn the first three categories, and then learn the other three categories in the next lecture.

Course Schedule #

Open the Spark MLlib feature engineering page and you will find a seemingly endless list of feature processing functions. Faced with such a long list, a beginner can easily feel lost and not know where to start.

Image

However, don't worry. Based on past experience, I will classify the listed functions from the perspective of the feature engineering workflow, to make them easier to understand.

Image

As shown in the figure, we have a long way to go from the raw data to generating training samples that can be used for model training (this process is also called “feature engineering”). Generally speaking, we can divide the fields in the raw data into numeric and categorical types because they require different processing methods.

In the figure, from left to right, the Spark MLlib feature processing functions can be classified into the following categories:

  • Preprocessing
  • Feature Selection
  • Normalization
  • Discretization
  • Embedding
  • Vector Operations

In addition, Spark MLlib also provides some basic functions for natural language processing (NLP), as shown in the dashed box in the upper left corner of the figure. As an introductory course, this part is not our focus today. If you are interested in NLP, you can learn more details on the official website.

I will explain in detail one representative function from each category (the functions in bold in the figure), combined with the "House Price Prediction" project. The other functions in each category work in much the same way, so as long as you patiently follow along here, you will be able to explore the remaining functions on the official website much more efficiently.

Feature Engineering #

Next, let’s combine the “House Price Prediction” project from the previous lesson to explore the rich and powerful feature processing functions of Spark MLlib.

In the previous lesson, our model only used 4 features, namely “LotArea”, “GrLivArea”, “TotalBsmtSF”, and “GarageArea”. Selecting these 4 features to model means that we made a strong prior assumption: the house price is only related to these 4 property attributes. Obviously, this assumption is not reasonable. As a consumer, when deciding whether to buy a house, we will never rely solely on these 4 property attributes.

The Iowa House Price data provides as many as 79 property attributes, some of which are numerical fields, such as various sizes, areas, quantities, etc., and some are non-numerical fields, such as house type, street type, construction date, foundation type, etc.

Obviously, the house price is determined by multiple attributes among these 79 attributes. The task of machine learning is to first identify these “determining” factors (property attributes), and then quantify the impact of different factors on the house price using a weight vector (model parameters).

Preprocessing: StringIndexer #

Since most models (including the linear regression model) cannot directly "consume" non-numerical data, our first step is to convert the non-numerical fields among the property attributes into numerical ones. In feature engineering, we collectively refer to such basic data transformations as preprocessing.

We can use the StringIndexer provided by Spark MLlib to perform preprocessing. As the name suggests, the role of StringIndexer is to convert the strings in a data column into numerical indexes, on a per-column basis. For example, using StringIndexer, we can convert the strings in the "GarageType" column into numbers, as shown in the following figure.

Image

The usage of StringIndexer is relatively simple and can be divided into three steps:

  • The first step is to instantiate the StringIndexer object.
  • The second step is to specify the input column and output column through setInputCol and setOutputCol.
  • The third step is to call the fit and transform functions to complete the data conversion, as in the minimal sketch right after this list.
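
Putting the three steps together on a single column looks like this. It is only a minimal sketch: df here is a placeholder for any DataFrame that already contains the "GarageType" string column, not a variable from the project.

import org.apache.spark.ml.feature.StringIndexer

// Step 1: instantiate the StringIndexer
val garageIndexer = new StringIndexer()
  // Step 2: specify the input column and the output column
  .setInputCol("GarageType")
  .setOutputCol("GarageTypeIndex")

// Step 3: fit learns the string-to-index mapping, and transform appends the new column
val indexedDF = garageIndexer.fit(df).transform(df)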

Next, we will combine the “House Price Prediction” project from the previous lesson to use StringIndexer to transform all non-numerical fields, in order to demonstrate and learn its usage.

First, we read the house source data and create a DataFrame.

import org.apache.spark.sql.DataFrame

// Here, the underscore "_" is a placeholder, representing the root directory of the data file
val rootPath: String = _
val filePath: String = s"${rootPath}/train.csv"

val sourceDataDF: DataFrame = spark.read.format("csv").option("header", true).load(filePath)

Then, we select all non-numerical fields and use StringIndexer to transform them.

// Import StringIndexer
import org.apache.spark.ml.feature.StringIndexer

// All non-numerical fields, that is, the "input columns" required by StringIndexer
val categoricalFields: Array[String] = Array("MSSubClass", "MSZoning", "Street", "Alley", "LotShape", "LandContour", "Utilities", "LotConfig", "LandSlope", "Neighborhood", "Condition1", "Condition2", "BldgType", "HouseStyle", "OverallQual", "OverallCond", "YearBuilt", "YearRemodAdd", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd", "MasVnrType", "ExterQual", "ExterCond", "Foundation", "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "Heating", "HeatingQC", "CentralAir", "Electrical", "KitchenQual", "Functional", "FireplaceQu", "GarageType", "GarageYrBlt", "GarageFinish", "GarageQual", "GarageCond", "PavedDrive", "PoolQC", "Fence", "MiscFeature", "MiscVal", "MoSold", "YrSold", "SaleType", "SaleCondition")

// The corresponding target index fields for non-numerical fields, that is, the "output columns" required by StringIndexer
val indexFields: Array[String] = categoricalFields.map(_ + "Index").toArray

// Define engineeringDF as a var variable, and all subsequent feature engineering will be performed on this DataFrame
var engineeringDF: DataFrame = sourceDataDF

// Core code: Iterate over all non-numerical fields, define StringIndexer one by one, and complete the conversion from string to numeric index
for ((field, indexField) <- categoricalFields.zip(indexFields)) {

  // Define StringIndexer and specify the input column name and output column name
  val indexer = new StringIndexer()
    .setInputCol(field)
    .setOutputCol(indexField)

  // Use StringIndexer to transform the original data
  engineeringDF = indexer.fit(engineeringDF).transform(engineeringDF)

  // Drop the original non-numerical field column
  engineeringDF = engineeringDF.drop(field)
}

Although the code looks long, we only need to focus on the part related to StringIndexer. We have just introduced the three steps of using StringIndexer; let's map those steps onto the code above to get a more intuitive sense of how it is used in practice.

Image

Let's take the "GarageType" column as an example. First, we initialize a StringIndexer instance; then we pass "GarageType" to its setInputCol function and "GarageTypeIndex" to its setOutputCol function.

Note that GarageType is the original field, which is a data column already included in the engineeringDF DataFrame. GarageTypeIndex is the new column that will be generated by the StringIndexer, and it is not currently included in the engineeringDF.

Finally, on the StringIndexer we call the fit and transform functions in sequence to generate the output column. Both take the DataFrame to be transformed as their argument: fit learns the string-to-index mapping from the data and returns a fitted model, and that model's transform then appends the index column. In our example, this DataFrame is engineeringDF.

After the transformation, you will find that engineeringDF has a new data column called GarageTypeIndex. This column contains the numerical index values corresponding to the values in the GarageType column, as shown below.

engineeringDF.select("GarageType", "GarageTypeIndex").show(5)

/** Result printed
+----------+---------------+
|GarageType|GarageTypeIndex|
+----------+---------------+
|    Attchd|            0.0|
|    Attchd|            0.0|
|    Attchd|            0.0|
|    Detchd|            1.0|
|    Attchd|            0.0|
+----------+---------------+
only showing top 5 rows
*/

As you can see, after the transformation, all occurrences of "Attchd" in the GarageType column are mapped to 0, and all "Detchd" are mapped to 1. The remaining strings, such as "CarPort" and "BuiltIn", are likewise converted to their corresponding index values. (By default, StringIndexer assigns indexes by descending label frequency, so the most common value gets index 0.)

To apply similar transformations to all non-numerical fields in the DataFrame, we use a for loop for iteration. You can try running the complete code above yourself and further verify that the transformations of other fields (except GarageType) are also as expected.
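
For example, you can spot-check a few of them with something like the following (a sketch; the column names simply follow the field-plus-"Index" convention from the code above):

// Sketch: peek at a handful of the newly generated index columns
engineeringDF.select("MSZoningIndex", "StreetIndex", "GarageTypeIndex").show(5)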

Image

Congratulations! With the example of StringIndexer, we have successfully completed the preprocessing step in Spark MLlib and cleared the first hurdle of feature engineering. Now, let’s continue our efforts and tackle the second challenge: feature selection.

Feature Selection: ChiSqSelector #

Feature selection, as the name suggests, is the selection of feature fields based on certain criteria.

Taking the housing data as an example, it contains 79 attribute fields, and different attributes affect the house price to very different degrees. Obviously, features such as house age and the number of bedrooms matter far more than the heating method. Feature selection is exactly this process: keeping key features such as house age and the number of bedrooms for modeling, while discarding features such as the heating method that contribute little to the prediction target (the house price).

It is not difficult to see that in the example above, we used everyday life experience as the criterion for selecting feature fields. In practice, when faced with a large number of candidate features, business experience is often an important starting point for feature selection. In internet business scenarios such as search, recommendation, and advertising, we respect the experience of product managers and business experts and combine their feedback to make an initial screening of the candidate feature set.

At the same time, we also use some statistical methods to calculate the correlation between candidate features and the prediction target, thus quantifying the importance of different features for the prediction target.

Statistical methods not only validate the effectiveness of expert experience but also complement it. Therefore, when doing feature engineering in daily work, we often combine the two to perform feature selection.

Image

Business experience varies depending on the scenario and cannot be generalized, so let’s focus on quantifiable statistical methods. The principles of statistical methods are not complicated. Essentially, they are based on different algorithms (such as Pearson correlation coefficient, chi-square distribution) to calculate the correlation between candidate features and the prediction target. However, you may ask, “I am not a statistics major, do I need to learn these statistical methods first to do feature selection?”

Don’t worry, it is not necessary. Spark MLlib framework provides us with various feature selectors, which encapsulate different statistical methods. To do feature selection well, we only need to understand how to use these selectors, without worrying about the specific statistical methods behind them.

Taking ChiSqSelector as an example, it encapsulates the chi-square test, which is based on the chi-square distribution. Even if you are not familiar with how the chi-square test works, that does not prevent us from using ChiSqSelector to perform feature selection with ease.

Next, let's continue with the "house price prediction" project and walk through the usage and caveats of ChiSqSelector. Because these selectors rely on quantitative statistics, Spark MLlib's selectors can only be applied to numerical fields. To select among the numerical fields with ChiSqSelector, we need to complete two steps:

  • The first step is to create a feature vector using VectorAssembler.
  • The second step is to perform feature selection based on the feature vector using ChiSqSelector.

VectorAssembler originally belongs to the category of vector computation in feature engineering. However, in the Spark MLlib framework, the input parameters of many feature processing functions are feature vectors, such as the ChiSqSelector we are going to talk about now. Therefore, let’s start with a brief introduction to VectorAssembler.

The purpose of VectorAssembler is to combine multiple numerical columns into a single feature vector. Taking three numerical columns of housing data, “LotFrontage”, “BedroomAbvGr”, and “KitchenAbvGr”, as an example, VectorAssembler can combine them into a new vector field, as shown in the following image.

Image

The usage of VectorAssembler is simple. After initializing a VectorAssembler instance, call setInputCols to pass in the list of numerical fields to be transformed (e.g., the 3 fields in the above image), and use setOutputCol to specify the feature vector field to be generated (e.g., the “features” field in the above image). Next, let’s demonstrate the specific usage of VectorAssembler with code.

// All numerical fields, a total of 27 fields
val numericFields: Array[String] = Array("LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF", "GrLivArea", "BsmtFullBath", "BsmtHalfBath", "FullBath", "HalfBath", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces", "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch", "PoolArea")
// Predicted label field
val labelFields: Array[String] = Array("SalePrice")

import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.functions.col

// Convert all numeric fields (and the label) from string to integer type
for (field <- (numericFields ++ labelFields)) {
  engineeringDF = engineeringDF
    .withColumn(s"${field}Int", col(field).cast(IntegerType))
    .drop(field)
}

import org.apache.spark.ml.feature.VectorAssembler

// Get all integer numeric fields
val numericFeatures: Array[String] = numericFields.map(_ + "Int").toArray

// Define and initialize VectorAssembler
val assembler = new VectorAssembler()
.setInputCols(numericFeatures)
.setOutputCol("features")

// Apply VectorAssembler to DataFrame and generate the "features" vector column
engineeringDF = assembler.transform(engineeringDF)

In this snippet, the part worth focusing on is the last few lines. First, we define and initialize a VectorAssembler instance: we pass numericFeatures, the array containing the names of all the integer numeric fields, to setInputCols, and use setOutputCol to name the output column "features". Then we call VectorAssembler's transform function to apply the transformation to engineeringDF.

After the transformation, engineeringDF will contain a column named “features” that contains the concatenated feature vector of all the numeric features.
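
If you want to see the assembled vectors with your own eyes, a quick peek works (just a sketch):

// Sketch: inspect a few rows of the assembled feature vector column
engineeringDF.select("features").show(3, truncate = false)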

Now that the feature vector is prepared, we can proceed with feature selection based on it. Let’s see the code.

import org.apache.spark.ml.feature.ChiSqSelector
import org.apache.spark.ml.feature.ChiSqSelectorModel

// Define and initialize ChiSqSelector
val selector = new ChiSqSelector()
.setFeaturesCol("features")
.setLabelCol("SalePriceInt")
.setNumTopFeatures(20)

// Use the fit function to perform chi-square test on the DataFrame
val chiSquareModel = selector.fit(engineeringDF)

// Get the indices of the selected features from the fitted model
val indices: Array[Int] = chiSquareModel.selectedFeatures

import scala.collection.mutable.ArrayBuffer

val selectedFeatures: ArrayBuffer[String] = ArrayBuffer[String]()

// Map the selected feature indices back to the original field names
for (index <- indices) {
  selectedFeatures += numericFields(index)
}

We start by defining and initializing a ChiSqSelector instance. We use setFeaturesCol and setLabelCol to specify the feature vector and the predicted label. After all, the ChiSqSelector encapsulates the chi-square test, which requires correlating the features with the predicted label to quantify the importance of each feature.

Next, we need to specify how many features we want to select from the 27 numeric features. We pass the parameter 20 to setNumTopFeatures, which means that ChiSqSelector will select the top 20 features that have the most significant impact on the housing price from the 27 features.

Once the ChiSqSelector instance is created, we call the fit function to perform the chi-square test on engineeringDF and obtain the chi-square model chiSquareModel. By accessing the selectedFeatures variable of chiSquareModel, we get the indices of the selected features. Combined with the original numeric fields array, we can obtain the names of the selected original data columns.

By now, you might be a bit confused. Don’t worry. With the help of the following diagram, you can better understand the workflow of ChiSqSelector. Let’s continue using the example of “LotFrontage,” “BedroomAbvGr,” and “KitchenAbvGr” fields.

Image

As you can see, for housing price prediction, ChiSqSelector considers the first two fields more important than the number of kitchens. Therefore, in the selectedFeatures array, ChiSqSelector records the indices 0 and 1, which correspond to the original fields “LotFrontage” and “BedroomAbvGr.”
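
As a side note, besides reading out the indices, a ChiSqSelector can also emit the reduced feature vector directly as a new column. Here is a hedged sketch; the "selected" column name is purely illustrative and not part of the original project:

// Sketch: configure an output column so the fitted model appends
// a vector holding only the 20 selected features
val selectorWithOutput = new ChiSqSelector()
  .setFeaturesCol("features")
  .setLabelCol("SalePriceInt")
  .setNumTopFeatures(20)
  .setOutputCol("selected")

val selectedDF = selectorWithOutput.fit(engineeringDF).transform(engineeringDF)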

Image

Got it? We have learned about the usage of feature selection in Spark MLlib, represented by ChiSqSelector, and completed the second part of feature engineering. Let’s keep going and challenge the next part: normalization.

Normalization: MinMaxScaler #

The purpose of normalization is to map a set of values to a unified range, usually [0, 1]. In other words, regardless of the magnitude of the original data sequence, whether it’s 10^5 or 10^-5, normalization will scale them to the range of [0, 1].

This may sound abstract, so let’s take the “LotArea” and “BedroomAbvGr” fields as examples. The “LotArea” represents the area of a house, measured in square feet with a magnitude of 10^5, while the “BedroomAbvGr” represents the number of bedrooms, with a magnitude of 10^1.

Assuming we use the MinMaxScaler provided by Spark MLlib for normalizing the housing data, both of these columns will be scaled to the range of [0, 1], eliminating the dimensional differences caused by different units.

You might ask, “Why do we need normalization? Doesn’t the original data work just fine?”

The original data is fine, but the differences in scale between its fields are not. When features differ greatly in scale, gradient descent during model training becomes unstable, and convergence is slow and inefficient. Conversely, when all feature values are constrained to the same range, training efficiency improves markedly. We will delve further into model training and optimization in the next lesson; for now, understanding why normalization is necessary is sufficient.

Since normalization is so important, how can we implement it specifically? It’s actually quite simple, just one function can do it.

Spark MLlib supports several normalization functions, such as StandardScaler and MinMaxScaler. Although they use different formulas (StandardScaler, for instance, rescales each feature to unit standard deviation, optionally centering it to zero mean, rather than squeezing it into [0, 1]), their purpose is the same: putting all features on a comparable scale.
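
StandardScaler, for instance, follows exactly the same setInputCol/setOutputCol plus fit/transform pattern. The sketch below is illustrative only; it reuses the "features" vector column we assembled earlier, and "featuresStd" is just a name chosen for this example:

import org.apache.spark.ml.feature.StandardScaler

// Sketch: by default StandardScaler rescales each dimension to unit standard deviation
val stdScaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("featuresStd")

val standardizedDF = stdScaler.fit(engineeringDF).transform(engineeringDF)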

Let's take MinMaxScaler as an example. For any value ei in the "LotArea" column, MinMaxScaler normalizes it using the following formula:

ei' = (ei - Emin) / (Emax - Emin) * (max - min) + min

Here, “max” and “min” are the upper and lower limits of the target range, which are typically set to 1 and 0, respectively. In other words, the target range is [0, 1]. “Emax” and “Emin” are the maximum and minimum values in the “LotArea” column. Using this formula, MinMaxScaler maps all values in “LotArea” to the range of [0, 1].
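
To make the formula concrete, here is a tiny hand-rolled sketch of the same calculation. The LotArea bounds below are hypothetical illustration values, not figures taken from the dataset:

// A hand-written sketch of the MinMaxScaler formula, for intuition only.
// eMin/eMax are the column's min and max; min/max are the target range bounds.
def minMaxScale(e: Double, eMin: Double, eMax: Double,
                min: Double = 0.0, max: Double = 1.0): Double =
  (e - eMin) / (eMax - eMin) * (max - min) + min

// Hypothetical example: if LotArea ranged from 1,300 to 215,245 square feet,
// a 10,000 square-foot lot would map to roughly 0.04.
val scaled = minMaxScale(10000, 1300, 215245) // about 0.0407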

Next, let’s demonstrate the specific usage of MinMaxScaler with code examples.

Like many feature processing functions (such as the previously mentioned ChiSqSelector), the input parameter of MinMaxScaler is also a feature vector. Therefore, the usage of MinMaxScaler can be divided into two steps:

  1. The first step is to use VectorAssembler to create a feature vector.
  2. The second step is to use MinMaxScaler on the feature vector to perform normalization.

// All numeric fields of type Int (the same array we built earlier)
val numericFeatures: Array[String] = numericFields.map(_ + "Int").toArray

// Loop through each numeric field
for (field <- numericFeatures) {

  // Define and initialize VectorAssembler
  val assembler = new VectorAssembler()
    .setInputCols(Array(field))
    .setOutputCol(s"${field}Vector")

  // Transform each field from Int to Vector type
  engineeringDF = assembler.transform(engineeringDF)
}

In the first step, we use a for loop to iterate over all the numeric fields and, for each field, initialize a VectorAssembler instance that wraps the Int column into a single-element Vector column. Then, in the second step, we can pass all of these vector columns to MinMaxScaler for normalization. As you can see, the usage of MinMaxScaler is similar to that of StringIndexer.

import org.apache.spark.ml.feature.MinMaxScaler

// Select all vector data columns
val vectorFields: Array[String] = numericFeatures.map(_ + "Vector").toArray

// Scaled data columns after normalization
val scaledFields: Array[String] = vectorFields.map(_ + "Scaled").toArray

// Loop through all vector data columns
for (vector <- vectorFields) {

  // Define and initialize MinMaxScaler
  val minMaxScaler = new MinMaxScaler()
    .setInputCol(vector)
    .setOutputCol(s"${vector}Scaled")
    
  // Use MinMaxScaler to perform normalization on vector data columns
  engineeringDF = minMaxScaler.fit(engineeringDF).transform(engineeringDF)
}

First, we create an instance of MinMaxScaler, and then pass the original vector data column and the normalized data column to the setInputCol and setOutputCol functions, respectively. Next, we call the fit and transform functions in sequence to perform normalization on the targeted fields.

After executing this snippet of code, engineeringDF will contain multiple columns with the suffix "Scaled", which hold the normalized values of the original fields, as shown below:

Image
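
You can also compare an individual field before and after scaling; for example (a sketch, with column names following the naming convention in the code above):

// Sketch: put an original vector column next to its normalized counterpart
engineeringDF.select("GarageAreaIntVector", "GarageAreaIntVectorScaled").show(5)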

Alright, that’s it! We have learned the usage of data normalization in the Spark MLlib framework, represented by MinMaxScaler, and overcome the third challenge in feature engineering.

Image

Key Review #

Alright, we have finished today’s content, let’s summarize together. In today’s lecture, we mainly focused on feature engineering. You need to master the feature processing methods in different stages of feature engineering, especially those most representative feature processing functions.

From raw data to generating training samples, feature engineering can be divided into the following stages. Today, we specifically explained the first three stages, namely preprocessing, feature selection, and normalization.

Image

For different stages, Spark MLlib framework provides a rich set of feature processing functions. As a representative of the preprocessing stage, StringIndexer is responsible for the preliminary processing of non-numeric features, converting strings that the model cannot directly consume into numerical values.

The motivation for feature selection is to extract features that are more closely related to the prediction target, thereby reducing the size of the model and improving the model’s generalization ability. Feature selection can be approached from two aspects: expert knowledge based on business considerations and data-based statistical analysis.

Image

Based on different statistical methods, Spark MLlib provides various feature selectors. Among them, ChiSqSelector is based on the chi-square test and selects the top N features with the highest relevance.

The purpose of normalization is to remove the influence of different dimensions between features, avoiding problems such as gradient oscillation and low convergence efficiency due to inconsistent dimensions. The specific method of normalization is to scale different features to the same range. In this regard, Spark MLlib provides multiple normalization methods for developers to choose from.

In the next lecture, we will continue with the remaining three stages: discretization, embedding, and vector operations. Finally, we will take an overall look at the model's performance after the optimizations made at each stage. Stay tuned.

Exercises for Each Lesson #

Can you explain the differences and similarities between the feature processing functions we discussed today, such as StringIndexer, ChiSqSelector, and MinMaxScaler?

Feel free to leave a message in the comments section to interact with me. I also recommend that you share today’s content with more colleagues and friends to discuss feature engineering-related topics together.