
26 Model Training I - Detailed Explanation of Decision Tree Algorithms #

Hello, I’m Wu Lei.

In the previous lesson, we focused on feature engineering in machine learning and on the feature processing functions supported by the Spark MLlib framework. Using the linear regression model as a baseline, we compared model performance under different feature processing methods. Generally speaking, linear models have limited capacity and are only suited to fitting linear relationships between feature vectors and prediction targets.

However, purely linear relationships are rare in practical applications. In the “house price prediction” project, for example, the relationship between the various house attributes and the house price is clearly not a simple linear one, which is exactly why the prediction error of the linear regression model has remained consistently high in this task. Therefore, to improve the accuracy of house price prediction, we need to consider other types of model algorithms, especially non-linear models, from the perspective of model selection.

The Spark MLlib framework supports a variety of model algorithms. In order to cover the content as comprehensively as possible while reducing your learning burden, I have divided the model training into three lessons: Part 1, Part 2, and Part 3. In today’s lesson, we will focus on explaining the decision tree algorithms.

In the following two lessons, I will combine the house price prediction and movie recommendation scenarios to help you master the Spark MLlib model algorithms in practice, so that you can skillfully carry out model selection and model tuning in different scenarios.

Course Schedule #

Because the content of model training is quite extensive, in order to give you a clear learning plan, let’s first explain the course schedule. In the field of machine learning, if we classify machine learning problems based on whether there is a prediction target label, we can divide them into supervised learning and unsupervised learning. Spark MLlib supports both types of machine learning algorithms, as shown in the diagram below.

[Image: model algorithms supported by Spark MLlib, grouped into supervised and unsupervised learning]

As you can see, under the development framework of Spark MLlib, supervised learning is further divided into regression, classification, and collaborative filtering according to different usage scenarios. Unsupervised learning is divided into clustering and frequent pattern mining.

Under each category, Spark MLlib supports a variety of model algorithms. If we went through the principles and usage of every algorithm one by one, it would not only be tedious but also hard to remember. Therefore, for each category, I will pick the most representative algorithm and explain it with examples, so that you come away from the course with a deeper impression.

[Image: the representative algorithm chosen for each category]

Corresponding to the 5 sub-categories, the model training lessons also include 5 examples: house price prediction, house classification, movie recommendation 1, house clustering, and movie recommendation 2. Based on their data sources, these 5 examples can be further grouped into two types, as shown in the diagram below.

To accommodate students with weaker foundations, we first need to cover some prerequisite knowledge: decision trees, GBDT (Gradient Boosted Decision Trees), and RF (Random Forest). After this lesson, you will notice an interesting phenomenon: the principles behind these algorithms are surprisingly similar to the way humans make decisions, yet the machine can often go beyond what human experience alone achieves.

Okay, let’s officially start today’s lesson.

Decision Tree Algorithms #

“Double Eleven” is just around the corner, and you may be eager to go on a shopping spree. But the moment you check your wallet, reason takes over. Imagine: with a limited budget, how would you choose a mobile phone? We usually weigh a series of factors such as price, brand, and reviews before making a decision.

In machine learning, a decision path constructed from such decisive factors is called a decision tree. Next, let’s describe what a decision tree is in more precise terms.

A decision tree is a tree-like structure built from the feature vectors of the samples. It consists of nodes and directed edges, and the nodes fall into two categories: internal nodes and leaf nodes. Internal nodes represent sample features, while leaf nodes represent classification results.

For example, suppose we want to divide houses into 5 categories based on the features of “number of bedrooms” and “house area”. We can construct a decision tree to achieve this, as shown in the figure below.

[Image: a decision tree that classifies houses using the “number of bedrooms” and “house area” features]

In the figure, the elliptical shapes represent internal nodes. Each internal node contains a feature and has two outgoing directed edges, and each directed edge represents a range of feature values. For example, the root node (the topmost internal node) of the decision tree contains the feature “number of bedrooms”: the left directed edge covers the data samples with fewer than 4 bedrooms, while the right directed edge covers the data samples with 4 or more bedrooms.

In this way, the original house samples are first divided into two parts based on the number of bedrooms. The samples that “fall” to the left side are then split on whether the “house area” is less than 6, while the samples that “fall” to the right side are split on whether the “house area” is less than 10. Layer by layer, the data samples are differentiated according to different value ranges of different features, until they reach the circular nodes, that is, the leaf nodes.

Leaf nodes represent the classification of data samples. The 5 circular nodes in the figure correspond to the 5 different values of the label. Therefore, samples falling on the blue circular node have a house quality category of “poor”, and samples falling on the yellow circular node have a house quality category of “excellent”. In this way, we have completed the process of classifying the original samples based on the “house quality”.
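To make the traversal concrete, here is a minimal Scala sketch that hard-codes the splits described above. The thresholds (4 bedrooms, house areas of 6 and 10) come from the text; the leaf names are placeholders, and since only two levels of splits are spelled out, this sketch ends with four leaves rather than the five shown in the figure.

```scala
// A hand-coded version of the decision tree described above: a decision
// tree is essentially a nested chain of feature comparisons.
case class House(bedroomNum: Int, area: Double)

def classify(house: House): String = {
  if (house.bedroomNum < 4) {
    // Left branch: fewer than 4 bedrooms, split again on house area
    if (house.area < 6) "Set1" else "Set2"
  } else {
    // Right branch: 4 or more bedrooms, split on a different area threshold
    if (house.area < 10) "Set3" else "Set4"
  }
}

// Example: a house with 5 bedrooms and an area below 10 falls into "Set3".
println(classify(House(bedroomNum = 5, area = 8.0)))
```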

In practice, the regression process is similar. If the label of the data samples is not the discrete “house quality” but the continuous “house price”, we can use the same decision tree to make regression predictions. Suppose we build the decision tree above from 100 data samples, and assume each leaf node ends up containing 20 of them.

When a new data sample arrives and we need to predict its house price, we simply let it traverse the decision tree and see which leaf node it falls into. Suppose it falls into the Set3 node; then, to predict the house price of this sample, we just take the average house price of the 20 samples in Set3.
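If you want to try the regression version in Spark MLlib right away, here is a minimal sketch intended for spark-shell (so a SparkSession named spark already exists). The toy data and the column names bedroomNum, area, and price are assumptions for illustration, not the course’s actual dataset.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.DecisionTreeRegressor

// Toy data standing in for the house samples; the column names are assumed.
val houses = spark.createDataFrame(Seq(
  (2, 80.0, 3000.0),
  (3, 95.0, 3600.0),
  (4, 120.0, 5200.0),
  (5, 150.0, 6800.0)
)).toDF("bedroomNum", "area", "price")

// Assemble the raw columns into the single feature vector the model expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("bedroomNum", "area"))
  .setOutputCol("features")

val trainingData = assembler.transform(houses)

// A regression tree: the prediction for a new sample is the average label
// of the training samples that fall into the same leaf node.
val dtr = new DecisionTreeRegressor()
  .setFeaturesCol("features")
  .setLabelCol("price")
  .setMaxDepth(3)

val dtrModel = dtr.fit(trainingData)
dtrModel.transform(trainingData).select("features", "price", "prediction").show()
```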

Alright, so far, we have introduced what a decision tree is and how to use a decision tree to predict new data samples. It is not difficult to see that the inference process of a decision tree is very similar to the decision-making process of humans.

Humans also often compare three or more options, make decisions based on life experience, and consider key factors. Speaking of which, you may be curious, “When I make a decision, I often take into account life experience. What criteria does the model algorithm use to build the decision tree? How does it know which features are decisive and which ones are not very useful?”

In a nutshell, the purity of the data samples determines which features the algorithm selects as internal nodes, and it also determines when the decision tree converges. Sample purity, simply put, is about the diversity (cardinality) of the labels. For a set of samples, if all the label values are the same, that is, the label diversity is 1, then we say the purity of this set is high.

Conversely, if the label values in the set are highly diverse, we say that the purity of this set is low. Mathematically, sample purity (or label diversity) can be quantified with information entropy, but since this is an introductory course, we do not need to dig into the math for now; just keep the concept of sample purity in mind.

When building a decision tree, the algorithm traverses each candidate feature and examines its “purification” capability. “Purification” here refers to how much the purity improves in the two sample subsets obtained by splitting the original samples on that feature. In other words, the higher the purity of the subsets produced by splitting on a candidate feature, the stronger that feature’s purification capability.
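For readers curious about the information entropy just mentioned, the standard textbook definition is the following; you can safely skip it for now:

$$
H(S) = -\sum_{k=1}^{K} p_k \log_2 p_k
$$

Here $p_k$ is the proportion of samples in the set $S$ whose label takes the $k$-th value. When all samples share one label, $H(S) = 0$ (highest purity); the more evenly the labels are spread across values, the larger $H(S)$ becomes (lowest purity). The drop in entropy after a split is the information gain, which is exactly the “purification capability” described above.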

Based on this logic, the algorithm picks the features with the strongest purification capability one by one, from the strongest to the next strongest, and uses them to grow the decision tree step by step until it converges. As for the convergence condition, on the one hand we can set a threshold on the purity gain, and on the other hand we can limit growth by capping the depth and number of levels of the tree.
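In Spark MLlib, these convergence knobs map directly onto decision tree parameters. Below is a minimal sketch, assuming a DataFrame named trainingDF that already has a “features” vector column and a numeric “label” column (for example, the house-quality category encoded as an index); the specific parameter values are purely illustrative.

```scala
import org.apache.spark.ml.classification.DecisionTreeClassifier

// Purity-related knobs on Spark MLlib's decision tree:
//  - setImpurity: the purity measure used to evaluate splits ("entropy" or "gini")
//  - setMinInfoGain: keep a split only if it improves purity by at least this much
//  - setMaxDepth: cap the depth of the tree so growth eventually stops
val dtc = new DecisionTreeClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setImpurity("entropy")
  .setMinInfoGain(0.01)
  .setMaxDepth(5)

val dtcModel = dtc.fit(trainingDF)

// toDebugString prints the learned splits, mirroring the tree structure above.
println(dtcModel.toDebugString)
```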

Ideally, we would like the information entropy of each leaf node (which quantifies its purity) to be as close to 0 as possible, meaning that the samples within each leaf all share the same label. In practical work, however, this is difficult to achieve. Moreover, generally speaking, the fitting ability of a single decision tree is rather limited, and it often cannot raise the sample purity far enough.

This is where GBDT (Gradient-boosted Decision Trees) and RF (Random Forest) come in. Although their design ideas are different, they are essentially aimed at further improving the purity of data samples.

Random Forest #

Random Forest builds on the idea that “three cobblers with their wits combined can surpass Zhuge Liang”. Since the fitting ability of a single tree is limited, we can use multiple trees to make up for it. After all, as the saying goes, two heads are better than one.

For example, if we want to classify the quality of houses by combining multiple features, the Random Forest algorithm will train multiple decision trees for a given dataset. The trees are independent of each other and have no dependencies. For each tree, the algorithm randomly selects a subset of the samples and features to construct the decision tree, which is why the term “random” is used.

[Image: a Random Forest consisting of three decision trees, each built from a different subset of features]

Taking the above figure as an example, the Random Forest algorithm constructs three decision trees. The first tree uses the “number of bedrooms” and “house area” features, the second tree selects the “building age”, “decoration condition”, and “house type” features, and the last tree selects the “whether it has a swimming pool”, “house area”, “decoration condition”, and “number of kitchens” features.

Each tree divides the samples it sees into 5 categories, each containing a portion of the samples. When a new data sample needs a house-quality prediction, we feed it to the three trees in the random forest, and the prediction depends on the outputs of all three trees.

Suppose a sample lands in Set3 according to the first tree, in Set2 according to the second tree, and in Set3 according to the third tree. The final prediction for this sample is then Set3. In other words, following the principle of “the majority rules”, the final prediction of the random forest is decided by the majority of the individual trees’ results. For regression problems, the simplest approach is to average the outputs of all the decision trees.
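In Spark MLlib, the random sampling of rows and the random selection of features are controlled by parameters on RandomForestClassifier. The sketch below follows the same assumptions as before (a trainingDF DataFrame with “features” and “label” columns); the parameter values are illustrative.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier

// Random Forest: several independent trees, each trained on a random subset
// of the samples; class predictions are combined by majority vote
// (for regression, RandomForestRegressor averages the trees' outputs).
val rfc = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setNumTrees(3)                   // the example above uses three trees
  .setSubsamplingRate(0.8)          // each tree sees a random ~80% of the samples
  .setFeatureSubsetStrategy("sqrt") // consider a random sqrt(#features) features at each split
  .setMaxDepth(5)

val rfcModel = rfc.fit(trainingDF)
```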

GBDT #

Next, let’s talk about GBDT (Gradient-boosted Decision Trees). Similar to Random Forest, GBDT also uses multiple decision trees to fit the data samples, but there is a dependency between the trees. The construction of each tree is based on the training results of the previous tree. Therefore, unlike Random Forest, the design idea of GBDT is to “stand on the shoulders of giants”, as shown in the following figure.

[Image: GBDT builds its trees sequentially, each tree depending on the previous one (“standing on the shoulders of giants”)]

Specifically, during the training process of GBDT, the construction of each tree is based on the “sample residuals” output by the previous tree. As shown in the following figure, the difference between the predicted value and the ground truth is the sample residual. The fitting target of the subsequent decision tree is no longer the original house price but this sample residual.

[Image: the sample residual is the difference between the predicted value and the ground truth]

By extension, each subsequent decision tree fits the residuals left by the tree before it, gradually shrinking the gap between the predicted value and the ground truth until it approaches 0. It is not hard to see that, as long as GBDT trains enough decision trees, the prediction error can be made as small as we like, which is why GBDT’s fitting ability is so strong.
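Written out in symbols, GBDT’s prediction is the sum of the trees’ outputs, and each new tree fits whatever error is still left over (the concrete numbers below are made up purely for illustration):

$$
\hat{y}(x) = f_1(x) + f_2(x) + \cdots + f_M(x), \qquad f_m \text{ is trained to fit } \; y - \sum_{i=1}^{m-1} f_i(x)
$$

For example, if the true price is 100 and the first tree predicts 90, the second tree is trained on the residual 10; if it outputs 8, the combined prediction becomes 98 and the remaining residual shrinks to 2, and so on for each additional tree.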

At the same time, however, we need to watch out for GBDT’s tendency to overfit: a model that fits the training set too closely often performs poorly on the test set. The remedy for overfitting is to make the model simpler rather than more complex. To that end, we can limit the number and depth of the decision trees to reduce the complexity of the GBDT model.
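In Spark MLlib, the “number and depth of trees” mentioned above correspond to parameters on GBTRegressor. Here is a minimal sketch under the earlier assumptions (the trainingData DataFrame with “features” and “price” columns), again with illustrative parameter values.

```scala
import org.apache.spark.ml.regression.GBTRegressor

// GBDT for house-price regression. Keeping maxIter (number of trees),
// maxDepth (size of each tree), and stepSize (learning rate) modest
// limits model complexity and helps control overfitting.
val gbt = new GBTRegressor()
  .setFeaturesCol("features")
  .setLabelCol("price")
  .setMaxIter(50)    // number of boosted trees
  .setMaxDepth(4)    // depth of each individual tree
  .setStepSize(0.1)  // shrinks each tree's contribution

val gbtModel = gbt.fit(trainingData)
```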

Alright, so far we have learned about decision trees, as well as the Random Forest and GBDT algorithms derived from decision trees. In the next lesson, we will use house price prediction and house classification as examples to understand how to apply these algorithms to solve real-world problems using the Spark MLlib framework.

Key Review #

Alright, that’s all for today’s content. Let’s summarize what we’ve learned.

First, you need to know which model algorithms are supported by the Spark MLlib development framework. I have organized these model algorithms and their categories in the mind map below for your reference.

[Image: mind map of the model algorithms supported by Spark MLlib and their categories]

You need to understand the characteristics and basic principles of decision tree algorithms. Decision tree algorithms can be used to solve both classification and regression problems. Compared to linear models, tree models have stronger nonlinear fitting capabilities and good interpretability. Their working principles are very similar to human thinking. Random Forest and GBDT are two types of ensemble algorithms derived from decision trees.

The design principle of Random Forest is that “three cobblers with their wits combined can surpass Zhuge Liang”. By randomly selecting training samples and features for each tree, Random Forest combines multiple simple models and uses voting to determine the final prediction.

GBDT, on the other hand, follows the idea of “standing on the shoulders of giants” and is also an ensemble model built from multiple trees. Unlike Random Forest, the trees in GBDT depend on one another: each tree is trained to fit the residual left by the previous tree, so the ensemble keeps approaching the true value. GBDT has a very strong fitting ability, but we must stay alert to the overfitting risk brought by deep and numerous decision trees.

Practice for each lesson #

In light of today’s lesson, can you discuss the advantages and disadvantages of the GBDT and Random Forest model algorithms?

Feel free to engage and interact with me in the comments section, and I also encourage you to share the content of this lecture with more colleagues and friends.