11 Loss Function: How to Help Models Learn to Save Themselves #

Hello, I’m Fang Yuan.

In the previous lessons, we covered the necessary foundational knowledge for practical deep learning, including basic operations of PyTorch, NumPy, Tensor characteristics, and usage methods. We also learned about data-related operations and features based on Torchvision. Congratulations on making it this far! With a solid foundation, we have taken another step towards the practical stage.

With the foundational knowledge in place, we can now start learning several important concepts in deep learning.

A deep learning project consists of several main modules, including model design, loss function design, gradient update methods, model saving and loading, and the training process of the model. Each module plays a significant role in the construction of a deep learning project. I have specially prepared a diagram for you to help you grasp their functions as a whole.

(Figure: the main modules of a deep learning project)

In this lesson, let’s start with the loss function. The loss function is a measure of the model’s learning effectiveness, and it can even be said that the training process of the model is actually the process of optimizing the loss function. If you go for a machine learning job interview, you will often be asked about topics like forward propagation and neural networks. In fact, the assessment of these knowledge areas inevitably involves concepts related to the loss function.

Today, I will use the example of recognizing a Rolls-Royce car to help you understand the working principle and common types of loss functions.

A Simple Example #

Think about the general process of learning new knowledge. For example, if I asked you to memorize a word, let’s use an exaggerated example:

Pneumonoultramicroscopicsilicovolcanoconiosis (a lung disease).

In order to memorize this word, you would have to repeatedly look at and remember it. The first time, you might remember the first few letters, then the next time you might remember a few letters from the middle, and then the next time you might remember a few letters from the end. It is through continuous repetition and study that you can grasp the accurate composition of this word. To test your learning progress, the teacher might even ask you to spell the word from memory and compare it to the standard spelling.

The previous example used natural language, but what about visual information? For example, if I show you a picture of a Rolls-Royce car and ask you to remember it, let’s say this is a Rolls-Royce that you can never afford in your lifetime.

(Figure: a Rolls-Royce car)

How would you remember it? Right, you would subconsciously look for the most representative features, such as the square grille in the front, the small figurine standing upright at the front, the angular car body, and so on.

In the future, when you see a car with these characteristic features, you will know it is a Rolls-Royce that you should stay away from. However, if these features change, you may hesitate or doubt whether it is another brand of car.

In fact, the learning process of a model is similar. At the beginning, a model is like a blank piece of paper with no knowledge. As developers, we need to continuously provide the model with data to learn from.

Once the model receives the data, there is a very important step: comparing the model’s own judgment with the real situation in the data. If the deviation or difference is particularly large, then the model needs to correct its judgment and reduce this deviation in some way. This process is repeated until the model can make correct judgments on the data.

Measuring this deviation well is important: it is the key to how the model makes progress in learning. The process of reducing the deviation is called fitting. Next, let’s look at the different situations that can arise when fitting.

Overfitting and Underfitting #

Let’s first learn about the concepts of overfitting and underfitting. To help you understand, let’s use an example of a function curve.

First, let’s assume there are several points in a two-dimensional coordinate system, and we need a function (model) to fit these points as closely as possible through learning. What are the possible results of fitting? Let’s take a look at the image below:

(Figure: underfitting, where the H1 curve only roughly follows the sample points)

In the first image, the blue curve is the first model function we have learned, H1. We can see that H1 does not fit these points very well: the fit between the function and the sample points is poor, and only the rough trend matches. This situation is called “underfitting”.

(Figure: overfitting, where the H2 curve passes through every sample point)

Since there is “under”, there must be “over”. Let’s continue to look at the second image.

In this image, the red curve is the second model function (H2) we have learned, and we can see that the function curve can fit all the points very well.

However, there are two problems: first, the function corresponding to the curve is too complex, not as simple and clear as H1; second, if we add another point near the curve of H2, the curve will have a hard time fitting it well. This situation is called “overfitting”: the fit has simply gone too far.

(Figure: a good fit, a moderately complex curve that fits most points well)

Now let’s look at the third image. The curve here is more reliable: it is not overly complicated, and it fits most of the points well.

Seeing this, you may be wondering why we care so much about the issue of “complexity”. Think of it this way: consider two functions, \(y_1=3x^2 + 2\) and \(y_2=3x^7 + 7x^6 + 6x^2 + 4x+18\). In terms of interpretability, simplicity, and computational complexity, \(y_1\) is much better than \(y_2\).

The more complex a function is, the more computational resources and time it requires in practical work. Of course, we can’t blindly pursue simplicity, otherwise we will underfit.
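If you want to see these three situations for yourself, here is a minimal sketch using NumPy (covered in the earlier lessons). The data and the polynomial degrees are made-up choices for illustration rather than the exact curves from the figures above: a degree-1 polynomial plays the role of H1, and a degree-9 polynomial plays the role of H2.

```python
import numpy as np

# A few noisy sample points generated from a quadratic trend (made-up data)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 12)
y = 3 * x ** 2 + 2 + rng.normal(0, 3, size=x.shape)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)  # fit a polynomial of the given degree
    y_hat = np.polyval(coeffs, x)          # evaluate the fitted curve at the sample points
    error = np.mean((y - y_hat) ** 2)      # average squared deviation on the training points
    print(f"degree {degree}: training error {error:.2f}")

# Typically: degree 1 underfits (large error), degree 9 drives the training error
# very low but wiggles between the points, and degree 3 is a reasonable middle ground.
```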

Loss Function and Cost Function #

The concepts of overfitting and underfitting are actually about the performance of the model. Next, let’s take a look at the concepts of loss function and cost function, which are methods we use to measure “deviation” and “effect”.

Let’s continue with the example of using functions. Suppose in the previous two-dimensional space, the true function corresponding to any point is F(x). The function fitted by our model through learning is f(x). According to the learning process mentioned earlier, we know that there is an error between F(x) and f(x), which we define as L(x):

\[L(x)=(F(x)-f(x))^{2}\]

Here, we square the difference between F(x) and f(x) so that the error is always a positive value, which is convenient for subsequent calculations. Of course, you could also use the absolute value; when we talk about gradient updates in a later lesson, you will see why the squared difference is more convenient than the absolute value. For now, just keep this in mind and let’s get back on track.

With L(x), we have a measure of how well the fitted function performs; such a measure is called a loss function. As the formula shows, the smaller the loss function, the better the fitted function matches the true situation. Note that there are many types of loss functions, and L(x) is just the first one we have learned.

Next, let’s expand from any single point to all points; together, these points form a training set. Averaging the fitting errors of all the points in the set gives the following formula:

\[\frac{1}{N} \sum_{i=1}^{N}\left(F(x_{i})-f(x_{i})\right)^{2}\]

This function is called the cost function, which is the average value of the fitting errors of all samples in the training sample set. We also call the cost function empirical risk.

In fact, in practical applications, we do not strictly distinguish between the loss function and the cost function. You only need to know that the loss function is the error of a single sample point, and the cost function is the error of all sample points. Once you understand this, it doesn’t matter if you mix up the names.
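To make this distinction concrete, here is a tiny sketch with made-up numbers: the loss is computed for each single sample, and the cost is simply their average over the whole set.

```python
import numpy as np

# Made-up true values F(x_i) and model predictions f(x_i)
true_values = np.array([1.0, 2.0, 3.0, 4.0])
predictions = np.array([1.5, 1.8, 3.3, 3.6])

per_sample_loss = (true_values - predictions) ** 2  # loss: error of each single sample
cost = per_sample_loss.mean()                       # cost: average error over the whole set

print(per_sample_loss)  # roughly [0.25 0.04 0.09 0.16]
print(cost)             # roughly 0.135
```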

Common Loss Functions #

After understanding the definition of loss functions, let’s take a look at the common loss functions.

Strictly speaking, there are infinitely many types of loss functions. This is because the loss function is used to measure the difference between the model’s fitting effect and the true value, and the measurement method needs to be specifically customized according to the characteristics of the problem or the aspect to be optimized. Therefore, the types of loss functions are endless.

As a beginner, I recommend starting with some commonly used loss functions. Today, let’s take a look at the 5 most basic ones.

0-1 Loss Function #

Suppose we have a classification problem, such as asking the model to determine whether the user input is a number. The model’s prediction has only two possibilities: “yes” and “no”.

Therefore, we easily come up with a simple evaluation method: if the model predicts correctly, the loss function value is 0 because there is no error; if the model predicts incorrectly, the loss function value is 1. This is the simplest 0-1 loss function, which is represented by the following formula:

\[L(F(x), f(x)) = \begin{cases}0 & \text{if } F(x) = f(x)\\1 & \text{if } F(x) \ne f(x)\end{cases}\]

In this formula, F(x) is the true class of the input data, and f(x) is the predicted class by the model. It’s simple, isn’t it?

However, the 0-1 loss function is rarely used for training, because the gradient update and backpropagation commonly used when training models require a loss function that can be differentiated. The 0-1 loss is piecewise constant, so its derivative is 0 almost everywhere (and undefined where the prediction flips), which provides no useful gradient signal.

Nevertheless, you should still understand the 0-1 loss function: it is the simplest loss function and is conceptually important.
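Written as code, the 0-1 loss is just a comparison; this small sketch assumes the labels are simple values such as integers or strings.

```python
def zero_one_loss(true_label, predicted_label):
    """Return 0 when the prediction is correct and 1 when it is wrong."""
    return 0 if true_label == predicted_label else 1

print(zero_one_loss("digit", "digit"))      # 0: correct prediction, no loss
print(zero_one_loss("digit", "not digit"))  # 1: wrong prediction
```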

Squared Loss Function #

When we discussed the definition of loss functions earlier, we gave the example \(L(x)=(F(x)-f(x))^{2}\), which is formally called the squared loss function. Sometimes we add a coefficient of 1/2 in front of it so that, when taking the derivative, it cancels the factor of 2 coming from the squared term.

The squared loss function is the simplest differentiable loss function. It directly measures the distance between the model’s fitting result and the true result. In actual projects, many simple problems, such as handwritten digit classification and flower recognition, can use this simple loss function.
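You can verify the 1/2 trick with PyTorch’s autograd. In this sketch the numbers are arbitrary; the point is only that the gradient of \(\frac{1}{2}(F(x)-f(x))^{2}\) with respect to the prediction is simply \(f(x)-F(x)\), with no leftover factor of 2.

```python
import torch

prediction = torch.tensor(2.5, requires_grad=True)  # f(x), the model's output
target = torch.tensor(3.0)                          # F(x), the true value

loss = 0.5 * (target - prediction) ** 2             # squared loss with the 1/2 coefficient
loss.backward()

print(prediction.grad)  # tensor(-0.5000), i.e. f(x) - F(x) = 2.5 - 3.0
```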

Mean Squared Error Loss Function and Mean Absolute Error Loss Function #

Before formally explaining the mean squared error loss function, let’s supplement an important background knowledge: machine learning can be divided into supervised learning and unsupervised learning.

Supervised learning infers a function from labeled training data, which can be seen as a learning process in which a student (the model) is “guided” and “supervised” by a teacher (the data). Supervised learning problems can mainly be divided into two categories: classification and regression. Classification predicts which discrete category an input belongs to, while regression predicts a continuous numerical value from the data.

The mean squared error (MSE) is the most commonly used loss function for regression problems; it is also known as the L2 loss. It is the mean of the squared differences between the predicted values and the target values, defined as follows:

\[MSE=\frac{\sum_{i=1}^{n}\left(s_{i}-y_{i}^{p}\right)^{2}}{n}\]

Here, \(s_{i}\) is the target (true) value of the i-th sample, and \(y_{i}^{p}\) is the corresponding predicted value.

You may have noticed that this looks a lot like the squared loss function, right? Yes, the two forms are essentially equivalent; the only difference is that MSE is the average error over the whole sample set, obtained by summing the individual errors and dividing by n. A coefficient of 1/2 is sometimes added here as well, which is eliminated after taking the derivative.

The Mean Absolute Error (MAE) loss function is another loss function commonly used for regression problems. It measures the mean of the absolute differences between the true values and the predicted values, defined as follows:

\[MAE=\frac{\sum_{i=1}^{n}\left|s_{i}-y_{i}^{p}\right|}{n}\]
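In PyTorch, both of these come ready-made as nn.MSELoss and nn.L1Loss (MAE is also called the L1 loss). The tensors below are made-up values, just to confirm that the modules match the formulas.

```python
import torch
import torch.nn as nn

predictions = torch.tensor([2.0, 3.5, 5.0])  # made-up model outputs y^p
targets = torch.tensor([2.5, 3.0, 4.0])      # made-up true values s

mse = nn.MSELoss()(predictions, targets)     # mean of the squared differences
mae = nn.L1Loss()(predictions, targets)      # mean of the absolute differences

print(mse)  # tensor(0.5000) -> (0.25 + 0.25 + 1.00) / 3
print(mae)  # tensor(0.6667) -> (0.5 + 0.5 + 1.0) / 3
```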

Cross-Entropy Loss Function #

Next, let’s take a look at the cross-entropy loss function.

The concept of entropy may be unfamiliar to some of you, so let’s first briefly understand what it is. Entropy was originally a term in physics that represents the degree of disorder or randomness in a system: the more chaotic a system is, the higher its entropy.

Later, Shannon, the founder of information theory, extended this concept to channel communication, creating information theory. In this context, entropy is also called information entropy. Formally, information entropy can be expressed as:

\[H = -\sum_{i} p(x_{i}) \log p(x_{i})\]

Here, x is a random variable whose set of values is all possible outputs, and p(x) is the probability of each output. The greater the uncertainty of the variable, the greater the entropy, and the more information is needed to pin the variable down.
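Here is a quick numerical check of the formula (using the natural logarithm); the two distributions are made up, one nearly certain and one uniform, to show that more uncertainty means higher entropy.

```python
import numpy as np

def entropy(p):
    """Information entropy H = -sum(p * log p) of a probability distribution."""
    p = np.asarray(p)
    return -np.sum(p * np.log(p))

print(entropy([0.98, 0.01, 0.01]))  # ~0.11, almost no uncertainty
print(entropy([1/3, 1/3, 1/3]))     # ~1.10, maximum uncertainty for 3 outcomes
```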

When we transform the function into the following format, changing log p to log q:

\[-\sum_{i=1}^{n} p(x_{i}) \log q(x_{i})\]

Here, \(p(x)\) represents the true probability distribution and \(q(x)\) represents the predicted probability distribution. This function is the cross-entropy loss function: it measures the error between the true distribution and the predicted distribution, and minimizing it pushes the predicted probability distribution as close as possible to the true one.
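Continuing the small sketch above, here is the cross-entropy between a made-up true distribution p and two candidate predictions q: the closer q is to p, the smaller the value.

```python
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy -sum(p * log q) between the true distribution p and the prediction q."""
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log(q))

p = [1.0, 0.0, 0.0]                       # true distribution: the sample belongs to class 0
print(cross_entropy(p, [0.7, 0.2, 0.1]))  # ~0.36, prediction fairly close to p
print(cross_entropy(p, [0.1, 0.2, 0.7]))  # ~2.30, prediction far from p
```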

Softmax Loss Function #

Softmax is a function used very frequently in deep learning. In some scenarios the raw values span a wide range, and for convenience of calculation or for better gradient updates (we will learn about gradient updates later), we need to map these values to real numbers between 0 and 1 and normalize them so that they sum to 1.

Its formal representation is:

\[S_{j}=\frac{e^{a_{j}}}{\sum_{k=1}^{T} e^{a_{k}}}\]
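A quick sketch with PyTorch shows the effect: three arbitrary scores of very different sizes are squashed into values between 0 and 1 that sum to 1.

```python
import torch

scores = torch.tensor([1.0, 5.0, -2.0])  # arbitrary raw scores ("logits")
probs = torch.softmax(scores, dim=0)     # S_j = exp(a_j) / sum_k exp(a_k)

print(probs)        # approximately tensor([0.0180, 0.9811, 0.0009])
print(probs.sum())  # tensor(1.), the values are normalized
```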

Going back to the cross-entropy loss function: the \(q(x_{i})\) in that formula, which represents the predicted probability distribution, can be replaced by the softmax output \(S_{i}\), that is:

\[-\sum_{i=1}^{n} p(x_{i}) \log (S_{i})\]

This gives us a new function called the softmax loss, also known as cross-entropy loss with softmax. It is a special case of the cross-entropy loss function.
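In PyTorch this combination is exactly what nn.CrossEntropyLoss computes: it applies softmax (in logarithmic form) to the raw scores internally and then takes the cross-entropy against the true class, so you do not apply softmax yourself before calling it. Here is a minimal sketch with made-up scores.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[1.0, 5.0, -2.0]])  # raw model scores for one sample with 3 classes
target = torch.tensor([1])                 # index of the true class

# Built-in softmax + cross-entropy loss
loss = nn.CrossEntropyLoss()(logits, target)

# The same thing written out by hand: softmax first, then -log of the true class probability
manual = -torch.log(torch.softmax(logits, dim=1)[0, 1])

print(loss, manual)  # both approximately 0.0190
```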

There are many types of loss functions, and here I have chosen the most commonly used ones. In the upcoming practical section, we will encounter more loss functions, and I will explain them in more detail.

Summary #

In this lesson, we have learned about the principles of a loss function. For a model, the loss function serves as a measure of its performance. With this measure, the model can determine whether there are biases in its learning process and the magnitude of these biases, thereby achieving self-improvement.

Although we have gone through several formulas today, you do not need to memorize them. What matters is working through each formula once and understanding the principle behind it; without that foundation, you will not be able to tell different loss functions apart.

In practical development, the choice of loss function is crucial and can be just as important as the design of the model network, because without a good guiding loss function all other efforts are in vain. For example, in the simplest case of handwritten digit recognition, the loss function calculates the difference between the model’s output and the true value. Through this loss function, our model can determine whether it has learned correctly or incorrectly, and can truly learn effectively.

Next, we will start learning how to update model parameters using loss functions. This is also a very interesting topic, so stay tuned!

Practice for each lesson #

Is it better to have a smaller value for the loss function?

Feel free to interact with me in the comments section, and I also encourage you to share today’s content with more colleagues and friends.

I am Fang Yuan, see you in the next lesson!