13 Optimization Methods Updating Model Parameters #

Hello, I am Fang Yuan.

In the previous lesson, we learned about concepts such as feedforward networks, derivatives, gradients, and backpropagation. However, to fully understand the learning process of neural networks, we are still missing one important piece: optimization methods. Only by understanding optimization methods can we truly grasp the specific process of backpropagation.

Today we will learn about optimization methods. To help you build a deeper understanding, I have also prepared an example that ties together the content of all three lessons.

Understanding Optimization Methods using the Downhill Route #

Deep learning really consists of three core pieces: model representation, evaluation methods, and optimization methods. Everything we learned in the previous lesson was preparation for the optimization methods.

An optimization method is a process whose goal is to find, among all possibilities, the parameters that give the model the best evaluation performance. Take as an example a function f(x) that contains a set of parameters.

In this example, the objective of the optimization method is to find the weights corresponding to the minimum value of f(x). In other words, optimization is about finding the state that minimizes the model’s loss function, and that state is precisely the model’s weights.

There are many common optimization methods, including gradient descent, Newton’s method, and quasi-Newton methods, each drawing on different mathematical tools. PyTorch encapsulates these optimization methods, so in practical development we can use them directly, saving a lot of time and effort.

However, in order to better understand deep learning, especially the backpropagation process, it is necessary for us to have some understanding of important optimization methods. The gradient descent method, which we will learn in this lesson, is the most widely used optimization method in deep learning.

Gradient descent is actually easy to understand. Let me give you a practical example. Say you go mountaineering with friends during the holidays, and upon reaching the summit you suddenly need the restroom, which sits halfway down the mountain. How would you plan a route to reach it as quickly as possible?

Assuming there is no risk to life, the natural choice is to take the fastest route. In other words, the steeper the terrain, the more likely you can reach the destination quickly.

So, here is a plan: change direction every few steps, and the direction should be towards the steepest direction, which is the direction of the fastest descent. Repeat this process. This is the most intuitive representation of gradient descent.

In the previous lesson, we mentioned that “the direction of the gradient vector is the direction in which the function value grows the fastest, and the opposite direction of the gradient is the direction in which the function decreases the fastest.”

Gradient descent is the most important use of gradients in deep learning. Now let’s describe gradient descent in a relatively rigorous way.

In a multidimensional space, at any point on a differentiable surface we can find a hyperplane tangent to it. Within this hyperplane there are countless directions to move in (think about why), but among them there is exactly one along which the function value decreases the fastest, and that is the direction opposite to the gradient. Each optimization step proceeds along this fastest-descending direction, which is what we call gradient descent.

Concretely, for a surface in three-dimensional space, we can find a tangent plane (a hyperplane in higher dimensions) at any point. This plane contains an infinite number of directions, but only one of them, the opposite of the gradient direction, makes the function decrease the fastest. Let me repeat it once more: each optimization step proceeds along the opposite direction of the gradient, and this is called gradient descent. And which function is it that we make descend fastest? The answer is the loss function.
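To make this concrete, here is a minimal sketch (my own illustration, not code from the lesson) of gradient descent on the single-variable function f(x) = x², whose derivative is f′(x) = 2x. Every step moves against the gradient, scaled by a step size:

# Minimal gradient descent on f(x) = x^2 (illustrative sketch)
def f_grad(x):
    return 2 * x  # derivative of x^2

x = 5.0   # arbitrary starting point
lr = 0.1  # step size, i.e. the learning rate discussed below
for step in range(50):
    x = x - lr * f_grad(x)  # move opposite to the gradient
print(x)  # ends up very close to 0.0, the minimum of f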

Now you should be able to connect several pieces of knowledge together: “In order to obtain the minimum loss function, we use gradient descent to make it reach the minimum value.” The ultimate goal of these two lessons is to make you firmly remember this sentence.

Let’s go back to the previous example.

[Figure: three downhill routes from the summit to the restroom, drawn in red, yellow, and blue]

The red route in the figure is a good route to the restroom, but notice that there are alternatives. Even in a desperate rush down the mountain, we still need to pay attention to how we descend.

For example, the size of the steps matters. If your steps are too large, you may take the yellow route in the figure and end up in a different valley (a local minimum of the function), or oscillate back and forth, repeatedly nearing and overshooting the restroom; the outcome is easy to imagine. If your steps are too small, the trip takes so long that you may give up before reaching the destination (the blue route).

In algorithms, the size of these steps is called the learning rate. Because steps have a finite length, we can in theory never land exactly on the destination; instead we eventually oscillate within some range around the minimum value. Some error remains, but it is an error we can accept.
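You can watch both failure modes with the same toy function from the earlier sketch; the three learning rates below are arbitrary illustrative values:

# Effect of the step size on f(x) = x^2 (illustrative sketch)
def descend(lr, steps=20, x=5.0):
    for _ in range(steps):
        x = x - lr * 2 * x  # gradient of x^2 is 2x
    return x

print(descend(lr=1.1))     # too large: x overshoots and diverges
print(descend(lr=0.0001))  # too small: after 20 steps x has barely moved
print(descend(lr=0.1))     # reasonable: x converges near the minimum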

In practical development, if the loss function has barely changed over a period of time, we take it to have reached the “lowest point” we need; the model is considered to have converged, and training can end.
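In code, that stopping rule often looks like the sketch below; the simulated loss curve and the 1e-3 threshold are my own arbitrary choices, and real projects tune both:

# Stop training once the loss has effectively stopped changing (illustrative sketch)
losses = [1.0 / (epoch + 1) for epoch in range(100)]  # simulated, ever-flattening loss curve

prev_loss = float("inf")
for epoch, loss in enumerate(losses):
    if abs(prev_loss - loss) < 1e-3:  # change below threshold: treat as converged
        print(f"converged at epoch {epoch}")
        break
    prev_loss = loss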

Common Gradient Descent Methods #

After understanding the principles of gradient descent, let’s take a closer look at several of the most commonly used gradient descent optimization methods.

1. Batch Gradient Descent (BGD) #

Linear regression is one of the most commonly used models. For a linear regression model, suppose y is the true data distribution and \(h_\theta(x) = \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n\) is the function obtained through training, where θ denotes the parameters of h, i.e., the weights we want to find.

The loss function J(θ) can be expressed as the following formula:

\[\operatorname{cost}=J(\theta)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{i}\right)-y^{i}\right)^{2}\]

Here, m represents the number of samples. Since we want to minimize the value of the loss function, we need to use the gradient. Do you remember when we repeatedly mentioned “the direction of the gradient vector is the direction of the fastest increase in the function value”? To make the loss function decrease at the fastest rate, we need to go in the opposite direction of the gradient.

First, we take the partial derivative of J(θ) with respect to θ, so we can obtain the gradient for each θ:

\[\frac{\partial J(\theta)}{\partial \theta_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{i}\right)-y^{i}\right) x_{j}^{i}\]
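If it is not obvious where this expression comes from, it is simply the chain rule applied to the squared error in J(θ); spelling out the intermediate step:

\[\frac{\partial J(\theta)}{\partial \theta_{j}}=\frac{1}{2 m} \sum_{i=1}^{m} 2\left(h_{\theta}\left(x^{i}\right)-y^{i}\right) \frac{\partial h_{\theta}\left(x^{i}\right)}{\partial \theta_{j}}=\frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{i}\right)-y^{i}\right) x_{j}^{i}\]

Since \(h_\theta\) is linear in the parameters, the partial derivative of \(h_\theta(x^{i})\) with respect to \(\theta_j\) is just \(x_j^{i}\).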

After obtaining the gradient for each θ, we can update each θ in the direction of descent, i.e.:

\[\theta_{j}^{\prime}=\theta_{j}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{i}\right)-y^{i}\right) x_{j}^{i}\]

Where α is the learning rate we mentioned earlier. After updating θ, we obtain a loss function with a smaller value, and our model becomes closer to the true data distribution.
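Putting the three formulas together, here is a minimal batch gradient descent for linear regression, written with NumPy as a sketch; the synthetic data, true weights, and hyperparameters are all my own illustrative choices:

import numpy as np

# Synthetic linear data: y = 3*x1 + 2*x2 (the true weights are arbitrary)
X = np.random.randn(100, 2)           # m = 100 samples, n = 2 features
y = X @ np.array([3.0, 2.0])

theta = np.zeros(2)                   # the parameters theta we want to learn
alpha = 0.1                           # learning rate
m = len(X)

for _ in range(200):
    error = X @ theta - y             # h_theta(x^i) - y^i for every sample
    grad = X.T @ error / m            # gradient averaged over ALL m samples
    theta = theta - alpha * grad      # update every theta_j at once

print(theta)  # approaches [3.0, 2.0]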

Looking back at the update formula, did you notice the number m? That’s right: in this method, every update computes over all the data and averages the errors of all samples. Bear in mind that in practical development there are often millions or even tens of millions of samples, so a single update would be enormously expensive. This is why another method is needed: stochastic gradient descent.

2. Stochastic Gradient Descent (SGD) #

The characteristic of Stochastic Gradient Descent (SGD) is that the parameters are updated after each computation of a sample, which increases the frequency of parameter updates. The formula is as follows:

\[\theta_{j}^{\prime}=\theta_{j}-\alpha\left(h_{\theta}\left(x^{i}\right)-y^{i}\right) x_{j}^{i}\]

Think about it, what are the benefits of updating parameters after training each data point? That’s right, sometimes we only need to use a subset of the training data to achieve results similar to training with the entire dataset, greatly improving training speed.
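To make the contrast concrete, here is the SGD version of the earlier NumPy sketch, reusing X, y, and m from it; the smaller learning rate is my own choice, since single-sample gradients are noisier:

# SGD variant of the batch sketch above: update after every single sample
theta = np.zeros(2)                   # start over with fresh parameters
lr = 0.05                             # smaller step to tame the noisier updates

for _ in range(20):                   # a few passes over the data
    for i in np.random.permutation(m):        # visit samples in random order
        error_i = X[i] @ theta - y[i]         # error on one sample only
        theta = theta - lr * error_i * X[i]   # immediate per-sample update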

However, you can’t have it all. Although SGD is fast, it has its own issues. Training data inevitably contains some incorrect or noisy samples, so in an iteration that happens to use such a sample, the optimization step will not move in the ideal direction, which may hurt training performance (such as accuracy). In the most extreme case, the model may fail to reach the global optimum and instead get stuck in a local one.

There’s no perfect solution, sometimes we have to give up something in order to get what we want. The Stochastic Gradient Descent method sacrifices a portion of accuracy and increases the number of iterations in exchange for improved overall optimization efficiency.

Of course, the number of additional iterations in this process should still be much smaller than the number of samples.

So how do we compromise, balancing speed and effectiveness as much as possible? Naturally, we would think of using neither all of the data nor a single data point each time, but “some” of the data points. This is exactly what we discuss next: mini-batch gradient descent.

3. Mini-Batch Gradient Descent (MBGD) #

The mini-batch method is currently the most commonly used approach: each update optimizes over a fixed number of samples.

This fixed number is called the batch size. Common batch sizes are powers of 2, such as 32, 128, and 512. A smaller batch size means faster updates, while a larger batch size means slower but more stable updates, because the gradient is averaged over more samples; that stability makes it less likely to fall into a local optimum.

In fact, the specific value of the batch size also needs to be set based on the different characteristics of the project, using empirical methods or continuous experimentation. For example, for image tasks, we tend to set a slightly smaller batch size, while for NLP tasks, a larger batch size can be used.
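A mini-batch version of the same NumPy sketch slices the shuffled data into fixed-size chunks; the batch size of 32 here is just one common choice:

# Mini-batch variant: update once per fixed-size batch of samples
batch_size = 32
theta = np.zeros(2)                   # start over with fresh parameters

for _ in range(100):                  # epochs
    idx = np.random.permutation(m)    # reshuffle the data each epoch
    for start in range(0, m, batch_size):
        batch = idx[start:start + batch_size]
        error = X[batch] @ theta - y[batch]
        grad = X[batch].T @ error / len(batch)  # average over the mini-batch only
        theta = theta - alpha * grad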

Building on stochastic gradient descent, further methods such as momentum and Nesterov momentum have been proposed. If you are interested in this topic, they are well worth reading up on.
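In practice you rarely implement these variants by hand: PyTorch’s torch.optim.SGD already accepts momentum and nesterov arguments. A minimal sketch, with illustrative values:

import torch
import torch.optim as optim

params = [torch.zeros(2, requires_grad=True)]  # stand-in for a model's parameters
optimizer = optim.SGD(params, lr=0.01, momentum=0.9, nesterov=True)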

A simple abstract example #

In the past three lessons (Lesson 11 to 13), we have learned about the concepts of loss functions, backpropagation, and optimization methods (gradient descent). These three concepts are also the most important contents in deep learning, and their core significance lies in enabling the model to continuously learn and improve its performance.

Next, let’s sum up the content of the three lessons through a simple example. Note that the example below is not runnable code; it aims to lay out the steps required in a basic PyTorch training process. You can think of it as a training drill: with this rehearsal done, when we go to the battlefield and implement a real, usable example, we will get twice the result with half the effort.

In a model, we need to set the following contents:

  • Model definition.
  • Loss function definition.
  • Optimizer definition.

Through the following code, let’s understand how to combine the above three contents in actual development. Of course, this code is an abstract version, which is intended to help you quickly understand the idea. The specific code filling still needs to be modified according to the actual project.

import torch.nn as nn # Import the neural network module, which provides the loss functions
import torch.optim as optim # Import the optimization functions provided by PyTorch
from lenet import LeNet # Assume the model we use is called LeNet and lives in lenet.py; import its definition
...
net = LeNet() # Declare an instance of LeNet
criterion = nn.CrossEntropyLoss() # Declare the model's loss function, here cross-entropy
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
# Declare the optimizer: the SGD introduced earlier; the parameters being optimized are LeNet's, and lr is the learning rate

# Start training below
for epoch in range(30): # Set the number of times to train on all data

    for i, data in enumerate(traindata): # traindata: an iterable of batches, e.g. a DataLoader
        inputs, labels = data # Unpack the input data and its corresponding labels
        # First clear the gradients with zero_grad(): PyTorch accumulates gradients by default,
        # so without clearing, the gradient of the second step would equal the first plus the second
        optimizer.zero_grad()
        # Run the forward pass to get the model's current output
        outputs = net(inputs)
        # Compute the loss between the model's output and the true labels
        loss = criterion(outputs, labels)
        # Backpropagate the loss; after this step the gradients are recorded on the parameters
        loss.backward()
        # Use the computed gradients to update the parameters
        optimizer.step()
...

Isn’t this abstract framework clear? First we set up the model, the loss function, and the optimizer. Then, for each batch of data, we compute the model’s output, compute the loss, backpropagate it, and let the optimizer update the parameters. Although the process looks simple, it is the most fundamental and critical loop in deep learning, and it is the core of what these three lessons have taught.

Summary #

In this lesson, we learned about optimization methods and the gradient descent algorithm, and we connected the concepts of loss function, backpropagation, and gradient descent through an example. With this knowledge, given a specific model, you can now train your own deep learning models. Congratulations on patiently completing the lesson.

When you can’t remember the principles of gradient descent, you can review the example of route planning downhill. Our goal is to set a reasonable learning rate (step size) to get as close as possible to our destination (achieve a more ideal fitting effect). In more rigorous terms, as we emphasized repeatedly in the text: In order to minimize the loss function, we need to use the gradient descent method to reach its minimum value.


Let me review the main points of this lesson for you:

  • The reason why we use gradient descent in models is to continuously adjust the fit between the model and the real data through optimization methods.
  • The three commonly used gradient methods are batch, stochastic, and mini-batch. Generally, we prefer the mini-batch gradient descent method.
  • Finally, we summarized several key components required to train a model, such as the loss function and optimization function, using an abstract framework. This part is the most crucial process in deep learning, and I recommend that you pay special attention to it.

Practice for Each Lesson #

Is a larger batch size better?

Feel free to record your questions and achievements in the comments section, and I encourage you to share this lesson with more colleagues and friends.

I’m Fang Yuan, see you in the next lesson!