12 Calculating Gradients: Forward Propagation and Backpropagation in Networks #

Hello, I’m Fang Yuan.

In the previous lesson, we learned about the concept of the loss function and some commonly used loss functions. Do you still remember what we said there: a model can only learn with the help of a loss function. So how exactly does the model learn through the loss function?

In the next two lessons, we will learn about feedforward networks, derivatives and the chain rule, backpropagation, optimization methods, and more. Once you have mastered these topics, you will be able to connect the model's learning process into a coherent whole and thoroughly understand how a model is trained through the loss function.

Now let’s take a look at the simplest feed-forward network.

Feedforward Neural Networks #

A feedforward network, also called a feedforward neural network, is a neural network in which data only "moves forward". It is the simplest kind of neural network, and its defining characteristic is a unidirectional, multi-layer structure. A simplified version of the structure is shown in the figure below:

[Figure: a simplified feedforward network with an input layer (green, left), a hidden layer (red, middle), and an output layer (blue, right)]

Using this diagram, let me walk you through the structure of a feedforward network in detail. On the far left you will see the green neurons, which make up layer 0. They receive the input data, so we call them the input layer.

For example, if we want to train a neural network to model a function \(y=f(x)\), where \(x\) is a vector, then \(x\) has to flow into the model through this green input layer. In this network, the input layer has 5 neurons, which means it can accept a vector of length 5.

Continuing through the diagram, in the middle of the network there is a layer of red neurons, which represent the "internal" part of the model. These neurons are generally not visible to the outside world and not of direct interest to the user, so we call them the hidden layer. In real network models there can be many hidden layers; they form the internal core of the network and are the key part where the model learns.

On the right side of the diagram are the blue neurons, which form the last layer of the network. Once the model's internal computation is finished, the result is passed to the outside world through this layer, so it is called the output layer.

Note that the lines connecting the neurons represent the weights between them; these weights determine how important each node is to the nodes it feeds into.

Now let's look at the name "feedforward neural network" again; it should be easier to understand. In a feedforward network, data flows from the input layer to the first hidden layer, then to the second, the third, and so on, all the way to the output layer, where the final result is produced. The propagation is unidirectional: data can only move forward, never backward.
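
To make this concrete, here is a minimal sketch of such a network in PyTorch (an illustrative example of my own, not code from this course; apart from the 5 inputs, the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

# A tiny feedforward network: 5 inputs -> 4 hidden units -> 3 outputs.
# The connections inside each nn.Linear are the weights w between layers.
model = nn.Sequential(
    nn.Linear(5, 4),   # input layer -> hidden layer
    nn.ReLU(),         # nonlinearity applied in the hidden layer
    nn.Linear(4, 3),   # hidden layer -> output layer
)

x = torch.randn(1, 5)  # one input vector of length 5
y = model(x)           # data flows forward only: input -> hidden -> output
print(y.shape)         # torch.Size([1, 3])
```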

Derivatives, Gradients, and Chain Rule #

Since there is forward data propagation, there naturally exists a process of backward data propagation.

When it comes to backpropagation, we often mention terms like gradient descent and the chain rule. If you are meeting these words for the first time, searching for their definitions can easily leave you confused. In reality, these concepts are not very complicated; it's just that the order in which you first encounter them can make them seem harder than they are.

So in the following sections, I will revisit the derivatives and partial derivatives you studied in your calculus course, so that you have the background needed for backpropagation and can understand its principles more easily.

Derivatives #

The derivative of a function at a point is simply the value of its derivative function at that point.

Do you remember the concept of slope that we learned in high school mathematics? For example, for a function \(F=2x^2\), its derivative \(F’=4x\). In fact, slope is a special case of a derivative.

The more general case is also easy to derive. Let's take \(F=3x\) as an example. When \(x=3\), the value of the function is \(3x=3*3=9\). Now give \(x\) a very small increment \(Δx\). Then \(F(x+Δx)=3(x+Δx)\), which means the value of the function also receives a very small increment, which we denote \(Δy\); here \(Δy=F(x+Δx)-F(x)=3Δx\).

Now let \(Δx\) approach 0. If the ratio \(Δy/Δx\) of the increment of the function value to the increment of the variable has a limit \(a\), we call \(a\) the derivative of the function \(F(x)\) at \(x\). In our example, \(Δy/Δx=3Δx/Δx=3\), so the derivative of \(F=3x\) is 3 everywhere.

It should be noted that \(Δx\) must approach 0, and the limit \(a\) must exist. However, in this lesson, the definition of limits and how to determine them are not the core content. If you are interested, you can search for related materials on your own.

The formula below should make this clearer; as mentioned, the slope from high-school mathematics is just a special case of the derivative. The derivative is usually written using the following notation:

\[ f^{\prime}\left(x_{0}\right)=\lim_{\Delta x \rightarrow 0} \frac{\Delta y}{\Delta x}=\lim_{\Delta x \rightarrow 0} \frac{f\left(x_{0}+\Delta x\right)-f\left(x_{0}\right)}{\Delta x} \]

Here, \(\lim\) denotes the limit. The derivative of a function \(y\) with respect to \(x\) can also be written as \(\frac{dy}{dx}\).
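
As a quick numerical check of this definition (a small sketch of my own, not part of the original lesson), we can plug in a tiny \(Δx\) and compare the ratio against the known derivatives of the examples above:

```python
def numerical_derivative(f, x, dx=1e-6):
    # Approximate f'(x) with the ratio of increments from the limit definition.
    return (f(x + dx) - f(x)) / dx

# F = 3x has derivative 3 everywhere; F = 2x^2 has derivative 4x.
print(numerical_derivative(lambda x: 3 * x, 3.0))       # ~3.0
print(numerical_derivative(lambda x: 2 * x ** 2, 3.0))  # ~12.0, i.e. 4 * 3
```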

Partial Derivatives #

Keen observers might have a question when they reach this point. Some functions have more than one variable, like \(z=3x+2y\). In this function, both \(x\) and \(y\) exist as variables. So, how do we calculate their derivatives?

Don’t worry, this is where partial derivatives come into play. Partial derivatives are the derivatives obtained by keeping one variable constant while differentiating with respect to another variable.

Following the principles we mentioned earlier, let’s assume we have a function \(z=f(x,y)\). When we want to calculate the derivative in the direction of \(x\), we can give \(x\) a very small increment \(Δx\) while keeping \(y\) constant. Conversely, if we want to calculate the derivative in the direction of \(y\), we need to give \(y\) a very small increment \(Δy\) while keeping \(x\) constant. Therefore, we can derive the following formula for partial derivatives:

\[ \frac{\partial}{\partial x_{j}} f\left(x_{0}, x_{1}, \ldots, x_{n}\right)=\lim_{\Delta x \rightarrow 0} \frac{\Delta y}{\Delta x}=\lim_{\Delta x \rightarrow 0} \frac{f\left(x_{0}, \ldots, x_{j}+\Delta x, \ldots, x_{n}\right)-f\left(x_{0}, \ldots, x_{j}, \ldots, x_{n}\right)}{\Delta x} \]

Although the formula above may look complex, if you examine it closely you will notice that only \(x_{j}\), the variable in dimension \(j\), receives the small increment \(\Delta x\); all the other variables are held constant.

Let's look at a specific example to deepen our understanding. For the function \(z=x^{2}+y^{2}\), \(\frac{\partial z}{\partial x}=2x\) is the partial derivative of \(z\) with respect to \(x\) (treating \(y\) as a constant), while \(\frac{\partial z}{\partial y}=2y\) is the partial derivative of \(z\) with respect to \(y\) (treating \(x\) as a constant).
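
If you want to verify these partial derivatives numerically, here is a small sketch using PyTorch's autograd (my own example; the evaluation point \(x=1, y=2\) is an arbitrary choice):

```python
import torch

# z = x^2 + y^2, evaluated at x = 1, y = 2.
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = x ** 2 + y ** 2

z.backward()   # compute all partial derivatives of z
print(x.grad)  # dz/dx = 2x = 2.0
print(y.grad)  # dz/dy = 2y = 4.0
```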

Gradient #

Once we understand the concepts of derivatives and partial derivatives, the concept of gradients becomes very easy to grasp. The vector composed of all the partial derivatives of a function is called the gradient. Isn’t it simple? We generally use \(\nabla f\) to represent the gradient of a function. Its mathematical expression is:

\[ \nabla f(x)=\left[\frac{\partial f}{\partial x_{1}}, \frac{\partial f}{\partial x_{2}}, \ldots, \frac{\partial f}{\partial x_{n}}\right] \]

Regarding the gradient, you must remember this conclusion: The direction of the gradient vector is the direction of the fastest increase in the function value.

This conclusion is very important and runs through the entire process of deep learning. For a model to learn, it should improve in the fastest, most effective way possible, and that is exactly what gradients give us. The proof involves quite a bit of mathematics, so for now you only need to remember the conclusion itself.
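
That said, you don't have to take it purely on faith. Here is a tiny numerical sketch (my own illustration, not part of the lesson): for \(f(x, y)=x^{2}+y^{2}\), we start at the point \((1, 2)\) and take equally sized small steps in several directions to see which one increases \(f\) the most.

```python
import math

def f(x, y):
    return x ** 2 + y ** 2

# The gradient of f at (1, 2) is (2x, 2y) = (2, 4).
x0, y0 = 1.0, 2.0
step = 0.01

# Compare equally sized steps along the gradient and along two other directions.
directions = {"gradient": (2.0, 4.0), "x-axis": (1.0, 0.0), "diagonal": (1.0, 1.0)}
for name, (dx, dy) in directions.items():
    norm = math.hypot(dx, dy)  # normalize so every step has the same length
    increase = f(x0 + step * dx / norm, y0 + step * dy / norm) - f(x0, y0)
    print(name, increase)

# The step along the gradient gives the largest increase in f; stepping in the
# exact opposite direction would give the largest decrease.
```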

Chain Rule #

The entire learning process of deep learning is actually a process of updating the weights between network nodes. These weights are the connections between the nodes that we saw in the diagram of the feedforward network, and we generally use \(w\) to represent them.

Recall from the previous lesson that the model learns by continuously reducing the value of the loss function. To minimize the loss function, we usually use gradient descent, which means that every time we update the weights of the model, we do so in the opposite direction of the gradient.

Why? Because the direction of the gradient vector is the direction of the fastest increase in the function value, and the opposite direction is the direction of the fastest decrease.

The content of the above paragraph is very, very important. To ensure that you understand it, let me repeat it in a different way: The model learns by continuously reducing the value of the loss function in the opposite direction of the gradient using gradient descent.
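
As a minimal sketch of this idea (a toy example of my own, with a made-up one-parameter "loss" \(H(w)=(w-3)^{2}\) and an arbitrary learning rate), repeatedly stepping against the gradient looks like this:

```python
# Toy loss H(w) = (w - 3)^2, whose gradient is dH/dw = 2(w - 3).
def grad_H(w):
    return 2 * (w - 3)

w = 0.0                # initial weight
learning_rate = 0.1

for _ in range(50):
    w = w - learning_rate * grad_H(w)  # update in the opposite direction of the gradient

print(w)  # close to 3, where the loss H(w) is smallest
```

Each update moves \(w\) a little closer to the value that minimizes the loss, which is exactly what gradient descent does to every weight in a real network.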

Okay, let's take a closer look at a formula to deepen our understanding. Suppose we write the loss function as \(H\left(w_{11}, w_{12}, \ldots, w_{ij}, \ldots, w_{mn}\right)\).

Here, \(w_{ij}\) represents the weight value corresponding to the \(j\)th node in the \(i\)th layer. The gradient vector \(\nabla H\) of the loss function is:

\[ \left[\frac{\partial H}{\partial w_{11}}, \quad \frac{\partial H}{\partial w_{12}}, \ldots, \quad \frac{\partial H}{\partial w_{ij}}, \ldots, \quad \frac{\partial H}{\partial w_{mn}}\right] \]

By now, do you notice a problem? Yes, this formula is complicated enough to give you a headache. Take the first term, for example: it is not at all obvious how \(w_{11}\) is related to \(H\), since there are so many layers in between.

This is where the chain rule comes into play: “The derivative of a composite function is equal to the derivative of the outer function evaluated at the inner function’s value times the derivative of the inner function.” This rule includes two forms:

\[ \frac{dy}{dx}=f^{\prime}\left(g(x)\right) g^{\prime}(x) \]

\[ \frac{dy}{dx}=\frac{dy}{du} \cdot \frac{du}{dx} \]

You may still feel confused at this point, but don’t worry. Let me explain it further with a more specific example, and then you will know how to calculate it.

Suppose we have the composite function \(y=\cos\left(x^{2}-1\right)\). We can decompose it as:

\[ \text{1. outer function: } f(u)=\cos (u) \]

\[ \text{2. inner function: } g(x)=x^{2}-1 \]

The derivative of the inner function is \(g^{\prime}(x)=2x\), and the derivative of the outer function is \(f^{\prime}(u)=-\sin (u)\). Therefore, by the chain rule, \(y^{\prime}=f^{\prime}\left(g(x)\right) g^{\prime}(x)=-\sin \left(x^{2}-1\right) \cdot 2x\): we simply multiply the two individual derivatives together.
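
As a quick sanity check (a sketch of my own, not part of the original lesson), we can let PyTorch's autograd apply the chain rule for us and compare its result with the hand-computed derivative; the evaluation point \(x=2\) is arbitrary:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.cos(x ** 2 - 1)    # the composite function y = cos(x^2 - 1)
y.backward()                 # autograd applies the chain rule internally

# Hand-computed derivative: y' = -sin(x^2 - 1) * 2x
manual = -torch.sin(x.detach() ** 2 - 1) * 2 * x.detach()
print(x.grad, manual)        # the two values agree
```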

At this point, you may start to have some understanding. You need to carefully read and understand this section, combining the formulas and the example I provided. I believe you will be able to master it.

Backpropagation #

After understanding derivatives, partial derivatives, gradients, and the chain rule in the previous sections, we are ready to delve into the study of backpropagation. You will find that the efforts we put in earlier were not in vain.

Backpropagation is currently the most commonly used and effective algorithm for training neural networks. The model continuously updates its parameters through backpropagation in order to “learn” knowledge.

The main principles of backpropagation are as follows:

  • Forward propagation: Data flows from the input layer through the hidden layers and finally produces an output. This is the same process we saw in the feedforward network.
  • Compute and propagate the error: Compute the error between the model's output and the true result, then propagate this error backward, layer by layer, from the output layer through the hidden layers toward the input layer.
  • Iteration: During backpropagation, keep adjusting the model's parameter values based on the error, and repeat the previous two steps until the training process meets its termination condition.

Among these steps, two components are key: how the error is propagated backward, and how the model's parameter values are continuously adjusted based on that error.

These two components together are what we call optimization methods, and the most widely used one is gradient descent. This is exactly where the knowledge of derivatives, gradients, and the chain rule comes in. We will examine gradient descent in detail in the next lesson.

The mathematical derivation and proof process of backpropagation are very complex. In the actual development process, deep learning frameworks such as PyTorch and TensorFlow have encapsulated the backpropagation process. Therefore, we don’t have to manually implement this process. However, as a deep learning developer, it is still important to have a deep understanding of how this process works, in order to grasp how models learn in deep learning.
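
To make the three steps above concrete, here is a minimal training-loop sketch in PyTorch (the network, data, and hyperparameters are all placeholders I chose for illustration); notice how a single call to `loss.backward()` hides the entire chain-rule computation:

```python
import torch
import torch.nn as nn

# Toy setup: a small feedforward network and some random "training data".
model = nn.Sequential(nn.Linear(5, 4), nn.ReLU(), nn.Linear(4, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(16, 5)
targets = torch.randn(16, 1)

for step in range(100):
    outputs = model(inputs)            # 1. forward propagation
    loss = loss_fn(outputs, targets)   # 2. compute the error (loss)
    optimizer.zero_grad()
    loss.backward()                    # backpropagate the error through the layers
    optimizer.step()                   # 3. adjust parameters, then iterate
```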

Summary #

In this class, we learned about the simplest type of neural network, the feedforward network.

Although the feedforward network is simple, it runs through the entire process of deep learning and is a very important concept. We also learned about derivatives, gradients, and the chain rule. These are the key pieces of knowledge behind how a model learns through backpropagation, and they are core content of deep learning, so make sure you master them firmly.

Finally, we had a preliminary understanding of the process and concept of backpropagation, which laid the foundation for our formal study of how to compute backpropagation.

In today’s content, I simplified the relevant mathematical knowledge as much as possible while retaining the most essential content. However, in the research of deep learning, there are many mathematical knowledge points involved. If you are interested, you can consult more related materials after class to continuously improve.

In the next class, I will teach you about optimization functions. After learning about optimization functions, we can formally start the process of computing backpropagation.

Practice for Each Lesson #

Is deep learning always based on backpropagation?

Feel free to interact with me in the comments section, and I also encourage you to share today’s content with more colleagues and friends.

I am Fang Yuan, and I’ll see you in the next lesson!