09 Convolution: How to Use Convolution to Give Computers Vision #

Hello, I’m Fang Yuan.

Nowadays, paying by face is becoming more and more common, and I believe you are familiar with face recognition. Before asking how a computer recognizes a face, have you ever wondered how humans determine who a person is?

When our eyes see a face, we first extract some coarse-grained features, such as the contour of the face and the color and length of the hair. These pieces of information are then passed through certain neurons layer by layer, with each layer of neurons acting as a feature extractor. Our brain eventually summarizes the final features into a specific face and matches this face against a name stored in some part of the brain.

So how does this carry over to computers? In fact, the process is the same, and feature extraction in computers cannot be separated from what we are going to talk about today: convolution.

It can be said that without convolution, deep learning would not have achieved its current success in the field of image processing. So, let’s take a look at what convolution is and how it is implemented in PyTorch.

Convolution #

Before convolution came into use, researchers attempted to solve image-processing problems with ordinary artificial neural networks. However, the large number of parameters in these networks made them difficult to train, leading to a stagnation in computer vision research.

The emergence of convolutional neural networks (CNNs) changed the game with two remarkable characteristics: sparse connectivity and translation invariance. These features allowed significant progress in computer vision research. So, what are these two characteristics? To put it simply, sparse connectivity reduces the number of parameters to be learned, while translation invariance means that the network does not care about the position of objects in the image.

Sparse connectivity and translation invariance are two important characteristics of convolution. If you want to work in the field of computer vision, you must understand these two characteristics. However, they are not the focus of this column, so I won’t go into detail here. If you are interested, you can explore them on your own.

Now let’s take a look at how convolution is computed.

The Simplest Case #

Let’s start with the simplest case where the input is a 4x4 feature map and the size of the convolutional kernel is 2x2.

What is a convolutional kernel? It is simply the parameter that the convolutional layer learns, as shown in the example of the red kernel in the following image. In this example, the convolutional kernel has only one channel.

Image

To compute the output feature map, we perform element-wise multiplication of the kernel with the input feature map, followed by summing the results. The result is a single element of the output feature map. The following image shows the computation of the first element of the output feature map:

Image

After computing the first element, we continue to slide the kernel from left to right and top to bottom in a sequential manner. The following image shows the computation of the next element, where the kernel is shifted one unit to the right:

Image

This process continues for the remaining elements in the first row. Once we have finished computing all the elements in the first row, we move the kernel back to the beginning of the row and slide it down one unit. We then repeat the left-to-right sliding computation.
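
To make the sliding computation concrete, here is a minimal sketch of this single-channel case in plain PyTorch, using the same 4x4 input and 2x2 kernel values as the verification exercise later in this lesson (the function naive_conv2d is just for illustration, not the library implementation):

import torch

def naive_conv2d(inp, kernel):
    # Single-channel convolution, stride 1, no padding (a sketch)
    h, w = inp.shape
    k = kernel.shape[0]
    out = torch.zeros(h - k + 1, w - k + 1)
    for i in range(h - k + 1):          # slide top to bottom
        for j in range(w - k + 1):      # slide left to right
            # element-wise multiply the covered window by the kernel, then sum
            out[i, j] = (inp[i:i + k, j:j + k] * kernel).sum()
    return out

inp = torch.tensor([[4., 1., 7., 5.],
                    [4., 4., 2., 5.],
                    [7., 7., 2., 4.],
                    [1., 0., 2., 4.]])
kernel = torch.tensor([[1., 0.],
                       [2., 1.]])
print(naive_conv2d(inp, kernel))   # a 3x3 output feature map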

Let me introduce another concept, stride.

Stride refers to the length of the sliding movement in convolution. In the example above, the stride is 1. Depending on the specific problem, different stride values can be chosen, but typically stride is set to 1 or 2. Both the simple convolution we just discussed and the standard convolution we will discuss later require this parameter.
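
For reference, with input height h, kernel size k, stride s, and p layers of zero-padding (padding is introduced later in this lesson), the output height follows the standard sizing formula:

\[h^{\prime} = \left\lfloor \frac{h + 2p - k}{s} \right\rfloor + 1\]

In the example above, h=4, k=2, s=1, and p=0, giving \(h^{\prime}=3\); the width is computed the same way.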

Standard Convolution #

So far, we have only looked at the simplest case. Now let’s extend the computation to the standard convolution.

Let’s describe the previous example in a more general way. The input feature map has m channels, width w, and height h. The output feature map has n channels, width \(w^{\prime}\), and height \(h^{\prime}\). The size of the convolutional kernel is kxk.

In the previous example, the values of m, n, k, w, h, \(w^{\prime}\), and \(h^{\prime}\) were 1, 1, 2, 4, 4, 3, and 3, respectively. Now, let’s consider the case where the input is a feature map of shape (m, h, w) and we want to compute an output feature map of shape (n, \(h^{\prime}\), \(w^{\prime}\)).

Now, let’s take a look at what this operation would look like. The number of channels in the output feature map is determined by the number of convolutional kernels, which is n. According to the definition of convolution, each convolutional kernel requires m channels in order to perform the computation. So, we need n convolutional kernels, each with a size of (m, k, k).
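
To make the kernel shapes concrete, here is a quick check in PyTorch (the channel counts 3 and 16 and the kernel size 5 are arbitrary, for illustration only):

import torch.nn as nn

conv = nn.Conv2d(3, 16, 5)    # m = 3 input channels, n = 16 kernels, k = 5
print(conv.weight.shape)      # torch.Size([16, 3, 5, 5]) -- n kernels, each of size (m, k, k)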

To help you better understand this, I have created a diagram for reference:

Image

As shown in the diagram above, convolutional kernel 1 is convolved with all channels of the input feature map, producing the first channel of the output feature map. Convolutional kernel 2 does the same, producing the second channel. This procedure continues until all n channels of the output feature map are computed.

In the previous example, there was only one channel in the input. Now that we have multiple channels, how do we perform the computation? The process is similar. Each channel of the input feature maps is convolved with the corresponding channel of the convolutional kernel using the same method as before. This results in m feature maps, which are then summed element-wise to obtain a single channel of the output feature map. We can use the following formula to express how each convolutional kernel is computed with multiple input channels.

\[Output_i = \sum_{k=1}^{m} kernel_k \star input_k + bias_i, \qquad i = 1, 2, \dots, n\]

\(Output_i\) represents the i-th channel of the output feature map, where i ranges from 1 to n.

\(kernel_k\) represents the k-th channel of one convolutional kernel.

\(input_k\) represents the k-th channel of the input feature map.

\(bias_i\) denotes the bias term, which is typically added during training.

\(\star\) denotes the convolution operation.

Let me explain why the bias term is added. It is similar to the intercept in a regression equation. Without the bias term, the regression equation y = wx always passes through the origin no matter how w varies. Adding the bias term gives y = wx + b, which is no longer constrained to pass through the origin and can represent more diverse relationships.
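
To connect the formula with code, here is a minimal sketch (the channel counts are chosen arbitrarily) that rebuilds one output channel by summing the m per-channel convolutions plus the bias, then checks the result against nn.Conv2d:

import torch
import torch.nn as nn
import torch.nn.functional as F

m, n, k = 3, 2, 2                    # input channels, output channels, kernel size
inp = torch.randn(1, m, 4, 4)        # shape (batch, m, h, w)
conv = nn.Conv2d(m, n, k, stride=1, bias=True)

# Recompute output channel i by hand: sum the m per-channel convolutions, then add bias_i
i = 0
manual = sum(
    F.conv2d(inp[:, c:c + 1], conv.weight[i:i + 1, c:c + 1])
    for c in range(m)
) + conv.bias[i]

print(torch.allclose(manual, conv(inp)[:, i:i + 1], atol=1e-6))   # True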

Now, let’s move on to another important parameter related to convolutional computation in the convolutional layer.

Padding #

Let’s go back to the example we started with. We can observe that the input size is 4x4 and the output size is 3x3. Have you noticed that the output feature map has become smaller? Yes, in neural networks with multiple convolutional layers, the feature maps tend to get smaller.

However, sometimes we want the feature maps to remain relatively larger, so we can perform zero-padding on the feature maps. There are two main purposes of doing this:

  1. Sometimes we want the input and output feature maps to have the same size.
  2. Padding allows us to preserve more information in the input features.

Let me give you an example to illustrate when we might want the feature maps to be relatively larger.

As we learned earlier, if we don’t perform padding and the stride is 1, the feature maps will gradually become smaller as we add more convolutional layers. If we want to have more convolutional layers to extract richer information, we can slow down the speed at which the feature maps become smaller by using padding.

This operation of adding zeros is called padding. When padding is set to 1, it means adding an extra layer of zeros. When it’s set to 2, it means adding two layers of zeros. The following image illustrates this:

Image
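
If you want to see those zeros directly, here is a minimal sketch using torch.nn.functional.pad (shown here independently of the convolution itself; the 2x2 all-ones input is arbitrary):

import torch
import torch.nn.functional as F

x = torch.ones(1, 1, 2, 2)
# The pad order is (left, right, top, bottom); all ones adds one layer of zeros around the map
print(F.pad(x, (1, 1, 1, 1)))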

In PyTorch, the padding parameter can be a string, an integer, or a tuple.

Let’s take a look at how different types of parameters are used: When using a string, the options are either ‘valid’ or ‘same’. When given as an integer, it specifies how many layers of zeros are added around the feature map. If it’s a tuple, it specifies the number of zeros to add in rows and columns of the feature map.

Let’s focus on the string form. I think it’s more commonly used compared to specifying the number of zeros directly. Among the two options, ‘valid’ means no padding operation, just like in the example at the beginning. ‘same’ ensures that the output feature map has the same size as the input feature map.
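
The two string options are easy to check with a quick sketch (string values for padding require PyTorch 1.9 or later; the input values are random):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)
valid = nn.Conv2d(1, 1, 2, stride=1, padding='valid')
same = nn.Conv2d(1, 1, 2, stride=1, padding='same')
print(valid(x).shape)   # torch.Size([1, 1, 3, 3]) -- no padding, the map shrinks
print(same(x).shape)    # torch.Size([1, 1, 4, 4]) -- output size matches the input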

So, how is the padding calculation done when ‘same’ is used? Let’s continue using the example we started with and assume we are applying ‘same’ padding now.

Image

As we slide the kernel to the rightmost position, we notice that the output feature map’s width is different from the input feature map’s width. In this case, the padding mechanism automatically adds zero-padding until the output feature map’s width matches the input feature map’s width. The following image illustrates this:

Image

The calculation for the height is similar to the calculation for the width. When we reach the bottom of the feature map, if the output feature map’s height is different from the input feature map’s height, padding is added until the two match, as shown in the following image.

Image

By performing these operations, we obtain an output feature map with the same height and width as the input feature map. We have covered the theoretical explanation, but it’s important to apply this knowledge in practice. In the exercises below, we will investigate whether the calculation matches what we discussed when padding is set to ‘same’.

Convolution in PyTorch #

The convolution operation is defined in the torch.nn module, which provides us with many basic layers and methods for building networks.

In the torch.nn module, there are three classes related to the convolution operation we are discussing today: nn.Conv1d, nn.Conv2d, and nn.Conv3d.

Please note that the examples we mentioned above are based on nn.Conv2d. nn.Conv2d is the most commonly used class, while nn.Conv1d and nn.Conv3d are rarely used and only differ in the dimension of the input feature map.
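
For a sense of how the three classes differ, here is a minimal sketch of their expected input shapes (all values random, kernel size 2 throughout):

import torch
import torch.nn as nn

# Conv1d expects (batch, channels, length), e.g. audio or text sequences
print(nn.Conv1d(1, 1, 2)(torch.randn(1, 1, 8)).shape)       # torch.Size([1, 1, 7])
# Conv2d expects (batch, channels, height, width), e.g. images
print(nn.Conv2d(1, 1, 2)(torch.randn(1, 1, 8, 8)).shape)    # torch.Size([1, 1, 7, 7])
# Conv3d expects (batch, channels, depth, height, width), e.g. video or volumes
print(nn.Conv3d(1, 1, 2)(torch.randn(1, 1, 8, 8, 8)).shape) # torch.Size([1, 1, 7, 7, 7])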

Let’s first take a look at the necessary parameters for creating an nn.Conv2d:

# Conv2d class
class torch.nn.Conv2d(in_channels,
                      out_channels,
                      kernel_size,
                      stride=1,
                      padding=0,
                      dilation=1,
                      groups=1,
                      bias=True,
                      padding_mode='zeros',
                      device=None,
                      dtype=None)

Let’s go through these parameters one by one. First, there are two parameters related to the channels: in_channels refers to the number of input feature map channels and has an int data type. In the explanation of standard convolution, in_channels is denoted as m. out_channels refers to the number of output feature map channels and has an int data type. In the explanation of standard convolution, out_channels is denoted as n.

kernel_size is the size of the convolution kernel and has an int or tuple data type. It is important to note that only the height and width of the convolution kernel need to be specified. In the explanation of standard convolution, kernel_size is denoted as k.

stride is the stride for sliding, and it has an int or tuple data type. The default value is 1, and the examples shown above use a stride of 1.

padding is used for zero-padding. Note that padding='same' only supports a stride of 1, while padding='valid' simply means no padding and works with any stride.

kernel_size, stride, and padding can all be tuples. When they are tuples, the first element carries the height information and the second element carries the width information, as the sketch below shows.
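
For instance, a rectangular kernel with different per-axis strides and padding might look like this (the shapes are chosen only to illustrate the (height, width) ordering):

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=(3, 5), stride=(2, 1), padding=(1, 2))
print(conv.weight.shape)                       # torch.Size([8, 3, 3, 5])
print(conv(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 8, 16, 32])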

bias indicates whether to use a bias term.

There are two additional parameters: dilation and groups. We will explain the details of these parameters in the next lesson, so just keep an impression for now.

Verifying the ‘same’ Padding #

Next, let’s do an exercise to verify whether the calculation of ‘same’ padding is as we described. The process is not complicated and consists of three steps: creating an input feature map, setting up the convolution, and obtaining the output result.

First, let’s create the (1, 4, 4) input feature map from the example:

import torch
import torch.nn as nn

input_feat = torch.tensor([[4, 1, 7, 5], [4, 4, 2, 5], [7, 7, 2, 4], [1, 0, 2, 4]], dtype=torch.float32)
print(input_feat)
print(input_feat.shape)

# Output:
tensor([[4., 1., 7., 5.],
        [4., 4., 2., 5.],
        [7., 7., 2., 4.],
        [1., 0., 2., 4.]])
torch.Size([4, 4])

In the second step, let’s create a 2x2 convolution. Based on the previous explanation, the input channel number is 1, the output channel number is 1, and the padding is ‘same’. Therefore, the convolution is defined as:

conv2d = nn.Conv2d(1, 1, (2, 2), stride=1, padding='same', bias=True)
# By default, the parameters are randomly initialized
print(conv2d.weight)
print(conv2d.bias)
# Output:
Parameter containing:
tensor([[[[ 0.3235, -0.1593],
          [ 0.2548, -0.1363]]]], requires_grad=True)
Parameter containing:
tensor([0.4890], requires_grad=True)

Note that the parameters are randomly initialized by default. In general, we do not manually intervene in the initialization of the convolution kernel, but to verify today’s example, we will set the kernel ourselves. Please pay attention to the comments on the convolution kernel in the code below:

conv2d = nn.Conv2d(1, 1, (2, 2), stride=1, padding='same', bias=False)
# The weight of Conv2d has four dimensions: (output channels, input channels, height, width)
kernels = torch.tensor([[[[1, 0], [2, 1]]]], dtype=torch.float32)
conv2d.weight = nn.Parameter(kernels, requires_grad=False)
print(conv2d.weight)
print(conv2d.bias)
# Output:
Parameter containing:
tensor([[[[1., 0.],
          [2., 1.]]]])
None

After completing this step, we move on to the third one. We now have the input data and the convolution kernel from the example, so we just need to run the computation and print the result. The code is as follows:

output = conv2d(input_feat)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/.../ in <module>
----> 1 output = conv2d(input_feat)
/.../torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []
/.../torch/nn/modules/conv.py in forward(self, input)
    441 
    442     def forward(self, input: Tensor) -> Tensor:
--> 443         return self._conv_forward(input, self.weight, self.bias)
    444 
    445 class Conv3d(_ConvNd):
/.../torch/nn/modules/conv.py in _conv_forward(self, input, weight, bias)
    437                             weight, bias, self.stride,
    438                             _pair(0), self.dilation, self.groups)
--> 439         return F.conv2d(input, weight, bias, self.stride,
    440                         self.padding, self.dilation, self.groups)
    441 
RuntimeError: Expected 4-dimensional input for 4-dimensional weight[1, 1, 2, 2], but got 2-dimensional input of size [4, 4] instead

Looking at the code above, you will notice that an error has occurred: the message states that the input feature map needs to be 4-dimensional, but our input is a 2-dimensional feature map of size 4x4. Why is that? Remember that in PyTorch, the input tensor has the dimension layout (batch_size, number of channels, height, width). In our example, only the height and width were given; the batch size and the number of channels were missing. (During training, not all data is loaded at once; it is read in batches, and batch_size is the number of samples read each time.) So, we need to go back to the first step and change the input tensor to the shape (1, 1, 4, 4).

Do you remember how to add dimensions to an array?

In PyTorch, unsqueeze() adds a dimension of size 1 to a tensor. The code is as follows:

input_feat = torch.tensor([[4, 1, 7, 5], [4, 4, 2, 5], [7, 7, 2, 4], [1, 0, 2, 4]], dtype=torch.float32).unsqueeze(0).unsqueeze(0)
print(input_feat)
print(input_feat.shape)

# Output:
tensor([[[[4., 1., 7., 5.],
          [4., 4., 2., 5.],
          [7., 7., 2., 4.],
          [1., 0., 2., 4.]]]])
torch.Size([1, 1, 4, 4])


Here, the argument to unsqueeze() determines where the new dimension is added. Alright, after making the modifications, let's execute the code again.

output = conv2d(input_feat)
print(output)

# Output:
tensor([[[[16., 11., 16., 15.],
          [25., 20., 10., 13.],
          [ 9.,  9., 10., 12.],
          [ 1.,  0.,  2.,  4.]]]])


You can check that this matches the result we derived in the example.

Summary #

Congratulations on completing today's lesson. Convolution is very important: it serves as the foundation for various computer vision applications, such as image classification, object detection, and image segmentation.

The calculation method of convolution is the key point that you need to pay attention to. The specific process is shown in the figure below. The number of channels in the output feature map is determined by the number of convolution kernels. In the figure below, there are n convolution kernels, so the number of channels in the output feature map is n. The **input feature map has m channels, so each convolution kernel should also have m channels**.

![Convolution process](../images/47ce82502ded4b128c0b91c750ae2674.jpg)

In fact, the theory behind convolution is quite complicated, but it is easy to implement in PyTorch. The key parameters for convolution calculation are the number of input channels, number of output channels, stride, padding, and size of the convolution kernel, which correspond to the key parameters of nn.Conv2d in PyTorch. Therefore, as mentioned before, we need to become proficient in using nn.Conv2d().

Later, I guided you through a practice of using the "same" padding method. Running the code hands-on will help you form an intuitive impression and quickly grasp this part of the content.

Of course, beyond the standard convolution introduced today, there are various convolution variants. For example, the dilation and groups parameters that we skipped today are used to implement two such variants. I will explain them in the next lesson, organized around these two parameters. Stay tuned.

Practice for Each Lesson #

Please think about it: when the padding is set to 'same', can the stride be a value other than 1?

Feel free to leave your questions or discoveries in the comments section, and I recommend you share this lesson with more friends and colleagues.

I am Fang Yuan, and I'll see you in the next lesson!