10 Convolution How to Use Convolution to Give Computers Vision - Part 2 #

Hello, I am Fang Yuan.

After the previous lesson, I believe you already have some understanding of standard convolution calculations. Although standard convolution can handle most workloads, researchers have still proposed other convolution methods built on top of it. These variants play different roles when dealing with different problems. Here, I have listed some for you.

Image

In the previous lesson, we learned about the parameters in conv2d: in_channels, out_channels, kernel_size, stride, padding, and bias.

Besides those, conv2d in PyTorch has two remaining parameters, groups and dilation, which correspond to two different types of convolution: depthwise separable convolution and dilated convolution. Let’s take a look together.

Depthwise Separable Convolution #

Let’s first take a look at the depthwise separable convolution implemented based on the groups parameter.

With the continuous development of deep learning technology, many deep and wide network models have been proposed, such as VGG, ResNet, SENet, DenseNet, etc. These networks utilize their complex structures to extract useful information more accurately. At the same time, with the continuous enhancement of hardware computing power, these complex models can be deployed directly on the server side, achieving excellent results in practical projects.

However, these models have a common problem: they are slow in speed and have a large number of parameters. These two problems make it impossible to directly deploy these models on mobile devices. Mobile applications are undoubtedly the hottest market today, and these deep and wide complex network models are not applicable in such scenarios.

Therefore, many researchers have turned their attention to seeking more lightweight models. These lightweight models need to be fast, have a small size, and allow for a slight decrease in accuracy compared to server-side models.

Depthwise separable convolution is a lightweight convolution proposed by Google in MobileNet v1. In simple terms, depthwise separable convolution requires less computation while achieving similar effects. Next, let’s take a look at how depthwise separable convolution is computed, and then compare how much computation has been reduced.

Depthwise separable convolution is composed of two parts: depthwise (DW) convolution and pointwise (PW) convolution.

Let’s first review the standard convolution and then explain how depthwise separable convolution, which produces the same output feature map, works.

Do you remember this figure? This is the standard convolution calculation method we discussed in the previous lesson. It describes the process of obtaining n output feature maps of size \(h^{\prime} \times w^{\prime}\) from m input feature maps of size \(h \times w\) through convolution.

Let’s expand the process of computing one output feature map with a single convolution kernel. A convolution kernel has m channels; each of them is convolved with the corresponding channel of the input feature map, producing m intermediate results. These m intermediate results are then summed element-wise to obtain one of the n output feature maps.
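If you want to see this structure in code, the weight tensor of a standard convolution makes it explicit: with m input channels, n output channels, and a \(k \times k\) kernel, the weight shape is (n, m, k, k). The channel counts below (3 in, 8 out) are made up purely for illustration:

import torch.nn as nn

# Hypothetical sizes for illustration: m = 3 input channels, n = 8 output channels, k = 3
standard_conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)
print(standard_conv.weight.shape)
# Output:
# torch.Size([8, 3, 3, 3])  -> n kernels, each with m channels of size k x k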

Depthwise (DW) Convolution #

What is DW convolution? DW convolution uses m convolution kernels, each with a single channel. Each kernel is convolved with the corresponding channel of the input feature map, so the output of DW convolution also has m channels. Typically, DW convolution kernels are 3x3.

The process of DW convolution is shown in the following figure:

Pointwise (PW) Convolution #

Generally speaking, the goal of depthwise separable convolution is to make the computation of standard convolution lighter so that it can serve as a drop-in replacement. This also means that after the replacement, the output size of the original convolution should remain the same.

Therefore, in depthwise separable convolution, our ultimate goal is to obtain an output feature map with n channels. However, the DW convolution mentioned earlier clearly does not achieve this, and it also ignores the information between input feature map channels.

Therefore, after DW, we need to add a PW (pointwise) convolution. Its main job is to combine the m feature maps produced by DW and output a feature map with n channels.

In convolutional neural networks, we often see 1x1 convolutions, which are mainly used to increase or decrease the number of channels. The PW convolution applied to the output of DW is exactly that: a 1x1 convolution with n kernels, each of which has m channels.

To help you understand the process I just described, I will still use a graphical representation to describe it. You can refer to the figure below:

Through the combination of DW and PW, we obtain a lightweight convolution with the same output size as standard convolution. Since it is lightweight, let’s see how much computation depthwise separable convolution saves compared to standard convolution.

Computational Cost #

Our original problem has an input feature map with m channels, a convolution kernel size of \(k \times k\), and an output feature map of size \((n, h^{\prime}, w^{\prime})\). After understanding the computation of standard convolution, we can calculate the computational cost as follows:

\[k \times k \times m \times n \times h^{\prime} \times w^{\prime}\]

How did we get this result? You can think back from the output feature map.

In the above figure, each position in the output feature map is computed by applying the n convolution kernels to the input feature map; since each kernel costs \(k \times k \times m\) multiplications, one position costs \(k \times k \times m \times n\). How many positions are there in the output feature map? There are \(h^{\prime} \times w^{\prime}\) in total. Multiplying the two naturally gives the formula above.

If we use depthwise separable convolution, the computational cost of DW is \(k \times k \times m \times h^{\prime} \times w^{\prime}\), and the computational cost of PW is \(1 \times 1 \times m \times n \times h^{\prime} \times w^{\prime}\).

It is not difficult to conclude that the ratio of the computational cost of depthwise separable convolution to that of standard convolution is:

\[\frac{k \times k \times m \times h^{\prime} \times w^{\prime} + 1 \times 1 \times m \times n \times h^{\prime} \times w^{\prime}}{k \times k \times m \times n \times h^{\prime} \times w^{\prime}} = \frac{1}{n} + \frac{1}{k \times k}\]

Therefore, when the number of output channels n is reasonably large, the computational cost of depthwise separable convolution is approximately \(\frac{1}{k^{2}}\) times that of standard convolution.
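To make this concrete, here is a quick sanity check; the numbers (32 input channels, 64 output channels, a 3x3 kernel, and a 56x56 output feature map) are assumptions chosen only for illustration:

k, m, n, h, w = 3, 32, 64, 56, 56

# Cost of standard convolution
standard_cost = k * k * m * n * h * w
# Cost of depthwise separable convolution: DW part + PW part
separable_cost = k * k * m * h * w + 1 * 1 * m * n * h * w

print(separable_cost / standard_cost)  # about 0.1267
print(1 / n + 1 / (k * k))             # the same value: 1/64 + 1/9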

So how is depth-wise separable convolution implemented in PyTorch?

Implementation in PyTorch #

To implement depth-wise separable convolution in PyTorch, we need to implement two separate convolutions: depth-wise (DW) and point-wise (PW). Let’s start by looking at DW convolution. To implement DW convolution, we need to use the groups parameter in nn.Conv2d. The groups parameter controls the grouping of input and output feature maps.

When groups is equal to 1, it is the standard convolution we discussed in the previous lesson, and groups=1 is also the default value for nn.Conv2d.

When groups is not equal to 1, the input channels are divided into groups groups, each with its own set of convolution kernels. The convolution is performed within each group, and the outputs of all groups are concatenated to form the output feature map. It is important to note that when groups is not equal to 1, both in_channels and out_channels must be divisible by groups.

When groups is equal to in_channels, it is the DW convolution.
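To make the effect of groups concrete, here is a small sketch; the channel counts (4 in, 8 out) are made up for illustration. With groups=1 each kernel sees all 4 input channels, with groups=2 each kernel sees only 2 of them, and with groups=4 (equal to in_channels) each kernel sees exactly 1 channel, which is how DW convolution is built (DW additionally sets out_channels equal to in_channels):

import torch.nn as nn

# The weight shape of Conv2d is (out_channels, in_channels // groups, k, k)
print(nn.Conv2d(4, 8, 3, groups=1).weight.shape)  # torch.Size([8, 4, 3, 3])
print(nn.Conv2d(4, 8, 3, groups=2).weight.shape)  # torch.Size([8, 2, 3, 3])
print(nn.Conv2d(4, 8, 3, groups=4).weight.shape)  # torch.Size([8, 1, 3, 3])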

Now, let’s get hands-on and see how to implement a DW convolution. First, let’s generate a 5x5 input feature map with three channels, and then apply depth-wise separable convolution to output a feature map with four channels.

The implementation code for DW convolution is as follows:

import torch
import torch.nn as nn

# Generate a 3-channel 5x5 feature map
x = torch.rand((3, 5, 5)).unsqueeze(0)
print(x.shape)
# Output:
# torch.Size([1, 3, 5, 5])
# Note that in DW convolution, the number of input and output channels is the same
in_channels_dw = x.shape[1]
out_channels_dw = x.shape[1]
# Generally, the kernel size of DW convolution is 3
kernel_size = 3
stride = 1
# The 'groups' parameter in DW convolution is the same as the number of input channels
dw = nn.Conv2d(in_channels_dw, out_channels_dw, kernel_size, stride, groups=in_channels_dw)
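As a quick check, you can print the weight shape of dw. Because groups equals the number of input channels, each kernel has only one channel:

print(dw.weight.shape)
# Output:
# torch.Size([3, 1, 3, 3])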

Please note the following points:

  1. In DW convolution, the number of input and output channels is the same.
  2. Generally, the kernel size of DW convolution is 3x3.
  3. The ‘groups’ parameter in DW convolution is the same as the number of input (and output) channels.

Great, we have finished writing the implementation of DW convolution. Next is the implementation of PW convolution. In fact, the implementation of PW convolution is the same as the standard convolution we introduced in the previous lesson, except that the kernel size is 1x1. It is important to note that the ‘groups’ parameter in PW convolution is set to its default value.

The specific code is as follows:

in_channels_pw = out_channels_dw
out_channels_pw = 4
kernel_size_pw = 1
pw = nn.Conv2d(in_channels_pw, out_channels_pw, kernel_size_pw, stride)
out = pw(dw(x))
print(out.shape)
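# Output:
# torch.Size([1, 4, 3, 3])

Note that the spatial size shrinks from 5x5 to 3x3 because we did not add padding in the DW step. If you want to reuse this combination, a common pattern is to wrap DW and PW into a single module. The sketch below is just one way to write it, with padding=1 added as an assumption so that a 3x3 kernel keeps the spatial size unchanged:

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # DW: groups == in_channels and out_channels == in_channels
        self.dw = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                            stride=1, padding=1, groups=in_channels)
        # PW: 1x1 convolution, groups left at its default value of 1
        self.pw = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pw(self.dw(x))

conv = DepthwiseSeparableConv(3, 4)
print(conv(torch.rand(1, 3, 5, 5)).shape)
# Output:
# torch.Size([1, 4, 5, 5])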

Okay, we have finished the discussion on groups and depth-wise separable convolution. Next, let’s look at the final parameter, dilation, which is used to implement dilated convolution.

Dilated Convolution #

Dilated convolution is often used in image segmentation tasks. The goal of image segmentation is to achieve pixel-wise output, which means that the model needs to make predictions for each pixel in the image.

For an image segmentation model, multiple layers of convolution are usually used to extract features. As the number of layers increases, the receptive field also becomes larger. Here is a new term called “receptive field” which I will explain later. Let’s first talk about the purpose of dilated convolution.

However, there is a problem with image segmentation models. After multiple layers of convolution and pooling operations, the feature map becomes smaller. In order to have predictions for each pixel, we need to upsample or perform deconvolution on the smaller feature map to enlarge it to a certain scale and then make predictions.

It is worth noting that restoring a smaller feature map to a larger one obviously leads to information loss, especially for smaller objects, which is difficult to recover. So the question is, can we achieve a larger receptive field without reducing the size of the feature map?

You can probably already guess that dilated convolution is the key to solving this problem. Its biggest advantage is that it achieves a larger receptive field without reducing the size of the feature map.

Receptive Field #

Now let me explain what a receptive field is. Receptive field is a concept often seen in the field of computer vision.

In a convolutional neural network, repeated pooling (an operation that takes the maximum or average value within a region of the feature map and replaces the whole region with that value) and convolution operations make the feature maps in deeper layers smaller and smaller.

This means that a single value in a feature map at a given layer is computed from a certain region of the original image, and this region is called the receptive field. A larger receptive field carries more global and abstract semantic information, while a smaller receptive field carries more local and detailed information.

It may be difficult to understand from theory alone, so let’s look at an example to make it clearer. Please refer to the following image. The original image is 6x6, and the first convolutional layer uses a 3x3 kernel. The receptive field of its output is 3, because each value in the output feature map is computed from a 3x3 region of the original image.

Image

Now, let’s take a look at the second convolutional layer, which is also a 3x3 convolution, and the output is a 2x2 feature map. At this time, the receptive field of the second convolutional layer will become 5 (the blue and orange parts in the input feature map).

Image

With the help of these illustrations, I believe you can easily understand the meaning of receptive field.
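If you want to verify these numbers with a formula, the receptive field of stacked convolutions can be computed with the common recurrence \(r_{l} = r_{l-1} + (k_{l} - 1) \times \prod_{i<l} s_{i}\). The small sketch below assumes stride-1 3x3 convolutions, as in the figures above:

def receptive_field(layers):
    # layers: list of (kernel_size, stride) tuples, from the first layer to the last
    rf = 1    # receptive field of the input itself
    jump = 1  # product of the strides of all earlier layers
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Two stacked 3x3, stride-1 convolutions, as in the figures above
print(receptive_field([(3, 1)]))          # 3
print(receptive_field([(3, 1), (3, 1)]))  # 5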

Calculation Method #

Okay, let’s take a look at how dilated convolution is calculated.

Describing the calculation method of dilated convolution in words is a bit abstract, so let’s look at its animated demonstration (this GitHub repository contains various animated illustrations of convolution calculations, which are very intuitive, so let’s use them to learn about dilated convolution).

First, let’s see how standard convolution, which we talked about in the previous lesson, is calculated.

Image

In the image above, the blue map at the bottom is the input feature map, the sliding shaded region is the convolution kernel, and the green map on top is the output feature map.

Now let’s take a look at the illustrative image of dilated convolution.

Image

Combining these illustrations, we can see that the calculation method is the same as ordinary convolution, except that the kernel elements are spread apart at a certain interval. In implementation, this is equivalent to padding the convolution kernel with zeros between its weights.

This spacing is generally referred to as the dilation rate, and it corresponds to the dilation parameter in Conv2d.

The dilation parameter defaults to 1 and can be an int or a tuple. When it is a tuple, the first element is the dilation in the row (height) direction, and the second is the dilation in the column (width) direction.
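Here is a small sketch of dilated convolution in PyTorch; the input size and channel counts are made up for illustration. With a 3x3 kernel and dilation=2, the kernel covers a 5x5 region of the input, so the receptive field grows without enlarging the kernel itself:

import torch
import torch.nn as nn

x = torch.rand(1, 3, 7, 7)
# A 3x3 kernel with dilation=2 covers a 5x5 region of the input
dilated = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, dilation=2)
print(dilated(x).shape)
# Output:
# torch.Size([1, 8, 3, 3])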

Summary #

Congratulations on completing today’s lesson. Today, we learned two special convolutions, Depthwise Separable Convolution and Dilated Convolution, and saw how to implement them with PyTorch.

For Dilated Convolution, the most important concepts you need to grasp are the receptive field and the calculation method of dilated convolution. The receptive field refers to the region of the original image that a value in a feature map is computed from.

Depthwise Separable Convolution is mainly used in lightweight models, while Dilated Convolution is mainly used in image segmentation tasks.

Here is my experience to share with you: If you need to make your model lightweight, smaller, and faster, you can consider replacing convolution layers with Depthwise Separable Convolution. If you are working on an image segmentation project, you can consider replacing the later layers of the network with Dilated Convolution to see if it improves the performance.

Finally, let’s summarize the important parameters of convolution operations in PyTorch. I have summarized them in a table format for you, which you can use as a handy reference tool.

Image

Practice for Each Lesson #

Generate a random 3-channel feature map with a size of 128x128. Then create a depthwise separable convolution (using 3x3 kernels in the DW step and 10 kernels in the PW step) and use it to perform the convolution on the input data.

Feel free to leave a message in the comments section to interact with me. I also encourage you to share today’s content with more colleagues and friends.