
19 Image Segmentation Explaining the Principles and Models of Image Segmentation - Part 1 #

Hello, I am Fang Yuan.

In the previous two lessons, we finished learning about and practicing image classification. Today, let's explore another very important application of computer vision: image segmentation.

You must have used or heard of products like Tencent Meeting or Zoom, right? During meetings, we can choose to replace the background, as shown in the following image:

Image

Huawei phones also used to have a feature called “color retention for portrait photos.”

Image

The implementation behind these applications relies on image segmentation, which we will discuss today.

We will learn about the segmentation principles in this lesson. In the next lesson, we will apply these skills to practical scenarios and build an image segmentation model from scratch.

Image Segmentation #

Let’s start by understanding the concept of image segmentation from a comparative perspective. Image classification is the process of automatically categorizing a given image into a particular class, whereas image segmentation involves categorizing every pixel in the image.

Image segmentation can be divided into semantic segmentation and instance segmentation. The difference is that semantic segmentation only classifies each pixel, without distinguishing whether pixels of the same class belong to the same instance, whereas instance segmentation not only classifies each pixel but also determines which instance it belongs to.

As shown in the image below, the left side shows semantic segmentation and the right side shows instance segmentation. In these two lessons, we will focus on semantic segmentation.

Image

Principle of Semantic Segmentation #

The principle of semantic segmentation is actually similar to that of image classification, with two main differences. The first lies in the classification part (which I like to call the “classification end”: the part that performs classification after features have been extracted by the convolutional layers). The second lies in the network structure. Let's start with the first difference, the classification end.

Classification End #

Let’s first recall the principle of image classification. You can understand it by referring to the diagram below.

Image

After the input image goes through the convolutional layers to extract features, it will generate several feature maps. Then, a fully connected layer (the red circles in the above diagram) is connected to these feature maps. The number of nodes in the fully connected layer corresponds to the number of classes to which the image should be classified. We pass the output of the fully connected layer to the softmax function to obtain the probabilities for each category, and then we can determine which category the input image belongs to based on these probabilities.
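To make the comparison with segmentation concrete, here is a minimal sketch of such a classification end (the channel sizes and the class count are made up for illustration):

```python
import torch
import torch.nn as nn

# A toy classification end: convolutional features -> pooling -> fully connected layer.
num_classes = 3                                    # an arbitrary class count
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),    # feature extraction
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                       # collapse each feature map to one value
)
fc = nn.Linear(16, num_classes)                    # one output node per class

x = torch.randn(1, 3, 224, 224)                    # a dummy input image
features = backbone(x).flatten(1)                  # shape: (1, 16)
logits = fc(features)                              # shape: (1, num_classes)
probs = torch.softmax(logits, dim=1)               # probability for each category
print(probs.argmax(dim=1))                         # the predicted class
```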

In image segmentation, we also use convolutional layers to extract features and generate multiple feature maps. However, the number of feature maps produced at the end equals the number of segmentation classes. Let's illustrate with an example: suppose we want to segment an input image of a cat into two categories, the cat and the background, as shown in the figure below.

Image

In the final two feature maps, channel 1 represents the information of the cat, while channel 2 corresponds to the information of the background.

Now let's see how the category of each pixel is determined. Suppose that at position (0, 0) in channel 1 the output is 2, and at position (0, 0) in channel 2 the output is 30.

After applying softmax along the channel dimension to convert these scores into probabilities, the probability at position (0, 0) in channel 1 is approximately 0, while the probability at position (0, 0) in channel 2 is approximately 1. From these probabilities, we can determine that the pixel at position (0, 0) belongs to the background, not the cat.
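You can check this quickly in PyTorch (a small sketch; the 4x4 output below is made up just to show the per-pixel decision):

```python
import torch

# The scores at position (0, 0): channel 1 (cat) = 2, channel 2 (background) = 30.
scores = torch.tensor([2.0, 30.0])
print(torch.softmax(scores, dim=0))   # tensor([6.9144e-13, 1.0000e+00]) -> roughly [0, 1]

# For a full output of shape (num_classes, H, W), the class of each pixel is simply
# the channel with the highest score at that position.
output = torch.randn(2, 4, 4)         # 2 classes, a made-up 4x4 output
mask = output.argmax(dim=0)           # shape (4, 4); 0 means cat, 1 means background
print(mask)
```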

Network Structure #

In a segmentation network, the size of the final output feature maps either matches the size of the input image or is close to it.

The reason for this is that we want to make a prediction for each pixel in the original image. When the output feature maps have the same size as the input image, we can directly perform segmentation based on them. When the output feature maps have a different size from the input image, we need to resize the output feature maps to the size of the input image.

If we resize a relatively small feature map to a larger size, we will inevitably lose some information. Therefore, the size of the output feature maps should not be too small.
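One common way to do this resize is interpolation; here is a minimal sketch using F.interpolate (the shapes are made up for illustration):

```python
import torch
import torch.nn.functional as F

# A made-up 2-class output at a quarter of the input resolution.
small_output = torch.randn(1, 2, 56, 56)

# Resize it back to an assumed 224x224 input size with bilinear interpolation.
resized = F.interpolate(small_output, size=(224, 224), mode='bilinear', align_corners=False)
print(resized.shape)   # torch.Size([1, 2, 224, 224])
```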

This is also the second difference between image segmentation networks and image classification networks. In image classification, after multiple layers of feature extraction, the final feature maps are usually small. However, in image segmentation, the final feature maps are usually close to the size of the original image.

As mentioned earlier, feature extraction in image segmentation is also done through convolution, and as we know from our earlier study of feature extraction, the feature maps shrink as the convolutions go deeper. If we regard feature extraction as the “Encoder”, then image segmentation has an additional step, the “Decoder”.

The role of the Decoder is to restore the feature maps to a larger size. This enlargement is called upsampling, and the most commonly used upsampling operation is transpose convolution.
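As a rough illustration of the Encoder-Decoder idea (the layer choices below are arbitrary, not a specific published architecture), the encoder shrinks the feature maps with strided convolutions and the decoder grows them back with transpose convolutions:

```python
import torch
import torch.nn as nn

# A toy encoder-decoder; the channel counts are arbitrary.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # 64x64 -> 32x32
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
    nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2),    # 16x16 -> 32x32
    nn.ReLU(),
    nn.ConvTranspose2d(16, 2, kernel_size=2, stride=2),     # 32x32 -> 64x64, 2 classes
)

x = torch.randn(1, 3, 64, 64)
feat = encoder(x)
out = decoder(feat)
print(feat.shape, out.shape)   # torch.Size([1, 32, 16, 16]) torch.Size([1, 2, 64, 64])
```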

Transpose Convolution #

Next, let’s investigate the calculation principle of transpose convolution, which is the key content of this lesson.

Let’s look at the convolution calculation diagram below, with padding set to 0 and stride set to 1.

Image

From our previous learning, we know that convolution is a many-to-one operation: each output y is computed from 4 input values x. In fact, transpose convolution is the reverse of convolution only in a logical sense, not the mathematical inverse of convolution.

In other words, transpose convolution does not take the output Y in the diagram above and the kernel to recover the original input X. It can only restore a feature map with the same size as the original input feature map.

We denote the kernel used in the transpose convolution as k'. Each y is then mapped back to four positions through k', as shown below:

Image

The process of restoring the size is shown below; each restored block in the diagram corresponds to a position in the original 3x3 input.

You can see that some of the restored blocks overlap. The overlapping parts are added together, and the final restored feature map is as follows:

Image

Now rearrange the results in the diagram above slightly into the following form. No special processing is involved; we just add some zeros:

From these rearranged results, the same output can be obtained through an ordinary convolution:

Have you noticed something magical? The calculation of the transpose convolution has turned back into an ordinary convolution calculation.

Therefore, let’s summarize the calculation process of the transpose convolution:

  1. Pad the input feature map with zeros.
  2. Flip the transpose convolution's kernel vertically and horizontally to create a new kernel.
  3. Use the new kernel to perform an ordinary convolution on the result of step 1, with stride 1 and padding 0.

Let’s first look at the transpose convolution in PyTorch and its main parameters, and then explain how step 1 pads zeros based on the parameters.

class torch.nn.ConvTranspose2d(in_channels, 
                               out_channels, 
                               kernel_size, 
                               stride=1, 
                               padding=0,
                               groups=1,
                               bias=True,
                               dilation=1)

The parameters in_channels, out_channels, kernel_size, groups, bias, and dilation have the same meanings as in the convolution lessons (you can refer to Lessons 9 and 10 on convolution), so we won't repeat them here.

First, let's take a look at stride. Because transpose convolution is the reverse process of a convolution, the stride here refers to the stride of that corresponding convolution on the original image.

In the example we just walked through, the stride is 1. If the stride is 2, the same procedure applies and the operation can still be converted into an ordinary convolution, as shown below. From this we can also conclude that, in step 1 above, zero padding means inserting stride - 1 zeros between the rows and columns of the input feature map.

Image

Now let's look at the padding operation. Padding means adding dilation * (kernel_size - 1) - padding rings of zeros around the input feature map. The dilation parameter appears in this expression, but the dilation and groups parameters are rarely used in transpose convolution.

The above describes the zero-padding operation of transpose convolution. With both images and text, I believe you can understand it.

Based on the above explanation, we can derive the relationship between the output feature map size and the input feature map size:

$$h_{out} = (h_{in} - 1) \times \text{stride}[0] - 2 \times \text{padding}[0] + \text{kernel\_size}[0]$$

$$w_{out} = (w_{in} - 1) \times \text{stride}[1] - 2 \times \text{padding}[1] + \text{kernel\_size}[1]$$

(This assumes the default dilation of 1; note that padding is applied on both sides, hence the factor of 2.)
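Here is a quick check of this size formula against nn.ConvTranspose2d (the parameter combinations below are chosen arbitrarily):

```python
import torch
import torch.nn as nn

def expected_size(h_in, stride, padding, kernel_size):
    # The output size from the formula above (dilation = 1, no output_padding).
    return (h_in - 1) * stride - 2 * padding + kernel_size

x = torch.randn(1, 1, 2, 2)
for stride, padding, k in [(1, 0, 2), (2, 0, 2), (2, 1, 3)]:
    out = nn.ConvTranspose2d(1, 1, kernel_size=k, stride=stride, padding=padding)(x)
    print(out.shape[-1], expected_size(2, stride, padding, k))
# Prints: 3 3, then 4 4, then 3 3 -- the formula matches.
```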

Now, let’s use code to verify if our description of transpose convolution is correct.

We have the input feature map input_feat:

import torch
import torch.nn as nn
import numpy as np
input_feat = torch.tensor([[[[1, 2], [3, 4]]]], dtype=torch.float32)
input_feat
Output:
tensor([[[[1., 2.],
          [3., 4.]]]])

The kernel k:

kernels = torch.tensor([[[[1, 0], [1, 1]]]], dtype=torch.float32)
kernels
Output:
tensor([[[[1., 0.],
          [1., 1.]]]])

Using a stride of 1 and padding of 0, the transpose convolution is:

convTrans = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=1, padding=0, bias=False)
convTrans.weight = nn.Parameter(kernels)

According to our description, the first step is to perform zero-padding on the input feature map:

$$\text{input\_feat} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 1 & 2 & 0 \\ 0 & 3 & 4 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$

Then, flip the kernel vertically and horizontally to get $\begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}$, and perform an ordinary convolution with it:

$$\text{output} = \begin{bmatrix} 1 & 2 & 0 \\ 4 & 7 & 2 \\ 3 & 7 & 4 \end{bmatrix}$$

Let’s see the output from the code:

convTrans(input_feat)
Output:
tensor([[[[1., 2., 0.],
          [4., 7., 2.],
          [3., 7., 4.]]]], grad_fn=<SlowConvTranspose2DBackward>)

Does it match?
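We can also reproduce this result by following the three steps manually. Below is a minimal sketch using F.pad, torch.flip, and F.conv2d; it only covers this stride-1, padding-0 case:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[[1., 2.], [3., 4.]]]])   # the input feature map
k = torch.tensor([[[[1., 0.], [1., 1.]]]])   # the transpose convolution kernel

# Step 1: pad dilation * (kernel_size - 1) - padding = 1 ring of zeros around the input
# (with stride > 1 we would also insert stride - 1 zeros between elements).
x_pad = F.pad(x, (1, 1, 1, 1))

# Step 2: flip the kernel vertically and horizontally -> [[1, 1], [0, 1]].
k_flip = torch.flip(k, dims=(2, 3))

# Step 3: an ordinary convolution with stride 1 and padding 0.
out = F.conv2d(x_pad, k_flip, stride=1, padding=0)
print(out)   # tensor([[[[1., 2., 0.], [4., 7., 2.], [3., 7., 4.]]]])
```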

Loss Function #

After discussing the network structure, let’s move on to another topic in image segmentation: the loss function.

In image segmentation, we can still use the cross-entropy loss that is commonly used in image classification. In image classification, there is one prediction per image, and the loss is calculated from that prediction and the true label. In image segmentation, the true label is a 2D map that records the true class of every pixel. In segmentation, this map of pixel classes is generally called a mask.

Let’s explain with an example of a small cat image. When marking the cat in the following image, the marked region becomes the ground truth (GT) mask.

GT stands for Ground Truth, which is commonly used in image segmentation. In image classification, it corresponds to the true label of the data, while in image segmentation, GT represents the true classification of each pixel, as shown in the example below.

Image

The GT is as follows:

Image

The predicted mask output by our model contains a prediction at every position. This prediction is compared with the GT mask, and the loss is calculated from the two.
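As a small sketch of how this looks in PyTorch (the shapes here are made up for illustration), nn.CrossEntropyLoss accepts the per-pixel scores and the integer GT mask directly:

```python
import torch
import torch.nn as nn

num_classes = 2                                        # cat and background
pred = torch.randn(1, num_classes, 4, 4)               # model output: (N, C, H, W) scores
gt_mask = torch.randint(0, num_classes, (1, 4, 4))     # GT mask: (N, H, W) class indices

criterion = nn.CrossEntropyLoss()
loss = criterion(pred, gt_mask)                        # cross-entropy averaged over all pixels
print(loss)
```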

Of course, in image segmentation, we can use not only cross-entropy loss but also more targeted Dice Loss. I will further explain this in the next lesson.

Public Datasets #

As we saw earlier, image annotation in image segmentation is time-consuming. We will explore how to annotate images for semantic segmentation in the next lesson through practical examples.

In addition, there are many authoritative and high-quality public datasets in the industry. The most famous one is COCO, which can be accessed at the following link: https://cocodataset.org/#detection-2016. It contains 80 object categories and over 200,000 labeled images. If you are interested, you can try training with it after the lesson.

Image

Summary #

Congratulations on completing today’s learning.

Today, we first clarified what problem semantic segmentation solves and how it classifies each pixel in an image.

Then, we compared the principles of image classification and explained the principles of semantic segmentation. There are two main differences between semantic segmentation and image classification:

  1. The difference lies in the classification end. In image classification, after the extraction of features through convolution, the final output is in the form of several neurons, where each neuron represents the judgment for a specific category. In semantic segmentation, multiple feature maps are output, with each feature map representing a corresponding category.

  2. In the network for image classification, the feature maps become smaller as the process goes on. However, in the network for semantic segmentation, there is a decoder step that enlarges the feature maps. The method used to implement the decoder is called upsampling, and the most commonly used technique is transpose convolution.

When it comes to transpose convolution, besides understanding how it is calculated, the most important thing to remember is that it is not the inverse operation of convolution, but a convolution-based operation that can enlarge the size of feature maps.

Practice for Each Lesson #

For the cat segmentation problem in this lesson, is it possible to output only one feature map?

Feel free to comment and interact with me in the comment section. I also recommend you share this lesson with more colleagues and friends.