20 Image Segmentation How to Build an Image Segmentation Model - Part 2 #

Hello, I’m Fang Yuan.

In the previous lesson, we learned the theoretical knowledge of image segmentation. Are you eager to get your hands on it and experience it for yourself?

Today, let’s build an image segmentation project from scratch: semantic segmentation of kitten images. For this project, I will introduce a simple but practical network architecture called UNet. In this lesson, you will once again walk through the complete process of implementing a machine learning model, and you will also train a working semantic segmentation model.

You can download the course code from here.

Data #

Let’s start with the three essentials of machine learning development: data, training, and evaluation. First, we need to prepare the data. We will start by labeling the training data and then proceed with data loading.

Labeling the Segmented Image #

As mentioned earlier, preparing data for image segmentation is more complex compared to image classification. So, how do we label the images required for semantic segmentation? In image segmentation, each image we use must have a corresponding mask, as shown below:

As we discussed in the previous lesson, a mask is a feature map that contains pixel class labels. Based on the example images here, we can see that the mask is an image that corresponds to the original image, and each position in the mask records the pixel class label corresponding to each position in the original image. To label the mask, we need to use the Labelme tool.

The labeling process consists of seven steps, which we will go through one by one.

Step 1: Download and install Labelme. Follow the installation instructions on GitHub. If the installation is slow, you can use a mirror site in China (e.g., Tsinghua) to speed it up.

Step 2: Put the images to be labeled in a folder, as shown below. Here, I have put all images of cats into the “cats” folder.

Step 3: Prepare a “label.txt” file in advance, with each line representing the category to be labeled. Here is an example of my “label.txt” file:

__ignore__
_background_
cat

Note that the first two lines should be written exactly as shown above; otherwise, converting the annotations with “label2voc.py” will fail with an error. “label2voc.py” is not the only way to convert the data (you can also use “labelme_json_to_dataset”), but it is the one I recommend. From the third line onward, list the categories you want to label.

Step 4: Run the following command to automatically start Labelme.

labelme --labels label.txt --nodata

Step 5: Click on “Open Dir” on the left side and select the folder from step 2. The images to be labeled will be automatically imported. After selecting the file to be labeled in the lower right corner, it will be displayed automatically, as shown in the following image:

Step 6: Click on “Create Polygons” on the left side to start labeling. The labeling process involves enclosing the cat along its boundaries. When a closed loop is formed, Labelme will prompt you to enter the category. Select the “cat” category.

After successful labeling, the result will look like the image below:

After completing the labeling, we need to save the data. This will generate a labeled JSON file, as shown below:

fangyuan@geektime data $ ls cats
1.jpeg 1.json 10.jpeg 10.json 2.jpeg 3.jpeg 4.jpeg 4.json

Step 7: Run the following command to convert the labeled data into masks:

python label2voc.py cats cats_output --label label.txt

The “label2voc.py” script used in the above command comes from the Labelme repository, where it is named “labelme2voc.py”: https://github.com/wkentaro/labelme/blob/main/examples/semantic_segmentation/labelme2voc.py.

In the code, “cats” refers to the labeled data, and “cats_output” is the output folder. Under “cats_output,” four folders will be automatically generated. We only need two folders, namely “JPEGImages” (for the original training images) and “SegmentationClassPNG” (for the converted masks).
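Before moving on, it can help to confirm that every image in “JPEGImages” has a matching mask in “SegmentationClassPNG”. The following is a minimal sketch and not part of the course code: the helper name find_unpaired_images is hypothetical, and it assumes that a mask shares its image's base name and differs only in the suffix.

```python
import os


def find_unpaired_images(output_dir):
    """Return image file names in JPEGImages that have no mask in SegmentationClassPNG."""
    image_dir = os.path.join(output_dir, "JPEGImages")
    mask_dir = os.path.join(output_dir, "SegmentationClassPNG")
    # Base names (without suffix) of all available masks
    mask_stems = {os.path.splitext(name)[0] for name in os.listdir(mask_dir)}
    # Keep only the images whose base name has no corresponding mask
    return sorted(
        name for name in os.listdir(image_dir)
        if os.path.splitext(name)[0] not in mask_stems
    )
```

Calling find_unpaired_images("cats_output") should return an empty list when the conversion covered every image.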

I have labeled a total of 8 images, as shown below. This small set is for demonstration only; real projects require a much larger number of labeled images.

With this, the labeling process is complete.

Data Loading #

After completing the labeling process, we need to load the data in PyTorch. We will write the data-related code in a file called “dataset.py”. The implementation is similar to what we discussed earlier: inherit from the Dataset class and implement the __init__, __len__, and __getitem__ methods.

Below is the code in “dataset.py”, with comments explaining each part. Read together with the comments, the code should be easy to follow.

import os
import torch
import numpy as np

from torch.utils.data import Dataset
from PIL import Image

class CatSegmentationDataset(Dataset):

    # The input of the model is 3-channel data
    in_channels = 3
    # The output of the model is 1-channel data
    out_channels = 1

    def __init__(
        self,
        images_dir,
        image_size=256,
    ):

        print("Reading images...")
        # The location of the original images
        image_root_path = images_dir + os.sep + 'JPEGImages'
        # The location of the masks
        mask_root_path = images_dir + os.sep + 'SegmentationClassPNG'
        
        # Store the images and masks in image_slices and mask_slices respectively after reading
        self.image_slices = []
        self.mask_slices = []
        
        for im_name in os.listdir(image_root_path):
            # The names of the original images and masks are the same, except for the suffix
            mask_name = os.path.splitext(im_name)[0] + '.png'

            image_path = os.path.join(image_root_path, im_name)
            mask_path = os.path.join(mask_root_path, mask_name)

            # Force 3 channels so that grayscale or RGBA files still match in_channels
            im = np.asarray(Image.open(image_path).convert('RGB').resize((image_size, image_size)))
            # Nearest-neighbor resampling keeps the mask values valid class labels
            mask = np.asarray(Image.open(mask_path).resize((image_size, image_size), Image.NEAREST))
            # Scale pixel values to the range [0, 1]
            self.image_slices.append(im / 255.)
            self.mask_slices.append(mask)

    def __len__(self):
        return len(self.image_slices)

    def __getitem__(self, idx):

        image = self.image_slices[idx] 
        mask = self.mask_slices[idx] 

        # The order of the tensor is (Batch_size, channels, height, width), while the order of numpy after reading is (height, width, channels)
        image = image.transpose(2, 0, 1)
        # The mask is single-channel data, so we need to add another dimension
        mask = mask[np.newaxis, :, :]

        image = image.astype(np.float32)
        mask = mask.astype(np.float32)

        return image, mask
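To make the shape handling in __getitem__ concrete, here is a small sketch on synthetic arrays. Random data stands in for a real image and mask, and the 256x256 size follows the default image_size above.

```python
import numpy as np

# Synthetic stand-ins for one resized image (H, W, C) and its mask (H, W)
image = np.random.rand(256, 256, 3)
mask = np.random.randint(0, 2, size=(256, 256))

# Move channels first: (H, W, C) -> (C, H, W)
image_chw = image.transpose(2, 0, 1).astype(np.float32)
# Add a leading channel axis: (H, W) -> (1, H, W)
mask_chw = mask[np.newaxis, :, :].astype(np.float32)

print(image_chw.shape)  # (3, 256, 256)
print(mask_chw.shape)   # (1, 256, 256)
```

A DataLoader then stacks such samples along a new first axis, giving batches of shape (batch_size, 3, 256, 256) and (batch_size, 1, 256, 256).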

Next, we write the training code in train.py, whose entry point is the main function. In main, we call data_loaders to load the data. The code is as follows:

import torch

from torch.utils.data import DataLoader 
from dataset import CatSegmentationDataset as Dataset

def data_loaders(args):
    dataset_train = Dataset(
        images_dir=args.images,
        image_size=args.image_size,
    )

    loader_train = DataLoader(
        dataset_train,
        batch_size=args.batch_size,
        shuffle=True,
        num_workers=args.workers,
    )

    return loader_train

# args is the passed parameter
def main(args):
    loader_train = data_loaders(args)

That covers data preparation. Next, let’s look at model training.

Model Training #

Let’s start by reviewing the three essential components of model training: network architecture, loss function, and optimization method.

Let’s begin with network architecture. Today, I want to introduce a semantic segmentation network called UNet.

Network Architecture: UNet #

UNet is a very practical network. It is a typical Encoder-Decoder type segmentation network with a simple structure, as shown in the following diagram.

Although its network structure is simple, its performance is not. I have compared it with other mainstream semantic segmentation networks in many projects, and UNet has consistently achieved excellent results.

The overall network structure is the same as the conceptual diagram given in the paper. Let’s focus on a few implementation details.

First, the horizontal blue arrows in the diagram represent repeated structures formed by two 3x3 convolutional layers. After each convolutional layer, a Batch Normalization layer and a ReLU activation layer follow. As explained in Lesson 14, these repetitive structures can be extracted separately. Let’s create a unet.py file to define the network structure.

Create a Block class in unet.py to define the repeated convolution blocks mentioned earlier:

class Block(nn.Module):

    def __init__(self, in_channels, features):
        super(Block, self).__init__()

        self.features = features
        self.conv1 = nn.Conv2d(
                            in_channels=in_channels,
                            out_channels=features,
                            kernel_size=3,
                            padding='same',
                        )
        # Layers with learnable parameters must be created in __init__ so that they
        # are registered, trained, saved, and moved to the device with the model
        self.bn1 = nn.BatchNorm2d(num_features=features)
        self.conv2 = nn.Conv2d(
                            in_channels=features,
                            out_channels=features,
                            kernel_size=3,
                            padding='same',
                        )
        self.bn2 = nn.BatchNorm2d(num_features=features)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, input):
        # conv -> batch norm -> ReLU, applied twice
        x = self.conv1(input)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.relu(x)

        return x

Note that within the same block, the size of the feature map remains unchanged, so the padding is set to ‘same’.

Second, the green upward arrows represent the upsampling process. This is implemented using transposed convolution, as discussed in the previous lesson.
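The size bookkeeping behind these choices can be checked with the standard output-size formulas for convolution and transposed convolution, assuming dilation 1 and no output padding. This is an illustrative sketch; the helper names conv_out_size and transposed_conv_out_size are hypothetical.

```python
def conv_out_size(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def transposed_conv_out_size(size, kernel, stride=1, padding=0):
    """Output size of a transposed convolution along one dimension."""
    return (size - 1) * stride - 2 * padding + kernel


# A 3x3 convolution with one pixel of padding ('same' for stride 1) keeps the size
print(conv_out_size(256, kernel=3, padding=1))            # 256
# 2x2 max pooling with stride 2, as used in the encoder, halves it
print(conv_out_size(256, kernel=2, stride=2))             # 128
# A 2x2 transposed convolution with stride 2 doubles it again
print(transposed_conv_out_size(128, kernel=2, stride=2))  # 256
```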

Finally, we want to segment the cat in the image, which means there are two classes: cat and background. For a binary classification problem like this, we can output a single-channel feature map and interpret each value as the probability that the pixel is a positive example (cat) rather than a negative example (background). This is done by the sigmoid in the last line of the forward function below. The following code also completes all the code within unet.py:

import torch
import torch.nn as nn

class Block(nn.Module):
    ...

class UNet(nn.Module):

    def __init__(self, in_channels=3, out_channels=1, init_features=32):
        super(UNet, self).__init__()

        features = init_features
        self.conv_encoder_1 = Block(in_channels, features)
        self.conv_encoder_2 = Block(features, features * 2)
        self.conv_encoder_3 = Block(features * 2, features * 4)
        self.conv_encoder_4 = Block(features * 4, features * 8)

        self.bottleneck = Block(features * 8, features * 16)

        self.upconv4 = nn.ConvTranspose2d(
            features * 16, features * 8, kernel_size=2, stride=2
        )
        self.conv_decoder_4 = Block((features * 8) * 2, features * 8)
        self.upconv3 = nn.ConvTranspose2d(
            features * 8, features * 4, kernel_size=2, stride=2
        )
        self.conv_decoder_3 = Block((features * 4) * 2, features * 4)
        self.upconv2 = nn.ConvTranspose2d(
            features * 4, features * 2, kernel_size=2, stride=2
        )
        self.conv_decoder_2 = Block((features * 2) * 2, features * 2)
        self.upconv1 = nn.ConvTranspose2d(
            features * 2, features, kernel_size=2, stride=2
        )
        self.decoder1 = Block(features * 2, features)

        self.conv = nn.Conv2d(
            in_channels=features, out_channels=out_channels, kernel_size=1
        )

    def forward(self, x):
        conv_encoder_1_1 = self.conv_encoder_1(x)
        conv_encoder_1_2 = nn.MaxPool2d(kernel_size=2, stride=2)(conv_encoder_1_1)

        conv_encoder_2_1 = self.conv_encoder_2(conv_encoder_1_2)
        conv_encoder_2_2 = nn.MaxPool2d(kernel_size=2, stride=2)(conv_encoder_2_1)

        conv_encoder_3_1 = self.conv_encoder_3(conv_encoder_2_2)
        conv_encoder_3_2 = nn.MaxPool2d(kernel_size=2, stride=2)(conv_encoder_3_1)

        conv_encoder_4_1 = self.conv_encoder_4(conv_encoder_3_2)
        conv_encoder_4_2 = nn.MaxPool2d(kernel_size=2, stride=2)(conv_encoder_4_1)

        bottleneck = self.bottleneck(conv_encoder_4_2)

        conv_decoder_4_1 = self.upconv4(bottleneck)
        conv_decoder_4_2 = torch.cat((conv_decoder_4_1, conv_encoder_4_1), dim=1)
        conv_decoder_4_3 = self.conv_decoder_4(conv_decoder_4_2)

        conv_decoder_3_1 = self.upconv3(conv_decoder_4_3)
        conv_decoder_3_2 = torch.cat((conv_decoder_3_1, conv_encoder_3_1), dim=1)
        conv_decoder_3_3 = self.conv_decoder_3(conv_decoder_3_2)

        conv_decoder_2_1 = self.upconv2(conv_decoder_3_3)
        conv_decoder_2_2 = torch.cat((conv_decoder_2_1, conv_encoder_2_1), dim=1)
        conv_decoder_2_3 = self.conv_decoder_2(conv_decoder_2_2)

        conv_decoder_1_1 = self.upconv1(conv_decoder_2_3)
        conv_decoder_1_2 = torch.cat((conv_decoder_1_1, conv_encoder_1_1), dim=1)
        conv_decoder_1_3 = self.decoder1(conv_decoder_1_2)

        return torch.sigmoid(self.conv(conv_decoder_1_3))
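To see why each decoder block takes twice as many input channels as its upsampling layer produces, the sketch below traces (channels, height, width) through the network in plain Python, with no PyTorch involved. It assumes a 256x256 input and init_features=32, the defaults used in this lesson; the function name trace_unet_shapes is hypothetical.

```python
def trace_unet_shapes(size=256, in_channels=3, out_channels=1,
                      init_features=32, depth=4):
    """Trace (channels, height, width) through a UNet with the layout above."""
    shapes = [("input", (in_channels, size, size))]
    skips = []
    features = init_features
    # Encoder: each Block sets the channel count, each pooling halves the size
    for level in range(1, depth + 1):
        shapes.append((f"encoder_{level}", (features, size, size)))
        skips.append((features, size))
        size //= 2
        features *= 2
    shapes.append(("bottleneck", (features, size, size)))
    # Decoder: upsample, concatenate the matching encoder output, then Block
    for level in range(depth, 0, -1):
        features //= 2
        size *= 2
        skip_channels, skip_size = skips.pop()
        assert skip_size == size  # the skip connection must match spatially
        shapes.append((f"concat_{level}", (features + skip_channels, size, size)))
        shapes.append((f"decoder_{level}", (features, size, size)))
    # The final 1x1 convolution maps to the number of output channels
    shapes.append(("output", (out_channels, size, size)))
    return shapes


for name, shape in trace_unet_shapes():
    print(name, shape)
```

The trace also shows that the input side length needs to be divisible by 16 (four poolings); otherwise the upsampled feature maps no longer line up with the encoder outputs they are concatenated with.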


Here, we have constructed the network structure. Now let's take a look at the loss function.

Loss Function: Dice Loss #

Here, we will discuss a commonly used loss function in semantic segmentation - the Dice Loss.

To understand how this loss function is derived, you first need to understand the Dice coefficient, an evaluation metric for semantic segmentation. (The most common metric is mIoU, which we will discuss later.) The Dice coefficient measures the similarity between two sets and takes values between 0 and 1.

The formula for the Dice coefficient is as follows:

\[Dice=\frac{2|P\cap G|}{|P|+|G|}\]

Here, \(|P\cap G|\) is the number of elements shared by set P and set G, and \(|P|\) and \(|G|\) are the numbers of elements in sets P and G, respectively. The factor 2 in the numerator compensates for the denominator counting the shared elements twice, once in \(|P|\) and once in \(|G|\), so that the coefficient equals 1 when P and G are identical. For semantic segmentation tasks, set P represents the predicted mask, and set G represents the ground truth mask.

Based on the Dice coefficient, we can design a loss function called the Dice Loss. Its formula is very simple and is shown below:

\[Dice\ Loss=1-\frac{2|P\cap G|}{|P|+|G|}\]

From the formula, we can see that the more similar the predicted mask is to the ground truth, the smaller the loss; the greater the difference between the predicted mask and the ground truth, the larger the loss.

For binary classification problems, the ground truth (GT) only has 0 and 1 as values. When we directly use the predicted probabilities output by the model instead of using a threshold to convert them into a binary mask, this loss function is called the Soft Dice Loss. In this case, \(|P\cap G|\) is approximated by the dot product between the GT and the predicted probability matrix, and \(|P|\) and \(|G|\) by the sums of the predicted probabilities and of the GT values.

The code for defining the loss function is as follows:

import torch.nn as nn

class DiceLoss(nn.Module):
    def __init__(self):
        super(DiceLoss, self).__init__()
        self.smooth = 1.0

    def forward(self, y_pred, y_true):
        assert y_pred.size() == y_true.size()
        y_pred = y_pred[:, 0].contiguous().view(-1)
        y_true = y_true[:, 0].contiguous().view(-1)
        intersection = (y_pred * y_true).sum()
        dsc = (2. * intersection + self.smooth) / (
            y_pred.sum() + y_true.sum() + self.smooth
        )
        return 1. - dsc

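Here, self.smooth is a smoothing value added to both the numerator and the denominator. It prevents division by zero when both the prediction and the ground truth are empty; in that case the coefficient becomes 1 and the loss 0.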
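As a quick numerical check of the formula, here is a plain-Python sketch of the Soft Dice Loss on a flattened four-pixel example. It mirrors the arithmetic of DiceLoss above without PyTorch; the function name soft_dice_loss and the sample values are made up for illustration.

```python
def soft_dice_loss(y_pred, y_true, smooth=1.0):
    """Soft Dice Loss for flat lists of probabilities and 0/1 labels."""
    intersection = sum(p * g for p, g in zip(y_pred, y_true))
    dsc = (2.0 * intersection + smooth) / (sum(y_pred) + sum(y_true) + smooth)
    return 1.0 - dsc


y_true = [1, 1, 0, 0]

# A confident, mostly correct prediction gives a small loss
good = soft_dice_loss([0.9, 0.8, 0.1, 0.2], y_true)
# A confidently wrong prediction gives a large loss
bad = soft_dice_loss([0.1, 0.2, 0.9, 0.8], y_true)
# With an empty ground truth and an all-zero prediction, smooth avoids 0 / 0
empty = soft_dice_loss([0.0, 0.0, 0.0, 0.0], [0, 0, 0, 0])

print(round(good, 3), round(bad, 3), empty)
```

For these values the losses come out at roughly 0.12 and 0.68, and exactly 0 for the empty case.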

Training Workflow #

Finally, we connect the model, loss function, and optimization method together to see the overall training workflow. The training code is as follows:

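import os

import numpy as np
import torch
import torch.optim as optim
from tqdm import tqdm

from dataset import CatSegmentationDataset as Dataset
from unet import UNet
# DiceLoss is the class defined in the previous section;
# import it from the file where you saved it

def main(args):
    # Create the folder for the saved weights
    # (stands in for the course's makedirs(args) helper, which is not shown here)
    os.makedirs(args.weights, exist_ok=True)
    # Choose CPU or GPU based on the availability of CUDA
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Load the training data
    loader_train = data_loaders(args)
    # Instantiate the UNet model
    unet = UNet(in_channels=Dataset.in_channels, out_channels=Dataset.out_channels)
    # Move the model to the device (GPU or CPU)
    unet.to(device)
    # Loss function
    dsc_loss = DiceLoss()
    # Optimization method
    optimizer = optim.Adam(unet.parameters(), lr=args.lr)

    loss_train = []
    step = 0
    # Train for n epochs
    for epoch in tqdm(range(args.epochs), total=args.epochs):
        unet.train()
        for i, data in enumerate(loader_train):
            step += 1
            x, y_true = data
            x, y_true = x.to(device), y_true.to(device)
            y_pred = unet(x)
            optimizer.zero_grad()
            loss = dsc_loss(y_pred, y_true)
            loss_train.append(loss.item())
            loss.backward()
            optimizer.step()
            # Report the average loss every 10 steps
            if (step + 1) % 10 == 0:
                print('Step ', step, 'Loss', np.mean(loss_train))
                loss_train = []
        # Save a checkpoint at the end of every epoch
        torch.save(unet, os.path.join(args.weights, 'unet_epoch_{}.pth'.format(epoch)))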

The key points are explained in the comments, so please read through them. Essentially, the code covers the three key steps of model training: loading the data, constructing the network, and iteratively updating the network parameters.
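The training code reads its settings from args, but the lesson does not show how args is built. One way to construct it is with argparse, as in the sketch below. The attribute names (images, image_size, batch_size, workers, lr, epochs, weights) are the ones the code above uses; every default value is an assumption chosen for illustration, not a setting taken from the course code.

```python
import argparse


def build_parser():
    """Command-line options matching the attributes that main(args) reads."""
    parser = argparse.ArgumentParser(description="Train UNet for cat segmentation")
    # All defaults below are illustrative assumptions
    parser.add_argument("--images", type=str, default="./data",
                        help="folder containing JPEGImages and SegmentationClassPNG")
    parser.add_argument("--image-size", type=int, default=256,
                        help="side length the images are resized to")
    parser.add_argument("--batch-size", type=int, default=4)
    parser.add_argument("--workers", type=int, default=0,
                        help="number of DataLoader worker processes")
    parser.add_argument("--lr", type=float, default=1e-4, help="learning rate")
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--weights", type=str, default="./weights",
                        help="folder where checkpoints are saved")
    return parser


# Passing a list makes the example independent of the real command line
args = build_parser().parse_args(["--epochs", "5", "--lr", "0.001"])
print(args.image_size, args.epochs, args.lr)
```

In train.py, the last step would instead be build_parser().parse_args() followed by main(args).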

I trained the model with the training data for several epochs and saved multiple models in the .pth format. With this, we have completed the entire process of model training. Now, we can use the saved models for predictions and see how the segmentation results turn out.

Model Prediction #

Now we are going to use the trained model for semantic segmentation and see what the results look like.

The code for model prediction is as follows.

import torch
import numpy as np

from PIL import Image

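img_size = (256, 256)
# Load the model onto the CPU, even if it was trained and saved on a GPU.
# The file stores the whole model object, so recent PyTorch versions may also
# require passing weights_only=False to torch.load.
unet = torch.load('./weights/unet_epoch_51.pth', map_location='cpu')
unet.eval()
# Load and process the input image
ori_image = Image.open('data/JPEGImages/6.jpg').convert('RGB')
im = np.asarray(ori_image.resize(img_size))
im = im / 255.
im = im.transpose(2, 0, 1)
# Add the batch dimension: (C, H, W) -> (1, C, H, W)
im = im[np.newaxis, :, :]
im = im.astype('float32')
# Model prediction (no gradients are needed for inference)
with torch.no_grad():
    output = unet(torch.from_numpy(im)).numpy()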
# Convert the model output to a mask image
output = np.squeeze(output)
output = np.where(output > 0.5, 1, 0).astype(np.uint8)
mask = Image.fromarray(output, mode='P')
mask.putpalette([0, 0, 0, 0, 128, 0])
mask = mask.resize(ori_image.size)
mask.save('output.png')

This code is also easy to understand. First, we load the model using the torch.load function. Then we load and process an input image for segmentation. Next, we pass the processed data into the model to get the predicted values output. Finally, we convert the predicted values into a visualized mask image for saving.

The input image is the image to be segmented, as shown in the left image below. The final output, which is the visualized mask image, is shown in the right image below.

In the process of converting the predicted values to a mask image, the threshold of 0.5 is used. The pixels with values exceeding the threshold are set to 1 in the output matrix, representing the cat area. The pixels with values below the threshold are set to 0 in the output matrix, representing the background area.

To save the output matrix as a visualized image, we use the Image.fromarray function to convert the NumPy array to the Image format, with the mode set to “P” for palette mode. Then we use the putpalette function to colorize the Image object.

The putpalette function takes a list as its parameter: [0, 0, 0, 0, 128, 0]. The first three numbers in the list represent the RGB values of pixels with a value of 0 ([0, 0, 0] represents black), and the last three numbers represent the RGB values of pixels with a value of 1 ([0, 128, 0] represents green). In this way, the saved mask image has black representing the background area, and green representing the cat area.
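The thresholding and the palette lookup can be reproduced on a tiny array without a trained model. In this sketch the 2x3 probability map is invented for illustration, and the palette is reshaped into one RGB row per class index, which mirrors how a palette image resolves pixel values to colors.

```python
import numpy as np

# Made-up model output: probability of "cat" for each pixel
probs = np.array([[0.9, 0.4, 0.7],
                  [0.2, 0.5, 0.6]])

# Pixels strictly above 0.5 become class 1 (cat), the rest class 0 (background)
mask = np.where(probs > 0.5, 1, 0).astype(np.uint8)

# Flat palette as passed to putpalette, viewed as one RGB triple per class
palette = np.array([0, 0, 0, 0, 128, 0], dtype=np.uint8).reshape(-1, 3)
# Index the palette with the mask to get an (H, W, 3) color image
rgb = palette[mask]

print(mask)
print(rgb[0, 0], rgb[0, 1])
```

Note that a probability of exactly 0.5 falls on the background side, because the comparison is strict.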

However, a standalone mask image does not show intuitively how well the segmentation matches the picture. So we overlay the mask on the original image. The specific code is as follows.

image = ori_image.convert('RGBA')
mask = mask.convert('RGBA')
# Merge the images
image_mask = Image.blend(image, mask, 0.3)
image_mask.save("output_mask.png")

First, we convert the original image and the mask image to the “RGBA” mode. Then we use the Image.blend function to merge the two images into one. The last parameter, 0.3, is the blending weight: the mask image contributes 30% of each output pixel and the original image contributes 70%. The final result is shown in the image below.
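Image.blend computes each output channel as image1 * (1 - alpha) + image2 * alpha. The sketch below applies that formula to single pixels by hand; the pixel values are made up, the helper blend_pixel is hypothetical, and its rounding may differ slightly from Pillow's.

```python
def blend_pixel(pixel1, pixel2, alpha):
    """Per-channel linear interpolation between two pixels."""
    return tuple(round(a * (1.0 - alpha) + b * alpha) for a, b in zip(pixel1, pixel2))


original = (200, 100, 50, 255)   # an RGBA pixel from the photo
cat_mask = (0, 128, 0, 255)      # green: predicted cat
background = (0, 0, 0, 255)      # black: predicted background

print(blend_pixel(original, cat_mask, 0.3))
print(blend_pixel(original, background, 0.3))
```

Background pixels are only darkened, while cat pixels also pick up a green tint, which is what makes the predicted region visible in the overlay.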

This way, we can intuitively see where the predictions are inaccurate.

Model Evaluation #

In semantic segmentation, the most commonly used evaluation metric is mIoU (mean Intersection over Union). The intersection over union (IoU) is the ratio between the intersection and the union of the true values and the predicted values; mIoU is the average of this ratio over all categories.

The true values are the masks we just annotated with Labelme, also known as the Ground Truth (GT), as shown in the left image below.

The predicted values are the masks predicted by the model, denoted as Prediction, as shown in the right image below.

The intersection refers to the intersection between the true values and predicted values, as shown in the yellow region in the image below. The union refers to the union between the true values and predicted values, as shown in the blue region in the image below.

From the above images, it is easy for us to understand mIoU. The formula for mIoU is shown below.

\[mIoU=\frac{1}{k}\sum_{i=1}^{k}{\frac{P\cap G}{P\cup G}}\]

Here, k is the number of categories. In our example there is only one category, “cat”, so k is 1; the background is usually not included in the mIoU calculation. P represents the predicted values and G represents the true values for a given category.
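As a sketch of how this could be computed for binary masks, the code below evaluates the IoU of the “cat” class on two tiny made-up masks and then averages over the categories. The function names iou and mean_iou are hypothetical, and treating two empty masks as a perfect match is an assumption made here to avoid dividing by zero.

```python
import numpy as np


def iou(pred, gt):
    """Intersection over union of two binary masks (1 = cat, 0 = background)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Assumption: if both masks are empty, treat the match as perfect
    return 1.0 if union == 0 else float(intersection) / float(union)


def mean_iou(preds_per_class, gts_per_class):
    """Average IoU over the k categories (k = 1 when only "cat" is counted)."""
    scores = [iou(p, g) for p, g in zip(preds_per_class, gts_per_class)]
    return sum(scores) / len(scores)


gt = [[0, 1, 1],
      [0, 1, 1]]
pred = [[0, 0, 1],
        [1, 1, 1]]

print(iou(pred, gt))           # 0.6
print(mean_iou([pred], [gt]))  # 0.6
```

Here the two masks share 3 pixels and together cover 5, so the IoU is 3 / 5.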

Conclusion #

Congratulations on completing today’s learning task. In this lesson, we worked together on a practical image segmentation project.

Firstly, I introduced you to the data preparation for image segmentation, which involves using the Labelme tool to label the images. The quality of the data directly affects the quality of the final model, so it is important to carefully annotate the data. After completing the labeling using Labelme, we can use label2voc.py to convert the JSON files into masks.

Next, we learned about a highly efficient and practical model called UNet, and implemented its network structure using PyTorch.

Then, I explained the evaluation metric mIoU and the loss function Dice Loss for image segmentation.

The formula for mIoU is as follows:

\[mIoU=\frac{1}{k}\sum_{i=1}^{k}{\frac{P\cap G}{P\cup G}}\]

mIoU measures how much the predicted results overlap with the ground truth, which makes it a direct indicator of segmentation quality. It is a commonly used evaluation metric in image segmentation.

Finally, we used the trained model to make predictions and visualized the segmentation results. I believe that through the previous image classification project and today’s image segmentation project, you will gain a deeper understanding of image processing.

Practice Exercise for Each Lesson #

You can try to build an image segmentation model yourself based on today’s content, and then test its performance using an image.

Feel free to discuss and communicate with me in the comments section. I also encourage you to share today’s content with more colleagues and friends. See you in the next class.