One of the Codia AI Design technologies: the image segmentation model

happyer - Feb 26 - Dev Community

1. Preface

The articles Codia AI: Shaping the Design and Code Revolution of 2024 and Codia AI: Shaping the Design and Code Revolution of 2024 - Part 2 introduced Codia AI. This article focuses on the image segmentation model.

Image segmentation is a fundamental task in computer vision, with application areas including scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among others. Image segmentation is the technology and process of dividing an image into several specific regions with unique properties and extracting objects of interest. It is a key step from image processing to image analysis. Compared to image classification and detection, segmentation is a more refined task, as it requires classifying each pixel.

Image segmentation can be framed as pixel classification with semantic labels (semantic segmentation) or as segmentation of individual objects (instance segmentation). Semantic segmentation assigns one of a set of object categories (such as person, car, tree, sky) to every pixel in the image, i.e., it simply classifies each pixel. Instance segmentation extends this further: it requires detecting and delineating each object of interest in the image (for example, each individual person), that is, distinguishing different objects. In a sense, instance segmentation can be seen as semantic segmentation plus detection.

This article focuses on deep learning techniques, detailing 19 classic models that have achieved SOTA on image segmentation tasks, including both semantic segmentation and instance segmentation models.

2. FCN

Fully Convolutional Network (FCN) is a deep learning architecture specifically designed for image segmentation. It was first proposed in 2015 by researchers Jonathan Long, Evan Shelhamer, and Trevor Darrell from the University of California, Berkeley. Unlike traditional Convolutional Neural Networks (CNNs), FCNs are designed to handle input images of any size and output segmentation maps of corresponding sizes, with each pixel classified into a category.

A key feature of FCNs is that they are composed entirely of convolutional layers, with no fully connected layers, which allows them to accept input images of any size. In traditional CNNs, the fully connected layers require fixed-size inputs because their parameters depend on the input dimensions. In FCNs, these fully connected layers are converted into convolutional layers, allowing arbitrary-size inputs and correspondingly sized output feature maps.

The architecture of FCN can be divided into the following parts:

  1. Convolutional and pooling layers: FCN uses a series of convolutional and pooling layers to extract image features. These layers can be transferred from pre-trained networks (such as VGG16, ResNet, etc.) or trained from scratch.

  2. Conversion of fully connected layers to convolutional layers: The fully connected layers used for classification in traditional CNNs are converted into convolutional layers in FCNs. For example, the fully connected layers in VGG16 can be converted into convolutional layers with the same number of filters, allowing for arbitrary size feature maps.

  3. Upsampling: Since convolution and pooling operations reduce the size of feature maps, FCNs use upsampling (such as transposed convolution, also known as deconvolution) to enlarge feature maps, gradually restoring them to the same resolution as the input image (a minimal sketch of such a head follows this list).

  4. Skip connections: To retain more detail during upsampling, FCNs introduce skip connections. These connections combine feature maps of different levels (i.e., feature maps of different resolutions) to preserve more details in the output segmentation map.

  5. End-to-end training: FCNs can be trained end-to-end, meaning training directly from input images to output segmentation maps. This allows for optimization of all network parameters through the backpropagation algorithm.
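
To make the above concrete, here is a minimal, hypothetical FCN-style head in PyTorch (the framework choice, channel counts, and the FCN-16s-style single skip connection are illustrative, not the original implementation):

```python
# Hypothetical minimal FCN-style head, assuming a backbone that exposes two
# feature maps: `c4` at 1/16 resolution and `c5` at 1/32 resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCNHead(nn.Module):
    def __init__(self, c4_channels, c5_channels, num_classes):
        super().__init__()
        # 1x1 convolutions replace the fully connected classifier, so inputs of
        # any size produce correspondingly sized score maps.
        self.score_c5 = nn.Conv2d(c5_channels, num_classes, kernel_size=1)
        self.score_c4 = nn.Conv2d(c4_channels, num_classes, kernel_size=1)
        # Transposed convolution ("deconvolution") performs learnable 2x upsampling.
        self.up2x = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4,
                                       stride=2, padding=1)

    def forward(self, c4, c5, out_size):
        x = self.up2x(self.score_c5(c5))          # 1/32 -> 1/16
        x = x + self.score_c4(c4)                 # skip connection (FCN-16s style)
        # Final upsampling back to the input resolution.
        return F.interpolate(x, size=out_size, mode="bilinear", align_corners=False)

head = FCNHead(c4_channels=512, c5_channels=512, num_classes=21)
c4, c5 = torch.randn(1, 512, 32, 32), torch.randn(1, 512, 16, 16)
logits = head(c4, c5, out_size=(512, 512))        # (1, 21, 512, 512)
```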

The following figure shows a schematic diagram of the FCN structure used for semantic segmentation:

FCN

3. ReSeg

ReSeg is an image segmentation model based on Recurrent Neural Networks (RNNs), developed from the Fully Convolutional Network (FCN). The main feature of ReSeg is the use of RNN's sequence processing capability to capture long-range dependencies in images, thereby improving segmentation accuracy.

The architecture of the ReSeg model typically includes the following key parts:

  1. Preprocessing convolutional layers: These layers are usually transferred from pre-trained CNN models (such as VGG or ResNet) to extract low-level image features. This part is equivalent to the convolutional and pooling layers in FCN.

  2. ReNet layers: The ReNet layer is the core of ReSeg, consisting of a series of RNN units. These RNN units process the features of each row or column of the image in a fixed order (for example, from left to right, then from top to bottom). In this way, the ReNet layer can capture spatial dependencies in the image. ReNet layers typically use Gated Recurrent Units (GRUs) or Long Short-Term Memory (LSTM) networks as the basic RNN units (a rough sketch follows this list).

  3. Upsampling and skip connections: Similar to FCN, ReSeg also uses upsampling (such as transposed convolution) to restore the resolution of feature maps and combines features of different levels through skip connections to retain more detail.

  4. Classification layer: Finally, ReSeg uses one or more convolutional layers to classify the upsampled feature maps, assigning a category label to each pixel.
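
The following is a rough, hypothetical PyTorch sketch of a ReNet-style layer as described above; the use of nn.GRU, the hidden size, and the row-then-column sweep order are assumptions for illustration:

```python
# Sketch of a ReNet-style layer: a bidirectional GRU sweeps every row, then
# another sweeps every column, so each output position has seen the whole image.
import torch
import torch.nn as nn

class ReNetLayer(nn.Module):
    def __init__(self, in_channels, hidden):
        super().__init__()
        self.row_rnn = nn.GRU(in_channels, hidden, batch_first=True, bidirectional=True)
        self.col_rnn = nn.GRU(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)   # each row is a sequence
        rows, _ = self.row_rnn(rows)                        # (B*H, W, 2*hidden)
        x = rows.reshape(b, h, w, -1)
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, -1)  # each column is a sequence
        cols, _ = self.col_rnn(cols)                        # (B*W, H, 2*hidden)
        return cols.reshape(b, w, h, -1).permute(0, 3, 2, 1)  # (B, 2*hidden, H, W)

layer = ReNetLayer(in_channels=256, hidden=128)
print(layer(torch.randn(1, 256, 32, 32)).shape)             # torch.Size([1, 256, 32, 32])
```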

The advantage of the ReSeg model is its ability to effectively process contextual information in images, which is very important for understanding objects and their relationships in images. Through the RNN's recurrent processing mechanism, ReSeg can consider the spatial relationships between pixels, which is particularly critical in handling complex scene image segmentation tasks.

ReSeg has shown good performance on multiple image segmentation datasets, especially in tasks that require capturing long-range dependencies. However, due to the sequential processing nature of RNNs, ReSeg may face computational efficiency challenges when dealing with large-size images. Additionally, training RNNs is generally more complex than convolutional networks and requires more tuning.

Overall, ReSeg is an important contribution to the field of image segmentation, demonstrating the potential of RNNs in processing image spatial dependencies. Despite some challenges, ReSeg has provided new perspectives and methods for subsequent research. The ReSeg structure is shown in the figure:

ReSeg

4. U-Net

U-Net is a deep learning network for image segmentation, proposed by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in 2015. It was originally designed to solve medical image segmentation problems, but its structure and performance have made it very popular in various image segmentation tasks. U-Net's design is particularly suitable for handling cases with a limited number of training samples.

The architecture of U-Net has a "U" shape, consisting of two main parts: the contracting path and the expansive path.

  1. Contracting path: Also known as the encoder, it mainly consists of a series of convolutional and pooling layers, used to capture the context of the image and reduce the spatial dimensions of the feature map. After each pooling operation, the size of the feature map is halved, and the number of feature channels is doubled. This process is similar to traditional convolutional neural networks and is used to extract features at different levels.

  2. Expansive path: Also known as the decoder, it gradually restores the size of the feature map through a series of upsampling operations and convolutional layers, while reducing the number of feature channels. After each upsampling, U-Net uses skip connections to concatenate the corresponding-size feature maps from the encoder with the feature maps in the decoder, combining high-resolution features with upsampled features to retain more detail (a minimal sketch follows this list).

  3. Skip connections: Skip connections are a key feature of U-Net, allowing the network to directly pass low-level features to high levels, helping the network better recover image details during upsampling. This is crucial for precise edge localization and segmentation of small objects.

  4. Final layer: At the end of the expansive path, a 1x1 convolutional layer is used to map the feature map to the required number of categories, generating a classification label for each pixel.
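
A heavily simplified, hypothetical PyTorch sketch of the U-shape with a single encoder/decoder level and one skip connection (channel counts and depth are illustrative and far smaller than the real U-Net):

```python
# Tiny U-Net-style network: one contracting step, one expansive step,
# one skip-connection concatenation, and a final 1x1 classification layer.
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.enc1 = double_conv(3, 64)
        self.enc2 = double_conv(64, 128)            # channels double after pooling
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = double_conv(128, 64)            # 128 = 64 (skip) + 64 (upsampled)
        self.head = nn.Conv2d(64, num_classes, 1)   # final 1x1 classification layer

    def forward(self, x):
        s1 = self.enc1(x)                            # contracting path
        bottom = self.enc2(self.pool(s1))
        up = self.up(bottom)                         # expansive path
        merged = torch.cat([s1, up], dim=1)          # skip connection
        return self.head(self.dec1(merged))

net = TinyUNet(num_classes=2)
print(net(torch.randn(1, 3, 64, 64)).shape)          # torch.Size([1, 2, 64, 64])
```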

Advantages of U-Net include:

  • Efficient data usage: Since medical image data is often difficult to obtain, U-Net's design allows it to learn effective feature representations even with limited training samples.
  • Precise localization: By combining contextual information and localization information through skip connections, U-Net achieves precise pixel-level segmentation.
  • Broad applicability: Although originally designed for medical images, U-Net has been proven to be very effective in many other types of image segmentation tasks.

The UNet network architecture is shown in the figure:

UNet

5. ParseNet

ParseNet is a deep learning model for semantic segmentation that improves upon the Fully Convolutional Network (FCN). The main contribution of ParseNet is the introduction of global context information to enhance the network's understanding of different regions in the image. This global context information helps the model better handle scale variations and relationships between regions, thereby improving segmentation accuracy.

The architecture of ParseNet mainly includes the following parts:

  1. Base convolutional network: ParseNet typically uses a pre-trained convolutional neural network (such as VGG or ResNet) as a feature extractor. These networks extract image features through a series of convolutional and pooling layers.

  2. Context module: The core of ParseNet is its context module, which is responsible for capturing global context information. This is achieved by applying global average pooling (GAP) after the last convolutional layer. The GAP operation averages each feature channel's feature map into a single value, resulting in a global feature vector that summarizes the context information of the entire image.

  3. Feature fusion: The global feature vector is then replicated and resized to match the spatial dimensions of the last convolutional layer's output feature map. This adjusted global feature map is fused (usually by concatenation or addition) with the original feature map, so that the features at each position contain global context information (a sketch follows this list).

  4. Classification layer: The feature map fused with global context information is then processed through one or more convolutional layers, and finally, a 1x1 convolutional layer is used to predict the category of each pixel.
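
A minimal, hypothetical PyTorch sketch of the ParseNet-style global context fusion described above (concatenation is chosen as the fusion method here; the L2 normalization used in the paper is only noted in a comment):

```python
# Global average pooling produces an image-level descriptor that is broadcast
# back to the spatial size and concatenated with the local features.
import torch
import torch.nn.functional as F

def add_global_context(feat):                     # feat: (B, C, H, W)
    b, c, h, w = feat.shape
    context = F.adaptive_avg_pool2d(feat, 1)      # (B, C, 1, 1) global descriptor
    context = context.expand(b, c, h, w)          # "unpool": replicate to every position
    # (ParseNet additionally L2-normalises both branches before fusing.)
    return torch.cat([feat, context], dim=1)      # (B, 2C, H, W)

fused = add_global_context(torch.randn(1, 512, 32, 32))
print(fused.shape)                                # torch.Size([1, 1024, 32, 32])
```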

By combining local features and global context information in this way, ParseNet enables the model to better handle different regions and objects in the image during semantic segmentation. For example, global context information can provide additional clues when segmenting small objects or identifying objects in complex backgrounds, helping the model make more accurate judgments, as shown in the figure below:

ParseNet

6. DeepMask

DeepMask is a deep learning model for object segmentation proposed by Pedro O. Pinheiro, Ronan Collobert, and Piotr Dollár from Facebook AI Research (FAIR) in 2015. DeepMask's goal is to generate class-agnostic segmentation masks for objects, i.e., a pixel-level mask for each object in the image, together with a score indicating how likely each proposal is to contain a complete object.

The core idea of DeepMask is to combine object detection and segmentation into a unified network that learns both the location and shape of objects simultaneously. This approach is different from traditional object detection methods (such as the R-CNN series), which typically detect the object's bounding box first and then segment the pixels within the bounding box.

The architecture of DeepMask mainly includes the following parts:

  1. Convolutional Neural Network: DeepMask uses a convolutional neural network (CNN) as a feature extractor, which is usually a pre-trained network such as VGG or ResNet. The network extracts image features through a series of convolutional and pooling layers.

  2. Segmentation Module: On top of the feature extractor, DeepMask designs a segmentation module consisting of several convolutional layers to generate the object's segmentation mask. The output of this module is a binary mask indicating whether each pixel in the image belongs to an object.

  3. Object Score Module: In addition to the segmentation module, DeepMask also includes an object score module that uses the same feature extractor output to predict a score indicating whether the generated mask contains an object. This score is used to assess the quality of the mask (both branches are sketched after this list).

  4. Multi-scale Processing: DeepMask adopts a multi-scale processing approach, processing the image at different scales to capture objects of different sizes. This is achieved by sliding windows over images of different resolutions and applying the network.

  5. Training Strategy: DeepMask is trained in an end-to-end framework, optimizing both the segmentation mask and the object score. Training data includes images as well as corresponding segmentation masks and object bounding boxes.
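
A rough, hypothetical PyTorch sketch of the two-head idea: a shared trunk feature map feeds a class-agnostic mask branch and an objectness score branch (the layer sizes and the 56x56 mask size are illustrative, not DeepMask's exact configuration):

```python
# Shared trunk features -> (mask logits, objectness logit) per proposal window.
import torch
import torch.nn as nn

class DeepMaskStyleHead(nn.Module):
    def __init__(self, in_channels, mask_size=56):
        super().__init__()
        self.mask_head = nn.Sequential(              # segmentation branch
            nn.Conv2d(in_channels, 128, 1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1),
            nn.Upsample(size=(mask_size, mask_size), mode="bilinear",
                        align_corners=False))
        self.score_head = nn.Sequential(             # objectness branch
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_channels, 1))

    def forward(self, trunk_features):
        mask_logits = self.mask_head(trunk_features)    # (B, 1, 56, 56)
        score_logits = self.score_head(trunk_features)  # (B, 1)
        return mask_logits, score_logits

head = DeepMaskStyleHead(in_channels=512)
mask, score = head(torch.randn(4, 512, 14, 14))
```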

The introduction of DeepMask brought a new perspective to the field of object segmentation, emphasizing the importance of learning both the location and shape of objects simultaneously. DeepMask's ideas and techniques have influenced subsequent research, inspiring a series of new models and methods such as SharpMask and Instance Segmentation networks. These models have improved upon DeepMask's foundation, enhancing segmentation accuracy and efficiency. The figure below illustrates this:

DeepMask

7. SegNet

SegNet is a popular deep learning architecture specifically designed for image segmentation tasks, particularly excelling in scene understanding and pixel-level classification. It was proposed by Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla from the University of Cambridge in 2015. SegNet's main feature is its encoder-decoder architecture and the use of pooling indices, which make it both efficient and accurate in handling complex visual scenes.

The architecture of SegNet can be divided into two main parts:

  1. Encoder:

    • The encoder part consists of a series of convolutional layers and max-pooling layers, typically based on the pre-trained VGG16 network.
    • During each pooling step, the encoder not only reduces the spatial resolution of the feature map but also increases the number of feature channels, capturing more abstract feature representations.
    • Importantly, the encoder records the positions of the maximum values in each pooling area (known as pooling indices) during the pooling operation, which will be used in the decoder for upsampling.
  2. Decoder:

    • The decoder part contains a series of upsampling operations aimed at restoring the feature map to the original resolution of the input image.
    • The decoder uses the pooling indices recorded by the encoder to perform non-linear upsampling, a method known as max-unpooling (illustrated in the snippet after this list).
    • After upsampling, the decoder further refines the feature map through a series of convolutional layers to recover the edges and details of the target.
    • Finally, the decoder outputs a feature map with the same resolution as the input image, with the feature vector at each pixel position used for pixel-level classification.
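
A minimal PyTorch illustration of the pooling-indices mechanism SegNet relies on; this is not the full network, just the pool/unpool pair:

```python
# Max-pooling records argmax positions; max_unpool2d later places values back
# at exactly those positions during decoding.
import torch
import torch.nn as nn
import torch.nn.functional as F

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)

x = torch.randn(1, 64, 32, 32)           # encoder feature map
pooled, indices = pool(x)                # (1, 64, 16, 16) plus saved argmax indices

# ... decoder work happens at the lower resolution ...

unpooled = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)
print(unpooled.shape)                    # torch.Size([1, 64, 32, 32]); sparse map
# The decoder's following convolutions then densify this sparse map.
```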

Key advantages of SegNet include:

  • Memory efficiency: Since the decoder uses pooling indices for upsampling rather than storing feature maps, SegNet is more memory-efficient than some other segmentation networks.
  • Computational efficiency: The design of SegNet's decoder is simple, not requiring complex upsampling operations, thus it is computationally efficient.
  • Edge preservation: By using pooling indices for upsampling, SegNet can better recover image edge information, which is crucial for image segmentation tasks.

The SegNet structure is as follows:

SegNet

8. Instance-Aware Segmentation

Instance-Aware Segmentation is an image segmentation method that not only identifies objects in the image (like semantic segmentation) but also distinguishes each individual object instance. This means that for multiple objects of the same category in the image, Instance-Aware Segmentation can identify and segment them separately.

Instance-Aware Segmentation typically involves the following key steps:

  1. Object Detection: First, the model needs to detect all possible objects in the image. This is usually achieved through object detection models such as Faster R-CNN, SSD, or YOLO. These models can predict the bounding boxes and categories of objects.

  2. Object Segmentation: For each detected object, the model then needs to generate a corresponding segmentation mask, i.e., precisely marking the object's contour. This can be achieved through various methods, such as using candidate regions generated by a Region Proposal Network (RPN) and then segmenting each region.

  3. Instance Differentiation: While generating segmentation masks, the model needs to distinguish different instances. This means that even if two objects belong to the same category, they should be identified as different instances, each with its own independent mask.

Representative models of Instance-Aware Segmentation include:

  • Mask R-CNN: Proposed by Kaiming He and others in 2017, Mask R-CNN is a milestone in the field of Instance-Aware Segmentation. Mask R-CNN adds a branch to Faster R-CNN for generating segmentation masks for objects. It uses RoI Align to precisely extract regions of interest from feature maps and generate high-quality segmentation masks for each region (a short usage sketch follows this list).

  • YOLACT: A real-time Instance-Aware Segmentation method that decomposes the segmentation task into two parallel tasks: generating a set of prototype masks and predicting coefficients for each object. The final segmentation mask is obtained by linearly combining the prototype masks with the coefficients.

  • SOLO (Segmenting Objects by Locations): SOLO is an end-to-end Instance-Aware Segmentation method that directly predicts object categories and instance information at each pixel location, without the need for candidate regions or anchors.
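
As a concrete usage sketch, the snippet below runs torchvision's off-the-shelf Mask R-CNN for instance segmentation; a reasonably recent torchvision is assumed, and the image path and score threshold are illustrative:

```python
# Off-the-shelf instance segmentation with torchvision's Mask R-CNN.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = to_tensor(Image.open("street.jpg").convert("RGB"))   # hypothetical path
with torch.no_grad():
    output = model([image])[0]            # boxes, labels, scores, masks per instance

keep = output["scores"] > 0.5             # confidence filter (illustrative value)
instance_masks = output["masks"][keep]    # (N, 1, H, W) soft masks, one per instance
print(instance_masks.shape)
```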

Instance-Aware Segmentation is crucial for many applications such as autonomous driving, robotic navigation, video surveillance, and medical image analysis. Compared to traditional semantic segmentation, Instance-Aware Segmentation provides a more detailed understanding of the scene because it can distinguish each object instance in the image. The challenge of this method lies in accurately detecting, segmenting, and distinguishing each instance, especially in cases of object overlap or occlusion. With the development of deep learning technology, the accuracy and efficiency of Instance-Aware Segmentation have significantly improved.

Instance-Aware Segmentation

9. DeepLab

DeepLab is a series of deep learning models for semantic segmentation developed by Liang-Chieh Chen and others from Google. Since its inception in 2014, DeepLab has gone through multiple iterations, including DeepLab v1, v2, v3, and v3+. The core feature of DeepLab models is their combination of atrous convolution (also known as dilated convolution) and fully connected conditional random fields (CRFs) to improve segmentation accuracy and speed.

Here are some key components of DeepLab models:

  1. Atrous Convolution:

    • Atrous convolution is one of the core components of DeepLab, allowing the network to increase the receptive field (the size of the input image area that the network can "see") without reducing image resolution.
    • By adjusting the dilation rate of the convolutional kernel, the size of the receptive field can be controlled, which is useful for capturing multi-scale information in images (see the dilation example after this list).
  2. ASPP (Atrous Spatial Pyramid Pooling):

    • ASPP was introduced in DeepLab v2, using atrous convolutions with different dilation rates to capture multi-scale information in parallel.
    • In DeepLab v3, ASPP was further improved by adding a global average pooling layer to capture broader image context information.
  3. Fully Connected CRFs:

    • In the early versions of DeepLab, fully connected CRFs were used as a post-processing step to refine segmentation results and better capture details, especially near object edges.
    • CRFs model the relationships between pixels, effectively smoothing noise in segmentation results while maintaining edge clarity.
  4. Encoder-Decoder Structure:

    • DeepLab v3+ introduced an encoder-decoder structure, where the encoder uses atrous convolution to extract features, and the decoder is used to restore the resolution of segmentation results.
    • The decoder part includes a simple upsampling module that combines deep features from the encoder with low-level features to improve segmentation accuracy.
  5. Depthwise Separable Convolution:

    • In DeepLab v3+, depthwise separable convolution is used to improve model efficiency. This type of convolution decomposes standard convolution into depthwise convolution (per-channel convolution) and pointwise convolution (1x1 convolution), reducing the number of model parameters and computational load.
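
The dilation idea can be illustrated in a few lines of PyTorch: the three convolutions below have identical parameter counts and output resolutions, but increasingly large receptive fields (channel counts are illustrative):

```python
# Dilation enlarges the receptive field of a 3x3 kernel without adding
# parameters or reducing resolution.
import torch
import torch.nn as nn

x = torch.randn(1, 256, 64, 64)

standard = nn.Conv2d(256, 256, kernel_size=3, padding=1)               # 3x3 field
atrous2  = nn.Conv2d(256, 256, kernel_size=3, padding=2, dilation=2)   # 5x5 field
atrous4  = nn.Conv2d(256, 256, kernel_size=3, padding=4, dilation=4)   # 9x9 field

# Same parameter count, same output resolution, growing receptive field.
for conv in (standard, atrous2, atrous4):
    print(conv(x).shape, sum(p.numel() for p in conv.parameters()))
```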

DeepLab models have achieved state-of-the-art performance on multiple public semantic segmentation datasets, including PASCAL VOC, Cityscapes, and ADE20K. They excel at handling objects with complex textures and various scales, thanks to the ability of atrous convolution and ASPP to capture multi-scale information, and the role of CRFs in refining segmentation results.

The DeepLab series of models have had a significant impact on the development of the field of semantic segmentation. Their design concepts and techniques have been widely applied to other segmentation models and have inspired a series of innovative subsequent research.

DeepLab

10. DeepLabv3

DeepLabv3 is the third iteration of the DeepLab model series, proposed by Google in 2017. DeepLabv3 continues to use core features of the DeepLab series, such as atrous convolution, and introduces an improved Atrous Spatial Pyramid Pooling (ASPP) module to more effectively capture multi-scale information and enhance semantic segmentation performance.

Key features and components of DeepLabv3 include:

  1. Atrous Convolution:

    • DeepLabv3 uses atrous convolution to expand the receptive field, allowing the model to capture broader contextual information without reducing the resolution of the feature map.
    • Atrous convolution achieves this by inserting "holes" (i.e., increasing the spacing between elements in the convolutional kernel), which increases the receptive field without increasing the number of parameters.
  2. Improved ASPP:

    • ASPP is a core component of DeepLabv3, using multiple atrous convolutions with different dilation rates to process the input feature map in parallel, capturing information at different scales.
    • In DeepLabv3, ASPP was further improved by adding a global average pooling branch to capture image-level contextual information. The output of global average pooling is upsampled and concatenated with the other scales' feature maps to form the final ASPP output (a simplified sketch follows this list).
  3. Encoder-Decoder Structure:

    • While DeepLabv3 itself does not adopt a typical encoder-decoder structure, its ASPP module can be seen as part of the encoder, and in the subsequent DeepLabv3+, a decoder structure was introduced to further improve segmentation accuracy.
  4. Depthwise Separable Convolution:

    • Although DeepLabv3 does not explicitly use depthwise separable convolution, it was used in DeepLabv3+ to improve model efficiency.
  5. End-to-End Training:

    • DeepLabv3 can be trained end-to-end, meaning the entire process from input image to segmentation map can be optimized through backpropagation.
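
A simplified, hypothetical PyTorch sketch in the spirit of DeepLabv3's ASPP, with parallel atrous branches plus an image-level pooling branch (the 6/12/18 dilation rates follow the common configuration; channel counts are illustrative):

```python
# Parallel atrous branches + image-level pooling, concatenated and fused 1x1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.image_pool = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                        nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

aspp = SimpleASPP(in_ch=2048, out_ch=256)
print(aspp(torch.randn(1, 2048, 33, 33)).shape)   # torch.Size([1, 256, 33, 33])
```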

DeepLabv3 has achieved significant performance improvements on multiple public semantic segmentation datasets, especially in handling objects with complex textures and various scales. Its success demonstrates the effectiveness of using atrous convolution and ASPP to capture multi-scale information in deep learning models.

The design ideas and techniques of DeepLabv3 have had a profound impact on subsequent semantic segmentation research and laid the foundation for the next version of the DeepLab series, DeepLabv3+, which further improved segmentation accuracy and detail recovery capabilities by introducing a decoder structure.

DeepLabv3

11. RefineNet

RefineNet is a deep learning network for image semantic segmentation, proposed by Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid from the University of Adelaide in 2016. RefineNet is designed to address the resolution reduction and loss of detail caused by consecutive pooling or strided convolution operations. To overcome this challenge, RefineNet adopts a multi-path refinement network that effectively integrates features of different resolutions and progressively restores image details and resolution.

Key features and components of RefineNet include:

  1. Multi-scale Feature Fusion:

    • RefineNet establishes multiple refinement modules (RefineNet blocks) to fuse feature maps from different stages. These modules combine deep, semantically rich features with shallow, detail-rich features (a fusion sketch follows this list).
    • Each refinement module contains a residual convolutional unit for processing and fusing features.
  2. Residual Connections:

    • RefineNet uses residual connections within refinement modules to facilitate gradient propagation and aid network training.
  3. Long Skip Connections:

    • To preserve high-resolution detail information, RefineNet introduces long skip connections that directly pass shallow features to the deeper parts of the network.
  4. Progressive Upsampling:

    • RefineNet progressively upsamples feature maps rather than performing a one-time large-scale upsampling at the end of the network. This progressive approach helps to gradually restore image details.
  5. Chained Residual Pooling:

    • RefineNet proposed a chained residual pooling structure that captures multi-scale information in background regions, helping to improve segmentation of large homogeneous areas.
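
A rough, hypothetical PyTorch sketch of the multi-resolution fusion step described above, fusing a coarse, deep feature map with a finer, shallow one (channel counts are illustrative, and the residual convolutional units and chained residual pooling are omitted):

```python
# Each path is convolved, the coarse path is upsampled to the finer
# resolution, and the two are summed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiResolutionFusion(nn.Module):
    def __init__(self, channels_low, channels_high, out_ch):
        super().__init__()
        self.conv_low = nn.Conv2d(channels_low, out_ch, 3, padding=1)    # deep, coarse path
        self.conv_high = nn.Conv2d(channels_high, out_ch, 3, padding=1)  # shallow, fine path

    def forward(self, low, high):
        low = F.interpolate(self.conv_low(low), size=high.shape[-2:],
                            mode="bilinear", align_corners=False)
        return low + self.conv_high(high)           # fused at the finer resolution

fuse = MultiResolutionFusion(channels_low=512, channels_high=256, out_ch=256)
out = fuse(torch.randn(1, 512, 16, 16), torch.randn(1, 256, 32, 32))
print(out.shape)                                     # torch.Size([1, 256, 32, 32])
```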

RefineNet has achieved excellent performance on multiple public semantic segmentation datasets, including PASCAL VOC, Cityscapes, and NYUDv2. Its success lies in effectively combining deep and shallow features while preserving image details and contextual information.

The design philosophy of RefineNet has influenced subsequent image segmentation research, particularly in how to effectively combine features of different levels. Its multi-path refinement strategy has proven to be an effective method for improving accuracy and detail recovery capabilities in segmentation tasks.

RefineNet

12. PSPNet

PSPNet (Pyramid Scene Parsing Network) is a semantic segmentation network proposed by Hengshuang Zhao and others from The Chinese University of Hong Kong in 2016. The core innovation of PSPNet is the introduction of a Pyramid Pooling Module that captures global context information from different regions, thereby enhancing the understanding of complex scenes.

Key features and components of PSPNet include:

  1. Base Network:

    • PSPNet typically uses a pre-trained deep convolutional neural network (such as ResNet) as its backbone network for extracting image features.
  2. Pyramid Pooling Module:

    • The Pyramid Pooling Module is the core of PSPNet, consisting of multiple pooling layers at different scales. Each pooling layer performs global average pooling on the feature map to capture contextual information from different regions.
    • The outputs of these pooling layers are upsampled to the same size as the original feature map and concatenated together, allowing multi-scale contextual information to be fused at each pixel location (a sketch follows this list).
  3. Convolutional Layer and Upsampling:

    • The concatenated feature map is processed by an additional convolutional layer to fuse features of different scales.
    • The feature map is then upsampled to the original resolution of the input image for pixel-level classification.
  4. Auxiliary Loss:

    • During training, PSPNet may add an auxiliary classifier at intermediate layers of the backbone network to provide additional gradient signals and improve model training.
  5. End-to-End Training:

    • PSPNet can be trained end-to-end, meaning the entire process from input image to segmentation map can be optimized through backpropagation.
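
A simplified, hypothetical PyTorch sketch of the Pyramid Pooling Module (the 1/2/3/6 bin sizes follow the common configuration; channel counts are illustrative):

```python
# Pool at several grid sizes, project with 1x1 convs, upsample back, and
# concatenate with the input feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b), nn.Conv2d(in_ch, out_ch, 1))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[-2:]
        pyramids = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                  align_corners=False) for stage in self.stages]
        return torch.cat([x] + pyramids, dim=1)       # (B, 2*in_ch, H, W)

ppm = PyramidPooling(in_ch=2048)
print(ppm(torch.randn(1, 2048, 60, 60)).shape)        # torch.Size([1, 4096, 60, 60])
```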

PSPNet has achieved significant performance on multiple public semantic segmentation datasets, including PASCAL VOC, Cityscapes, and ADE20K. It is particularly adept at handling scenes that require modeling of large-scale contextual information, such as street scene understanding and indoor layout analysis.

The success of PSPNet demonstrates the effectiveness of the Pyramid Pooling Module in capturing multi-scale contextual information, which is crucial for improving the accuracy of semantic segmentation. The design philosophy and techniques of PSPNet have had a profound impact on subsequent semantic segmentation research and have inspired a series of new models and methods that further explore how to effectively integrate local and global information in deep learning models.

PSPNet

13. Dense-Net

DenseNet (Densely Connected Convolutional Networks) is a deep convolutional neural network architecture proposed by Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger in 2016. The core idea of DenseNet is to improve the flow of information and gradients through dense connectivity, thereby enhancing network performance and reducing the number of parameters.

Key features and components of DenseNet include:

  1. Dense Connectivity:

    • In DenseNet, the output of each convolutional layer is used as input for all subsequent layers: the input to layer i is the concatenation of the outputs of all earlier layers (0 through i-1).
    • This dense connectivity pattern ensures that each layer in the network has direct access to the feature maps of all preceding layers, improving feature reuse and reducing the number of parameters (see the dense-block sketch after this list).
  2. Growth Rate:

    • The growth rate is a key hyperparameter in DenseNet, defining the number of feature map channels produced by each convolutional layer. Since the output of each layer is accumulated, the growth rate is typically set to be relatively small to control the complexity of the model.
  3. Composite Function:

    • Within each dense block, convolutional layers are typically composed of a composite function consisting of Batch Normalization, an activation function (such as ReLU), and a convolution operation.
  4. Transition Layers:

    • To control the size of the feature maps, DenseNet introduces transition layers between different dense blocks. Transition layers typically include batch normalization, 1x1 convolution, and average pooling operations.
  5. Global Average Pooling:

    • At the end of the network, DenseNet uses a global average pooling layer to reduce the spatial dimensions of the feature map, which helps to reduce model parameters and prevent overfitting.
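
A minimal, hypothetical PyTorch dense block showing the concatenation pattern and the growth rate (the composite function below is BN-ReLU-Conv; transition layers are omitted):

```python
# Every layer receives the concatenation of all previous feature maps and
# adds `growth_rate` new channels.
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            channels = in_ch + i * growth_rate
            self.layers.append(nn.Sequential(         # BN -> ReLU -> Conv composite
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, 3, padding=1)))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))    # dense connectivity
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseBlock(in_ch=64, growth_rate=32, num_layers=4)
print(block(torch.randn(1, 64, 32, 32)).shape)         # torch.Size([1, 192, 32, 32])
```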

Advantages of DenseNet include:

  • Parameter Efficiency: Due to feature reuse, DenseNet can achieve comparable or better performance with fewer parameters than other networks.
  • Improved Gradient Flow: Dense connectivity provides direct paths for gradients, aiding in training deeper networks.
  • Feature Propagation: Every layer in the network has access to the original input and all preceding layer features, improving the efficiency of feature propagation.
  • Feature Reuse: The network can reuse features across different layers, reducing redundancy.

DenseNet has achieved significant performance on various image classification tasks, including CIFAR-10, CIFAR-100, and ImageNet datasets. Its design philosophy has also inspired the development of other network architectures, such as ResNeXt and Dual Path Networks (DPN). DenseNet's dense connectivity pattern is important for understanding feature and gradient flow in deep learning models.

Dense-Net

14. Mask R-CNN

Mask R-CNN is a popular instance segmentation model proposed by Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick from Facebook AI Research (FAIR) in 2017. Mask R-CNN builds upon Faster R-CNN by adding a branch for generating high-quality segmentation masks for each detected object, achieving precise pixel-level segmentation of each object in the image.

The architecture of Mask R-CNN mainly includes the following parts:

  1. Backbone Network:

    • Mask R-CNN uses a deep convolutional neural network (such as ResNet with a Feature Pyramid Network FPN) as the backbone network for extracting image features.
  2. Region Proposal Network (RPN):

    • RPN generates object candidate regions (proposals) from the feature maps of the backbone network. These candidate regions are then used for detection and segmentation tasks.
  3. RoI Align:

    • Mask R-CNN introduced the RoI Align layer to address the quantization error issue in RoI Pooling. RoI Align uses bilinear interpolation to accurately compute features for each region of interest, maintaining spatial alignment with the input image (see the roi_align example after this list).
  4. Classification and Bounding Box Regression:

    • For each candidate region, Mask R-CNN predicts the object's category and fine-tunes the bounding box adjustment.
  5. Mask Prediction:

    • Mask R-CNN adds a parallel branch for each candidate region to predict the object's segmentation mask. This branch is a small Fully Convolutional Network (FCN) that outputs a binary mask for each category.
  6. Multi-task Loss:

    • Mask R-CNN is trained by minimizing a multi-task loss that combines classification loss, bounding box regression loss, and mask prediction loss.
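
RoI Align is available as a ready-made op; the hedged snippet below uses torchvision.ops.roi_align with made-up boxes and an assumed feature stride of 16:

```python
# RoI Align with bilinear sampling avoids RoI Pooling's quantization error.
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)           # backbone feature map (stride 16 assumed)
# Boxes in (batch_index, x1, y1, x2, y2) format, in input-image coordinates.
boxes = torch.tensor([[0, 32.0, 48.0, 256.0, 320.0],
                      [0, 100.0, 60.0, 180.0, 200.0]])

# spatial_scale maps image coordinates onto the feature map; bilinear sampling
# keeps sub-pixel alignment instead of snapping to the feature grid.
roi_feats = roi_align(features, boxes, output_size=(14, 14),
                      spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)
print(roi_feats.shape)                            # torch.Size([2, 256, 14, 14])
```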

Mask R-CNN has achieved remarkable performance on various instance segmentation benchmarks, including COCO and PASCAL VOC datasets. Its success lies in effectively combining object detection and pixel-level segmentation while maintaining efficient computational performance.

The design of Mask R-CNN has had a profound impact on the field of instance segmentation, and its ideas and techniques have been widely applied to many subsequent research and applications. Additionally, the flexibility and extensibility of Mask R-CNN make it adaptable to various tasks and scenarios, including human pose estimation, 3D reconstruction, and video segmentation.

Mask R-CNN

15. PANet

PANet (Path Aggregation Network) is a deep learning architecture for instance segmentation and object detection, proposed by Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia in 2018. PANet aims to improve model performance by enhancing the feature pyramid and the use of bottom-level features. It builds upon Mask R-CNN with improvements, particularly in feature extraction and information flow.

Key features and components of PANet include:

  1. Enhanced Feature Pyramid Network (FPN):

    • PANet enhances FPN to improve information flow. Traditional FPN adds a top-down pathway that propagates high-level semantic information to the lower levels; PANet adds a complementary bottom-up path augmentation so that accurate low-level localization signals reach the higher levels along a much shorter path (a sketch follows this list).
  2. Adaptive Feature Pooling:

    • PANet proposes adaptive feature pooling, allowing the model to pool features at all levels of the feature pyramid, not just at a specific level. This better utilizes multi-scale features.
  3. Global Context Module:

    • PANet introduces a global context module used after RoI Align to capture the context of the entire image and integrate this information into the features of each RoI.
  4. Stronger Mask Branch:

    • Building on Mask R-CNN, PANet enhances the mask branch to predict more accurate segmentation masks.
  5. Multi-task Training Strategy:

    • PANet adopts a multi-task training strategy, optimizing both object detection and instance segmentation tasks simultaneously.
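
A rough, hypothetical PyTorch sketch of the bottom-up path augmentation: each new level is built from the downsampled previous level plus the corresponding FPN level (channel count and number of levels are illustrative):

```python
# N_{i+1} = smooth(downsample(N_i) + P_{i+1}), starting from N2 = P2.
import torch
import torch.nn as nn

class BottomUpAugmentation(nn.Module):
    def __init__(self, channels=256, num_levels=4):
        super().__init__()
        self.downsample = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_levels - 1))
        self.smooth = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(num_levels - 1))

    def forward(self, fpn_levels):             # [P2, P3, P4, P5], fine -> coarse
        outs = [fpn_levels[0]]                 # N2 = P2
        for i, p in enumerate(fpn_levels[1:]):
            n = self.smooth[i](self.downsample[i](outs[-1]) + p)
            outs.append(n)
        return outs                            # [N2, N3, N4, N5]

pan = BottomUpAugmentation()
feats = [torch.randn(1, 256, s, s) for s in (64, 32, 16, 8)]
print([f.shape[-1] for f in pan(feats)])       # [64, 32, 16, 8]
```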

PANet

16. DANet

DANet (Dual Attention Network) is a deep learning model for semantic segmentation proposed by Jun Fu, Jing Liu, and colleagues from the Institute of Automation, Chinese Academy of Sciences in 2019. The core innovation of DANet is the introduction of two attention mechanisms after the feature extraction network: a Position (spatial) Attention module and a Channel Attention module, which enhance the model's adaptability and discrimination for different regions and features in the image.

Key features and components of DANet include:

  1. Base Network:

    • DANet typically uses a pre-trained deep convolutional neural network (such as ResNet) as its backbone network for extracting image features.
  2. Position Attention Module (PAM):

    • The PAM module captures long-range spatial dependencies by calculating the relationship between each position and all other positions in the feature map. This helps the model better understand the context in the image and improves segmentation accuracy (a sketch follows this list).
  3. Channel Attention Module (CAM):

    • The CAM module focuses on the inter-channel relationships of the feature map, enhancing the model's discrimination ability for different semantic features by calculating the interdependence between channels.
  4. Feature Fusion:

    • DANet fuses the outputs of the PAM and CAM modules with the original feature map to integrate spatial and channel attention information.
  5. End-to-End Training:

    • DANet can be trained end-to-end, meaning the entire process from input image to segmentation map can be optimized through backpropagation.
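
A hypothetical PyTorch sketch in the spirit of the Position Attention Module: every spatial position attends to every other position, and the result is added back residually (the C/8 reduction and the learned gamma follow the common formulation; sizes are illustrative):

```python
# Position attention: pairwise affinities over all H*W locations.
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.query = nn.Conv2d(in_ch, in_ch // 8, 1)
        self.key = nn.Conv2d(in_ch, in_ch // 8, 1)
        self.value = nn.Conv2d(in_ch, in_ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))    # learned residual weight

    def forward(self, x):                            # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)         # (B, HW, C/8)
        k = self.key(x).flatten(2)                            # (B, C/8, HW)
        attn = torch.softmax(q @ k, dim=-1)                   # (B, HW, HW) affinities
        v = self.value(x).flatten(2)                          # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).reshape(b, c, h, w)
        return self.gamma * out + x                           # residual fusion

pam = PositionAttention(in_ch=512)
print(pam(torch.randn(1, 512, 16, 16)).shape)                 # torch.Size([1, 512, 16, 16])
```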

DANet has achieved excellent performance on multiple public semantic segmentation datasets, including Cityscapes, PASCAL VOC, and ADE20K. It is particularly adept at handling tasks that require fine-grained segmentation of complex scenes.

The success of DANet demonstrates the effectiveness of attention mechanisms in improving semantic segmentation performance. By introducing spatial and channel attention, DANet can better capture details and context in images, thereby improving segmentation accuracy. The design philosophy and techniques of DANet have influenced subsequent semantic segmentation research and have inspired a series of new models and methods that further explore how to effectively integrate attention mechanisms in deep learning models.

DANet

17. FastFCN

FastFCN (Fast Fully Convolutional Network) is a deep learning model for semantic segmentation proposed by Huikai Wu, Junge Zhang, and colleagues from the Institute of Automation, Chinese Academy of Sciences in 2019. FastFCN aims to address the efficiency issues of traditional dilated Fully Convolutional Networks (FCNs) when processing large-size input images. To improve efficiency, FastFCN introduces the Joint Pyramid Upsampling (JPU) module and adopts an encoder-decoder architecture.

Key features and components of FastFCN include:

  1. Encoder:

    • FastFCN uses a pre-trained deep convolutional neural network (such as ResNet) as the encoder to extract image features.
    • The encoder gradually reduces the spatial resolution of the feature map through a series of convolutional and pooling layers, increasing the depth and semantic richness of the features.
  2. Joint Pyramid Upsampling (JPU) Module:

    • The JPU module is the core of FastFCN, efficiently fusing and upsampling multi-scale features.
    • Unlike traditional layer-by-layer upsampling, JPU processes feature maps from multiple scales jointly within a single module, using fewer upsampling operations and thus improving efficiency (a simplified sketch follows this list).
  3. Decoder:

    • The decoder part uses the output of the JPU module to restore the spatial resolution of the feature map and generate the final segmentation map.
    • The decoder typically includes some convolutional layers for further refining features and performing pixel-level classification.
  4. Auxiliary Loss:

    • During training, FastFCN may add an auxiliary classifier at intermediate layers of the encoder to provide additional gradient signals and improve model training.
  5. End-to-End Training:

    • FastFCN can be trained end-to-end, meaning the entire process from input image to segmentation map can be optimized through backpropagation.
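
A heavily simplified, hypothetical PyTorch sketch of the JPU idea: several backbone levels are projected, upsampled to one resolution, merged, and processed in parallel by dilated convolutions (channel counts, dilation rates, and the use of plain rather than separable convolutions are simplifications):

```python
# Simplified joint-upsampling module over three backbone levels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleJPU(nn.Module):
    def __init__(self, in_channels=(512, 1024, 2048), width=256, rates=(1, 2, 4, 8)):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, width, 3, padding=1) for c in in_channels)
        merged = width * len(in_channels)
        self.dilated = nn.ModuleList(
            nn.Conv2d(merged, width, 3, padding=r, dilation=r) for r in rates)

    def forward(self, feats):                        # [c3, c4, c5], fine -> coarse
        size = feats[0].shape[-2:]
        ups = [F.interpolate(p(f), size=size, mode="bilinear", align_corners=False)
               for p, f in zip(self.proj, feats)]
        merged = torch.cat(ups, dim=1)
        return torch.cat([d(merged) for d in self.dilated], dim=1)

jpu = SimpleJPU()
feats = [torch.randn(1, c, s, s) for c, s in ((512, 60), (1024, 30), (2048, 15))]
print(jpu(feats).shape)                              # torch.Size([1, 1024, 60, 60])
```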

FastFCN has achieved excellent performance on multiple public semantic segmentation datasets, including Cityscapes and ADE20K. Its success lies in effectively processing large-size input images while maintaining high segmentation accuracy and improving computational efficiency.

FastFCN

18. Gated-SCNN

Gated-SCNN (Gated Shape CNN) is a deep learning model for semantic segmentation proposed by Towaki Takikawa, David Acuna, Varun Jampani, and Sanja Fidler in 2019. Gated-SCNN's main feature is the introduction of a dual-branch network structure, with one branch for capturing shape information and another for regular semantic segmentation. This design aims to improve segmentation accuracy by explicitly modeling and utilizing shape information.

Key features and components of Gated-SCNN include:

  1. Dual-Branch Network Structure:

    • Gated-SCNN contains two parallel branches: one for regular semantic segmentation (referred to as the "semantic branch") and another specifically for capturing object shape information (referred to as the "shape branch").
  2. Shape Branch:

    • The shape branch uses a series of convolutional layers and gated convolutional layers to extract edge and contour information of objects in the image. These gated convolutional layers can learn to control the flow of information, better capturing shape features.
  3. Semantic Branch:

    • The semantic branch uses a standard convolutional neural network structure to extract semantic information from the image and perform pixel-level classification.
  4. Feature Fusion:

    • Gated-SCNN uses a special fusion module to combine the features from the shape and semantic branches. This allows the two types of information to complement each other, improving segmentation performance.
  5. Gating Mechanism:

    • The gating mechanism is key to Gated-SCNN, allowing the model to dynamically adjust the flow of information between different features. This mechanism helps the model focus on important features and suppress irrelevant or noisy ones (a sketch follows this list).
  6. End-to-End Training:

    • Gated-SCNN can be trained end-to-end, meaning the entire process from input image to segmentation map can be optimized through backpropagation.
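
A minimal, hypothetical PyTorch sketch of a gating step in this spirit: a sigmoid gate computed from both branches modulates the shape-branch features position by position (channel counts are illustrative, and the real model gates at several stages):

```python
# Gate = sigmoid(conv(concat(semantic, shape))); gated shape features flow on.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, semantic_ch, shape_ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(semantic_ch + shape_ch, 1, 1), nn.Sigmoid())

    def forward(self, semantic_feat, shape_feat):
        g = self.gate(torch.cat([semantic_feat, shape_feat], dim=1))  # (B, 1, H, W)
        return shape_feat * g          # gated shape features, ready to be fused back

fuse = GatedFusion(semantic_ch=256, shape_ch=32)
out = fuse(torch.randn(1, 256, 64, 64), torch.randn(1, 32, 64, 64))
print(out.shape)                        # torch.Size([1, 32, 64, 64])
```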

Gated-SCNN has achieved excellent performance on multiple public semantic segmentation datasets, particularly excelling in handling image details and edges. Its success proves that explicitly modeling shape information can improve segmentation accuracy.

Gated-SCNN

19. OneFormer

OneFormer is a universal image segmentation Transformer that consolidates semantic, instance, and panoptic segmentation into a single model architecture trained once with a single set of weights. It was proposed by researchers from SHI Labs and Picsart AI Research in 2022.

Architecture

OneFormer's architecture is based on the Transformer, a neural network architecture that relies on attention mechanisms. The model consists of the following main components:

  • Image Encoder: Encodes the input image into a set of feature vectors.
  • Transformer Layers: Processes the feature vectors using self-attention mechanisms to capture global and local relationships within the image.
  • Task-Conditioned Head: Transforms the output of the Transformer layers into predictions for the requested segmentation task.

Unity

OneFormer's unity is reflected in its ability to handle semantic, instance, and panoptic segmentation without modifications to the model architecture or separate sets of weights. This is achieved by conditioning the model on a task token: for semantic segmentation the model produces a per-pixel class map, while for instance and panoptic segmentation it produces a set of masks with associated class labels (a usage sketch follows).
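
A hedged usage sketch, assuming the Hugging Face transformers implementation of OneFormer (OneFormerProcessor / OneFormerForUniversalSegmentation); the checkpoint name and image path are illustrative:

```python
# Task-conditioned inference: the task token selects semantic, instance,
# or panoptic segmentation from the same weights.
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

checkpoint = "shi-labs/oneformer_ade20k_swin_tiny"    # assumed public checkpoint
processor = OneFormerProcessor.from_pretrained(checkpoint)
model = OneFormerForUniversalSegmentation.from_pretrained(checkpoint)

image = Image.open("scene.jpg").convert("RGB")        # hypothetical path
inputs = processor(images=image, task_inputs=["semantic"], return_tensors="pt")
outputs = model(**inputs)

semantic_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]])[0]      # (H, W) class-id map
```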

Advantages

OneFormer has the following advantages:

  • Unity: It can handle multiple segmentation tasks without modifications to the model architecture or task-specific fine-tuning.
  • Efficiency: It uses Transformer layers, an architecture built on attention that parallelizes well.
  • Accuracy: It has achieved state-of-the-art performance across semantic, instance, and panoptic segmentation benchmarks.

Applications

OneFormer has been successfully applied to the following segmentation tasks:

  • Semantic segmentation
  • Instance segmentation
  • Panoptic segmentation

OneFormer

20. PSPNet-ResNet50_PSSL

While fine-tuning pre-trained networks has become a popular way to train image segmentation models, the backbone networks used for segmentation are usually pre-trained on image classification datasets such as ImageNet. Although classification datasets give the backbone rich visual features and discriminative power, they do not pre-train the full target model (backbone plus segmentation module) end to end. The Pseudo Semantic Segmentation Labels (PSSL) method addresses this by enabling end-to-end pre-training of segmentation models on classification datasets. A PSSL for each image is obtained by interpreting the classification results and aggregating explanations queried from multiple classifiers, which reduces the bias introduced by any single model. Using the PSSL for every image in ImageNet, the method then pre-trains the segmentation network at scale with a weighted segmentation learning procedure (a rough sketch of such a weighted objective follows).
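
A rough, hypothetical PyTorch sketch of the kind of weighted, soft-label pixel-wise objective such pre-training describes; the tensor shapes and weighting scheme are assumptions for illustration, not the paper's exact loss:

```python
# Soft pseudo labels per pixel, weighted by a per-pixel confidence.
import torch
import torch.nn.functional as F

def weighted_pseudo_label_loss(logits, pseudo_labels, pixel_weights):
    # logits:        (B, num_classes, H, W) from the segmentation network
    # pseudo_labels: (B, num_classes, H, W) soft distributions from aggregated explanations
    # pixel_weights: (B, H, W) confidence weight for each pixel
    log_probs = F.log_softmax(logits, dim=1)
    per_pixel = -(pseudo_labels * log_probs).sum(dim=1)      # soft cross entropy
    return (per_pixel * pixel_weights).mean()

loss = weighted_pseudo_label_loss(
    torch.randn(2, 1000, 32, 32),
    torch.softmax(torch.randn(2, 1000, 32, 32), dim=1),
    torch.rand(2, 32, 32))
print(loss.item())
```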

PSPNet-ResNet50_PSSL

21. Conclusion

This article reviewed 19 deep learning models in the field of computer vision for image segmentation tasks. Image segmentation technology has a wide range of applications in fields such as scene understanding, medical image analysis, robotic perception, and more.
