A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN

2017: Mask R-CNN - Extending Faster R-CNN for Pixel Level Segmentation

The goal of image instance segmentation is to identify, at a pixel level, what the different objets in a scene are.

So far, we’ve seen how we’ve been able to use CNN features in many interesting ways to effectively locate different objects in an image with bounding boxes.

Can we extend such techniques to go one step further and locate exact pixels of each object instead of just bounding boxes? This problem, known as image segmentation, is what Kaiming He and a team of researchers, including Girshick, explored at Facebook AI using an architecture known as Mask R-CNN.

Kaiming He, a researcher at Facebook AI, is lead author of Mask R-CNN and also a coauthor of Faster R-CNN.

Much like Fast R-CNN, and Faster R-CNN, Mask R-CNN’s underlying intuition is quite simple. Given that Faster R-CNN works so well for object detection, could we simply extend it to also carry out pixel level segmentation?

In Mask R-CNN, a simple Fully Convolutional Network (FCN) is added on top of the CNN features of Faster R-CNN to generate a mask (segmentation output). Notice how this is in parallel to the classification and bounding box regression network of Faster R-CNN.

Mask R-CNN does this by simply adding a branch to Faster R-CNN that outputs a binary mask that says whether or not a given pixel is part of an object. The branch (in white in the above image), as before, is just a simple Fully Convolutional Network on top of a CNN based feature map. Here are its inputs and outputs:

Inputs: CNN Feature Map.
Outputs: Matrix with 1s on all locations where the pixel belongs to the object and 0s elsewhere (this is known as a binary mask).

But the Mask R-CNN authors had to make one small adjustment to make this pipeline work as expected.

RoiAlign - Realigning RoIPool to be More Accurate

Instead of RoIPool, the image gets passed through RoIAlign so that the regions of the feature map selected by RoIPool correspond more precisely to the regions of the original image. This is needed because pixel level segmentation requires more fine-grained alignment than bounding boxes.

When run without modifications on the original Faster R-CNN architecture, the Mask R-CNN authors realized that the regions of the feature map selected by RoIPool were slightly misaligned from the regions of the original image. Since image segmentation requires pixel level specificity, unlike bounding boxes, this naturally led to inaccuracies.

The authors were able to solve this problem by simply adjusting RoIPool to be more precisely aligned using a method known as RoIAlign.

How do we accurately map a region of interest from the original image onto the feature map?

Imagine we have an image of size 128x128 and a feature map of size 25x25. Let’s imagine we want features the region corresponding to the top-left 15x15 pixels in the original image (see above). How might we select these pixels from the feature map?

We know each pixel in the original image corresponds to ~ 25/128 pixels in the original image. To select 15 pixels from the original image, we just select 15 * 25/128 ~= 2.93 pixels.

In RoIPool, we would round this down and select 2 pixels causing a slight misalignment. However, in RoIAlign, we avoid such rounding. Instead, we use bilinear interpolation to get a precise idea of what would be at pixel 2.93. This, at a high level, is what allows us to avoid the misalignments caused by RoIPool.

Once these masks are generated, Mask R-CNN simply combines them with the classifications and bounding boxes from Faster R-CNN to generate such wonderfully precise segmentations: