2017: Mask R-CNN - Extending Faster R-CNN for Pixel Level Segmentation
So far, we’ve seen how CNN features can be used in many interesting ways to locate different objects in an image with bounding boxes.
Can we extend such techniques to go one step further and locate the exact pixels of each object instead of just bounding boxes? This problem, known as image segmentation, is what Kaiming He and a team of researchers, including Girshick, explored at Facebook AI using an architecture known as Mask R-CNN.
Much like Fast R-CNN and Faster R-CNN, Mask R-CNN’s underlying intuition is quite simple. Given that Faster R-CNN works so well for object detection, could we extend it to also carry out pixel-level segmentation?
Mask R-CNN does this by simply adding a branch to Faster R-CNN that outputs a binary mask that says whether or not a given pixel is part of an object. The branch (in white in the above image), as before, is just a simple Fully Convolutional Network on top of a CNN based feature map. Here are its inputs and outputs:
- Inputs: CNN Feature Map.
- Outputs: Matrix with 1s on all locations where the pixel belongs to the object and 0s elsewhere (this is known as a binary mask).
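To make the output format concrete, here is a minimal sketch of how a grid of per-pixel scores could be turned into a binary mask. The function name `binarize_mask`, the 0.5 threshold default, and the 4x4 score grid are illustrative assumptions, not code from the Mask R-CNN authors.

```python
def binarize_mask(probabilities, threshold=0.5):
    """Turn per-pixel object scores into a binary mask of 1s and 0s.

    `threshold` is an assumed cutoff chosen for illustration.
    """
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


# Hypothetical 4x4 grid of per-pixel scores from a mask branch.
scores = [
    [0.1, 0.2, 0.1, 0.0],
    [0.3, 0.8, 0.9, 0.2],
    [0.2, 0.9, 0.7, 0.1],
    [0.0, 0.4, 0.2, 0.1],
]

mask = binarize_mask(scores)
for row in mask:
    print(row)
```

In this toy grid, only the four central pixels score above the cutoff, so they are the only 1s in the resulting mask.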
But the Mask R-CNN authors had to make one small adjustment to make this pipeline work as expected.
RoIAlign - Realigning RoIPool to be More Accurate
When they ran the mask branch on the original Faster R-CNN architecture without modifications, the Mask R-CNN authors realized that the regions of the feature map selected by RoIPool were slightly misaligned from the regions of the original image. Since image segmentation, unlike bounding box detection, requires pixel-level specificity, this naturally led to inaccuracies.
The authors were able to solve this problem by simply adjusting RoIPool to be more precisely aligned using a method known as RoIAlign.
Imagine we have an image of size 128x128 and a feature map of size 25x25. Let’s imagine we want the features for the region corresponding to the top-left 15x15 pixels in the original image (see above). How might we select these pixels from the feature map?
We know each pixel in the original image corresponds to ~ 25/128 pixels in the feature map. To select 15 pixels from the original image, we just select 15 * 25/128 ~= 2.93 pixels in the feature map.
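The scaling arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation in this example. The helper name `to_feature_map_coords` is hypothetical, and the sizes 128 and 25 come from the example above.

```python
def to_feature_map_coords(pixels, image_size=128, feature_map_size=25):
    """Scale a length in image pixels to a length in feature-map pixels."""
    return pixels * feature_map_size / image_size


exact = to_feature_map_coords(15)
print(exact)            # 2.9296875
print(round(exact, 2))  # 2.93
```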
In RoIPool, we would round this down and select 2 pixels, causing a slight misalignment. However, in RoIAlign, we avoid such rounding. Instead, we use bilinear interpolation to get a precise idea of what would be at pixel 2.93. This, at a high level, is what allows us to avoid the misalignments caused by RoIPool.
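The difference between rounding and interpolating can be illustrated with a simplified, single-point sketch. The helper `bilinear_sample` and the toy 4x4 feature map are assumptions made for illustration. A full RoIAlign layer samples several points per output bin and aggregates them, which this sketch does not attempt.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Sample a 2D feature map at a fractional (y, x) location."""
    h, w = len(feature_map), len(feature_map[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    # Clamp the neighbouring indices so we never read past the edge.
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two surrounding rows, then along y.
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


# Hypothetical 4x4 feature map where each value is 10 * row + column.
fmap = [[10.0 * r + c for c in range(4)] for r in range(4)]

exact = 15 * 25 / 128  # 2.9296875, the fractional boundary from the example

# RoIPool-style: round down to a whole index and read that cell.
rounded_value = fmap[0][int(exact)]

# RoIAlign-style: keep the fractional location and interpolate.
interp_value = bilinear_sample(fmap, 0.0, exact)

print(rounded_value, interp_value)
```

Because the toy feature map grows linearly along each row, the interpolated value tracks the fractional position closely, while the rounded lookup snaps to the value stored at column 2.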
Once these masks are generated, Mask R-CNN simply combines them with the classifications and bounding boxes from Faster R-CNN to generate such wonderfully precise segmentations: