Mask R-CNN

Sobhan Shukueian
2 min read · Jan 23, 2022


Photo by Stefano Ciociola on Unsplash


  • Adds a branch that predicts an object mask in parallel with the existing branches.
  • Uses RoIAlign instead of RoIPool.


Mask R-CNN extends Faster R-CNN by adding a branch that predicts a segmentation mask on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding-box regression. The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner.

Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool performs coarse spatial quantization when extracting features. To fix the misalignment, the authors propose a simple, quantization-free layer called RoIAlign, which faithfully preserves exact spatial locations.
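To make the difference concrete, here is a minimal pure-Python sketch that contrasts a quantized lookup with bilinear sampling on a toy feature map. The function names (`quantized_sample`, `bilinear_sample`, `roi_align`), the use of floor for quantization, the 2×2 sampling points per bin, and the averaging step are all assumptions chosen for illustration, not the reference implementation.

```python
import math

def quantized_sample(feature, y, x):
    """RoIPool-style lookup (simplified): snap the coordinate to an integer cell.

    Floor is used here as an assumption; the point is only that the
    fractional part of the coordinate is thrown away.
    """
    return feature[int(math.floor(y))][int(math.floor(x))]

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at fractional (y, x)."""
    h, w = len(feature), len(feature[0])
    # Clamp so that the four neighbouring cells stay inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature, roi, out_size=2, samples=2):
    """Simplified RoIAlign: average `samples` x `samples` bilinear samples per bin.

    roi = (y1, x1, y2, x2) in feature-map coordinates and may be fractional;
    no coordinate is rounded at any point.
    """
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sampling points inside bin (i, j).
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear_sample(feature, y, x)
            row.append(total / (samples * samples))
        out.append(row)
    return out

# Toy 4x4 feature map whose value is a linear ramp: feature[y][x] = 10*y + x.
feature = [[10 * y + x for x in range(4)] for y in range(4)]
exact = bilinear_sample(feature, 1.25, 2.5)    # uses the fractional position
coarse = quantized_sample(feature, 1.25, 2.5)  # fractional part is lost
pooled = roi_align(feature, (0.5, 0.5, 2.5, 2.5))
```

On this linear ramp the bilinear sample recovers the value at the exact fractional location, while the quantized lookup returns the value of the cell it was snapped to. That gap is the misalignment RoIAlign is meant to remove.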


  • The mask branch only adds a small computational overhead.
  • RoIAlign has a large impact: it improves mask accuracy by a relative 10% to 50%.
  • The additional mask output is a binary mask for each RoI and is distinct from the class and box outputs. This is in contrast to most recent systems, where classification depends on mask predictions.
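The decoupling of mask and class prediction can be sketched as follows: the class branch picks the label, and the mask branch only supplies the binary mask for that label. This is a rough illustration in pure Python; `select_binary_mask`, the argument layout, and the 0.5 threshold are assumptions made for the example.

```python
import math

def stable_sigmoid(z):
    """Sigmoid that avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def select_binary_mask(mask_logits, class_scores, threshold=0.5):
    """Pick the mask of the class chosen by the classification branch.

    mask_logits:  K masks, each an m x m nested list of logits (hypothetical layout).
    class_scores: K scores from the classification branch.
    The class decision never looks at the mask outputs.
    """
    k = max(range(len(class_scores)), key=lambda i: class_scores[i])
    binary = [[1 if stable_sigmoid(z) > threshold else 0 for z in row]
              for row in mask_logits[k]]
    return k, binary

# Two classes, 2 x 2 masks. Class 0 wins on score, so only its mask is used.
mask_logits = [
    [[-2.0, 3.0], [0.5, -1.0]],  # logits for class 0
    [[5.0, 5.0], [5.0, 5.0]],    # logits for class 1 (ignored here)
]
k, binary = select_binary_mask(mask_logits, class_scores=[0.9, 0.1])
```

Note that the confident mask for class 1 has no influence on the predicted label; it is simply discarded once the class branch has chosen class 0.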


Mask R-CNN defines a multi-task loss on each sampled RoI as

L = L_cls + L_box + L_mask

The classification loss L_cls and bounding-box loss L_box are identical to those defined in Fast R-CNN.
The mask branch has a Km²-dimensional output for each RoI, which encodes K binary masks of resolution m × m, one for each of the K classes. A per-pixel sigmoid is applied, and L_mask is defined as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, L_mask is defined only on the k-th mask (the other mask outputs do not contribute to the loss). This definition lets the network generate masks for every class without competition among classes.
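The mask loss described above can be sketched in a few lines of pure Python. The function name `mask_loss`, the nested-list layout of the logits, and the numerically stable form of the cross-entropy are assumptions made for this example; a real implementation would operate on batched tensors.

```python
import math

def bce_with_logit(z, t):
    """Binary cross-entropy between sigmoid(z) and target t (0 or 1).

    Uses the stable identity max(z, 0) - z*t + log(1 + exp(-|z|)).
    """
    return max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))

def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel BCE on the mask of the ground-truth class only.

    mask_logits: K masks, each an m x m nested list of logits.
    gt_mask:     m x m nested list of 0/1 targets.
    gt_class:    index k of the ground-truth class for this RoI.
    """
    selected = mask_logits[gt_class]  # the other K-1 masks are ignored
    total, count = 0.0, 0
    for logit_row, target_row in zip(selected, gt_mask):
        for z, t in zip(logit_row, target_row):
            total += bce_with_logit(z, t)
            count += 1
    return total / count

# K = 3 classes, m = 2. All logits start at zero, so every pixel predicts 0.5.
logits = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]
gt = [[1, 0], [0, 1]]
loss_a = mask_loss(logits, gt, gt_class=1)

# Changing the logits of a different class leaves the loss untouched.
logits[0] = [[100.0, -100.0], [100.0, -100.0]]
loss_b = mask_loss(logits, gt, gt_class=1)
```

Because only the ground-truth class's mask enters the loss, `loss_a` and `loss_b` are equal. With a per-pixel softmax across classes, by contrast, the masks of the other classes would affect the result.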


A mask encodes an input object’s spatial layout. Thus, unlike class labels or box offsets, which are inevitably collapsed into short output vectors by fully connected (fc) layers, the spatial structure of masks can be extracted naturally through the pixel-to-pixel correspondence provided by convolutions.