R-CNN Family

Sobhan Shukueian
Jan 28, 2022




1 — R-CNN takes an input image and extracts around 2000 bottom-up region proposals. The proposed regions are usually selected at multiple scales, with different shapes and sizes. For training, each region proposal is labeled with a class and a ground-truth bounding box.
2 — It computes features for each proposal using a large convolutional neural network (CNN): each region proposal is resized to the input size the network requires, and its features are extracted through forward propagation.
3 — It classifies each region using class-specific linear SVMs.
4 — It trains a linear regression model to predict the ground-truth bounding box.
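Step 4 is easier to picture with numbers. The sketch below uses the target parameterization from the R-CNN paper, where the regressor learns offsets relative to the proposal rather than absolute coordinates. The function names and the (center x, center y, width, height) box layout are assumptions made for this illustration.

```python
import math

def encode_box(proposal, ground_truth):
    """Regression targets (tx, ty, tw, th) that map a proposal onto a ground-truth box.

    Both boxes are (center_x, center_y, width, height) tuples.
    """
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    tx = (gx - px) / pw        # horizontal shift, in units of the proposal width
    ty = (gy - py) / ph        # vertical shift, in units of the proposal height
    tw = math.log(gw / pw)     # log-scale change in width
    th = math.log(gh / ph)     # log-scale change in height
    return tx, ty, tw, th

def decode_box(proposal, targets):
    """Invert encode_box: apply predicted offsets to a proposal."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = targets
    return (px + tx * pw, py + ty * ph, pw * math.exp(tw), ph * math.exp(th))

# Hypothetical proposal and ground-truth box, only for illustration.
proposal = (50.0, 50.0, 20.0, 40.0)
ground_truth = (55.0, 48.0, 30.0, 40.0)
targets = encode_box(proposal, ground_truth)
recovered = decode_box(proposal, targets)
```

Encoding and then decoding gives back the ground-truth box, which is what a perfect regressor would achieve.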

Selective Search

Selective search starts by over-segmenting the image into many small regions and then grows them greedily: at each step it finds the most similar neighboring regions (for example, regions with similar colors) and merges them together.
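The greedy grouping idea can be sketched with a toy example. This is not the real selective search: it assumes each region is summarized by a single mean intensity and a pixel count, and it measures similarity only by the intensity difference. The function name and data layout are made up for this sketch.

```python
def greedy_merge(regions, neighbors):
    """Toy hierarchical grouping.

    regions:   {region_id: (mean_intensity, pixel_count)}
    neighbors: set of frozenset({id_a, id_b}) pairs for adjacent regions
    Returns one (new_id, (id_a, id_b), (mean, count)) entry per merge, in order.
    """
    regions = dict(regions)
    neighbors = set(neighbors)
    next_id = max(regions) + 1
    merged = []

    def difference(pair):
        a, b = sorted(pair)
        # Ties are broken by region ids so the result is deterministic.
        return (abs(regions[a][0] - regions[b][0]), (a, b))

    while neighbors:
        pair = min(neighbors, key=difference)   # most similar adjacent pair
        a, b = sorted(pair)
        (mean_a, n_a), (mean_b, n_b) = regions[a], regions[b]
        n = n_a + n_b
        regions[next_id] = ((mean_a * n_a + mean_b * n_b) / n, n)
        # Every neighbor of a or b becomes a neighbor of the merged region.
        rewired = set()
        for other in neighbors:
            if other == pair:
                continue
            updated = frozenset(next_id if r in (a, b) else r for r in other)
            if len(updated) == 2:
                rewired.add(updated)
        neighbors = rewired
        del regions[a], regions[b]
        merged.append((next_id, (a, b), regions[next_id]))
        next_id += 1
    return merged

# Four regions in a row: two dark ones followed by two bright ones.
initial = {0: (10.0, 4), 1: (12.0, 4), 2: (200.0, 4), 3: (205.0, 4)}
adjacent = {frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})}
merges = greedy_merge(initial, adjacent)
```

In this example the two dark regions merge first, then the two bright ones, and finally the two halves. Every intermediate region produced along the way is a candidate proposal, which is how the method ends up with proposals at multiple scales.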

Cons

  • Detection takes around 47 seconds per test image, so it can’t be used for real-time detection.
  • The region proposal stage (selective search) is a fixed algorithm with no learnable parameters, so we can’t optimize it.
  • Because a single input image yields thousands of region proposals, object detection requires thousands of CNN forward propagations.
  • Training is expensive in space and time.
  • Training is a multi-stage pipeline.

Fast R-CNN

1 — The network first processes the whole image with several convolutional (conv) and max-pooling layers to produce a conv feature map.
2 — For each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map.
3 — Each feature vector is fed into a sequence of fully connected (FC) layers that finally branch into two sibling output layers:

  • One that produces softmax probability estimates over K object classes plus a catch-all “background” class.
  • Another that outputs four real-valued numbers for each of the K object classes (each set of 4 values encodes the refined bounding-box position for one of the K classes).
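To make the shapes of the two outputs concrete, here is a minimal sketch of how they might be read for one RoI. It assumes index 0 of the class scores is the background class and that the box output is a flat list of 4 × K values grouped per class. The function names and the example numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def read_heads(class_logits, box_outputs):
    """Pick the most likely class and the box offsets that belong to it.

    class_logits: K + 1 scores, index 0 is background (assumed layout).
    box_outputs:  4 * K numbers, one group of 4 per object class.
    """
    probs = softmax(class_logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    if best == 0:
        return best, probs[best], None       # background has no box
    k = best - 1                              # object classes start at index 1
    return best, probs[best], tuple(box_outputs[4 * k: 4 * k + 4])

# K = 2 object classes, so 3 class scores and 8 box numbers.
class_logits = [0.1, 2.0, 0.5]
box_outputs = [0.1, 0.2, 0.0, 0.0, -0.3, 0.1, 0.2, 0.1]
label, prob, box = read_heads(class_logits, box_outputs)
```

Here the first object class wins, so only the first group of four box values is used and the other group is ignored.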

RoI pooling

The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H × W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI.
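A minimal single-channel sketch of the idea follows. It assumes the RoI is already given in feature-map coordinates as inclusive integer corners (x1, y1, x2, y2), and it splits the RoI into bins with floor and ceil rounding. The function name is made up for this example.

```python
import math

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool the part of a 2-D feature map inside roi into an out_h x out_w grid."""
    x1, y1, x2, y2 = roi
    roi_h = y2 - y1 + 1
    roi_w = x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Row range of bin i: rounded outward so no bin is ever empty.
        r0 = y1 + math.floor(i * roi_h / out_h)
        r1 = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + math.floor(j * roi_w / out_w)
            c1 = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled

# A 4 x 6 feature map whose values are simply 0..23, row by row.
feature_map = [[r * 6 + c for c in range(6)] for r in range(4)]
pooled = roi_max_pool(feature_map, roi=(1, 0, 5, 3), out_h=2, out_w=2)
# pooled == [[9, 11], [21, 23]]
```

The RoI here is 5 columns wide, which does not divide evenly into 2 bins, so the bins overlap by one column. Whatever the RoI size, the output is always out_h × out_w, which is what lets the fully connected layers accept proposals of any shape.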

Multi-task loss

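A Fast R-CNN network has two sibling output layers, so it is trained with a multi-task loss that jointly optimizes classification and bounding-box regression on each labeled RoI.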
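As a rough sketch, the loss from the Fast R-CNN paper adds a log loss for the true class to a smooth L1 loss on the box offsets, and the box term is skipped for background RoIs (class 0). The function names, the argument layout, and the example numbers below are assumptions for illustration.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear for large errors (less sensitive to outliers)."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multi_task_loss(class_probs, true_class, predicted_box, target_box, lam=1.0):
    """Classification loss plus, for non-background RoIs, localization loss.

    class_probs:   softmax output over K + 1 classes, index 0 is background.
    predicted_box: the 4 offsets predicted for the true class.
    target_box:    the 4 ground-truth regression targets.
    lam:           weight balancing the two terms.
    """
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        return cls_loss                      # background: no box to regress
    loc_loss = sum(smooth_l1(t - v) for t, v in zip(predicted_box, target_box))
    return cls_loss + lam * loc_loss

probs = [0.1, 0.7, 0.2]
loss = multi_task_loss(probs, 1, (0.2, 0.1, 0.0, 2.0), (0.0, 0.1, 0.0, 0.0))
background_loss = multi_task_loss(probs, 0, (0.0,) * 4, (0.0,) * 4)
```

In the example, the large error of 2.0 on the last offset contributes 1.5 rather than 2.0, which shows the linear regime of smooth L1.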


Pros

1. Higher detection quality (mAP) than R-CNN and SPPnet
2. Training is single-stage, using a multi-task loss
3. Training can update all network layers
4. No disk storage is required for feature caching


Cons

The region generator (selective search) is a fixed algorithm, so it is not optimized during training.

Faster R-CNN

The Faster R-CNN architecture consists of the RPN as a region proposal algorithm and the Fast R-CNN as a detector network.

RPN (Region Proposal Network)

An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection.


  • With a VGG backbone, the whole detection system takes about 198 ms per image when using RPN proposals, compared to about 1.8 seconds when using Selective Search.
  • In ablation studies to observe the importance of scale and aspect ratios of anchor boxes, the authors find that using 3 scales with a single aspect ratio works almost as well as 3 scales and 3 aspect ratios. Depending on the task and the dataset, these ratios and scales can be modified. Using a single anchor at each location causes the mAP to drop considerably.
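To see what "3 scales and 3 aspect ratios" means in practice, here is a small sketch that generates the anchors centered on one position. The default scales (128, 256, 512 pixels) and ratios (1:2, 1:1, 2:1) follow the Faster R-CNN paper. The function name and the (x1, y1, x2, y2) output layout are assumptions for this example.

```python
import math

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centered on (cx, cy).

    Each anchor has area scale**2, and ratio is height divided by width.
    """
    boxes = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

default_anchors = anchors_at(0.0, 0.0)                    # 3 scales x 3 ratios
square_only = anchors_at(0.0, 0.0, ratios=(1.0,))         # 3 scales x 1 ratio
```

The default setting gives 9 anchors per position, and the single-ratio variant from the ablation gives 3. The RPN scores and refines every one of these boxes at every position of the feature map.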

Mask R-CNN

Mask R-CNN extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression. The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner.
Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool performs coarse spatial quantization for feature extraction. To fix the misalignment, the authors propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations.
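The difference from RoIPool is easiest to see in code: nothing is rounded, and values at fractional positions are read with bilinear interpolation. The sketch below is a simplified single-channel version. It assumes the value at index (row, col) sits at coordinates (row, col), and it averages a small grid of sample points per bin. Names and defaults are made up for this example.

```python
import math

def bilinear(feature_map, y, x):
    """Interpolate a 2-D feature map at a fractional (y, x) location."""
    h, w = len(feature_map), len(feature_map[0])
    y = min(max(y, 0.0), h - 1.0)            # clamp to the map
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feature_map, roi, out_h, out_w, samples=2):
    """roi = (x1, y1, x2, y2) in continuous feature-map coordinates, never rounded."""
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # Regularly spaced sample points inside bin (i, j).
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear(feature_map, y, x)
            row.append(total / samples ** 2)
        pooled.append(row)
    return pooled

# Feature map with value 10 * row + col, so interpolated values are easy to check.
feature_map = [[10.0 * r + c for c in range(4)] for r in range(4)]
aligned = roi_align(feature_map, roi=(0.5, 0.5, 2.5, 2.5), out_h=2, out_w=2)
```

With RoIPool, the RoI corner at 0.5 would first be snapped to an integer cell, shifting the whole region by up to half a cell. Here the fractional corner is used as it is, which matters for masks because a small shift on the feature map corresponds to many pixels in the image.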