R-CNN Object Detection and Semantic Segmentation

Sobhan Shukueian
4 min read · Dec 31, 2021


Photo by Alec Favale on Unsplash


Some background is needed before we get to the main architecture of R-CNN. If you are already familiar with these concepts, you can skip this part :)


Intersection over Union (IoU) is used when calculating mean average precision (mAP). It is a number from 0 to 1 that specifies the amount of overlap between the predicted and ground-truth bounding boxes: the area of their intersection divided by the area of their union.

The aim of the model is to keep improving its prediction until the two boxes overlap perfectly, i.e. until the IoU between them equals 1.
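The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the function name and the box format are choices made for this example, not something fixed by R-CNN.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    # Union = sum of both areas minus the part counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # identical boxes -> 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes -> 0.0
```

Partially overlapping boxes fall between these two extremes; for example, two 10×10 boxes offset by 5 pixels in each direction share 25 of 175 pixels, giving an IoU of about 0.14.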

Selective Search

There are various approaches to object localization in an object detection pipeline. One is to slide windows of different sizes over the image to find the object, which we call exhaustive search. As the number of windows increases, so does the computational cost of exhaustive search.

Selective search instead starts from many small regions and uses a greedy algorithm to grow them: it finds regions with similar colours and merges them together.

The similarity between two regions a and b can be calculated by:

S(a, b) = Stexture(a, b) + Ssize(a, b)

where Stexture(a, b) is the visual (texture) similarity and Ssize(a, b) is the size similarity between the regions.

Using this similarity, the algorithm keeps merging the most similar regions so that they grow progressively larger. The image illustrates the selective search algorithm.
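The greedy merging idea can be sketched as follows. This is a simplified illustration, not the full algorithm: it uses only the texture and size terms mentioned above, it represents each region as a hypothetical dict with a pixel count and a normalised texture histogram, and it ignores the requirement that only neighbouring regions are merged. The concrete formulas (histogram intersection for texture, one minus the combined fraction of the image for size) are assumed from the selective search literature.

```python
from itertools import combinations


def texture_similarity(hist_a, hist_b):
    # Histogram intersection of two normalised histograms (1.0 means identical).
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def size_similarity(size_a, size_b, image_size):
    # Close to 1 when both regions are small, so small regions merge first.
    return 1.0 - (size_a + size_b) / image_size


def similarity(a, b, image_size):
    return (texture_similarity(a["hist"], b["hist"])
            + size_similarity(a["size"], b["size"], image_size))


def merge(a, b):
    size = a["size"] + b["size"]
    # Size-weighted average keeps the merged histogram normalised.
    hist = [(a["size"] * ha + b["size"] * hb) / size
            for ha, hb in zip(a["hist"], b["hist"])]
    return {"size": size, "hist": hist}


def greedy_merge(regions, image_size):
    """Repeatedly merge the most similar pair; every region ever seen is a proposal."""
    regions = list(regions)
    proposals = list(regions)
    while len(regions) > 1:
        i, j = max(combinations(range(len(regions)), 2),
                   key=lambda p: similarity(regions[p[0]], regions[p[1]], image_size))
        merged = merge(regions[i], regions[j])
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged)
    return proposals


initial = [
    {"size": 10, "hist": [1.0, 0.0]},
    {"size": 10, "hist": [1.0, 0.0]},
    {"size": 80, "hist": [0.0, 1.0]},
]
for region in greedy_merge(initial, image_size=100):
    print(region)
```

In this toy input the two small regions with identical texture are merged first, and the result is then merged with the large region, so three initial regions yield five proposals at different scales.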

What is R-CNN?

R-CNNs (Region-based Convolutional Neural Networks) are a family of machine learning models used in computer vision and image processing. They are designed specifically for object detection: the goal of any R-CNN is to detect the objects in an input image and draw bounding boxes around them.

1. R-CNN takes an input image and extracts around 2000 bottom-up region proposals. These proposed regions are usually selected at multiple scales, with different shapes and sizes. For training, each region proposal is labeled with a class and a ground-truth bounding box.
2. It computes features for each proposal using a large convolutional neural network (CNN): each region proposal is resized to the input size required by the network, and its features are extracted through forward propagation.
3. It classifies each region using class-specific linear SVMs.
4. It trains a linear regression model to predict the ground-truth bounding box.
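The steps above can be sketched as a small skeleton. This is only an illustration under several assumptions: the image is a 2-D list of pixel values, the warp is a plain nearest-neighbour resize, `extract_features` stands in for the CNN, and `svms` is a hypothetical dict mapping each class name to a (weights, bias) pair. The default input size of 227 is the CNN input size reported in the R-CNN paper, not something stated in this article.

```python
def warp(image, box, size):
    """Crop box (x1, y1, x2, y2) from a 2-D list and resize it to size x size
    using nearest-neighbour sampling."""
    x1, y1, x2, y2 = box
    h, w = y2 - y1, x2 - x1
    return [[image[y1 + (r * h) // size][x1 + (c * w) // size]
             for c in range(size)]
            for r in range(size)]


def linear_score(weights, bias, features):
    # Decision value of a linear SVM: w . f + b
    return sum(w * f for w, f in zip(weights, features)) + bias


def score_proposals(image, proposals, extract_features, svms, input_size=227):
    """Warp every proposal, extract features, score it with each class SVM."""
    results = []
    for box in proposals:
        patch = warp(image, box, input_size)       # step 2: resize to CNN input
        features = extract_features(patch)         # step 2: forward pass (stub)
        scores = {cls: linear_score(w, b, features)
                  for cls, (w, b) in svms.items()}  # step 3: per-class SVMs
        results.append((box, scores))
    return results


# Toy usage: a 4x4 "image" that is dark on the left and bright on the right.
image = [[0, 0, 9, 9] for _ in range(4)]
mean_intensity = lambda patch: [sum(map(sum, patch)) / len(patch) ** 2]
svms = {"dark": ([-1.0], 5.0), "bright": ([1.0], -5.0)}
for box, scores in score_proposals(image, [(0, 0, 2, 4), (2, 0, 4, 4)],
                                   mean_intensity, svms, input_size=4):
    print(box, max(scores, key=scores.get))
```

Note that the feature extractor runs once per proposal inside the loop, which is exactly the cost that makes the method slow with roughly 2000 proposals per image.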

Although the R-CNN model uses pre-trained CNNs to effectively extract image features, it is slow. Imagine that we select thousands of region proposals from a single input image: this requires thousands of CNN forward propagations to perform object detection. This massive computing load makes it infeasible to widely use R-CNNs in real-world applications.

The image above shows the overall R-CNN pipeline.

Some Details from the Paper

We extract a 4096-dimensional feature vector from each region proposal.

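For fine-tuning the CNN, we treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives. For training the SVMs, the paper instead uses an IoU threshold of 0.3 to define negatives, chosen by grid search; setting it to 0.5 decreased mAP by 5 points.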
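The 0.5 IoU rule for labeling proposals can be sketched like this. It is a minimal example, assuming boxes in (x1, y1, x2, y2) format and ground truth given as a list of (box, class name) pairs; the function names and the "background" label are hypothetical choices for illustration.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_label(proposal, ground_truths, pos_thresh=0.5):
    """Return the class of the best-overlapping ground-truth box if its IoU
    reaches pos_thresh, otherwise 'background'."""
    best_iou, best_cls = 0.0, "background"
    for gt_box, gt_cls in ground_truths:
        overlap = iou(proposal, gt_box)
        if overlap > best_iou:
            best_iou, best_cls = overlap, gt_cls
    return best_cls if best_iou >= pos_thresh else "background"


ground_truths = [((0, 0, 10, 10), "cat")]
print(assign_label((0, 0, 10, 8), ground_truths))    # IoU 0.8  -> cat
print(assign_label((5, 5, 15, 15), ground_truths))   # IoU ~0.14 -> background
```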

We start SGD at a learning rate of 0.001 (1/10th of the initial pre-training rate), which allows fine-tuning to make progress while not clobbering the initialization. In each SGD iteration, we uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128.
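The mini-batch construction can be sketched as below. This is an illustrative sketch only: the window pools are plain Python lists, the labels 1 and 0 mark positive and background windows, and falling back to sampling with replacement when a pool is too small is an assumption of this example rather than something described in the paper.

```python
import random


def draw(pool, n, rng):
    # Sample without replacement when possible; otherwise (assumption) with replacement.
    if len(pool) >= n:
        return rng.sample(pool, n)
    return rng.choices(pool, k=n)


def sample_minibatch(positives, backgrounds, n_pos=32, n_bg=96, seed=None):
    """Build one mini-batch of (window, label) pairs: 32 positives + 96 backgrounds."""
    rng = random.Random(seed)
    batch = [(w, 1) for w in draw(positives, n_pos, rng)]
    batch += [(w, 0) for w in draw(backgrounds, n_bg, rng)]
    rng.shuffle(batch)
    return batch


positives = list(range(50))            # stand-ins for positive windows
backgrounds = list(range(1000, 1500))  # stand-ins for background windows
batch = sample_minibatch(positives, backgrounds, seed=0)
print(len(batch), sum(label for _, label in batch))  # 128 windows, 32 of them positive
```

Fixing the ratio at 1:3 biases sampling towards positives, which are rare compared with background windows among the region proposals.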

We have found that the choice of architecture has a large effect on R-CNN detection performance.

Bounding-box regression

Bounding-box regression is a simple method to reduce localization errors: train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal.
After scoring each selective search proposal with a class-specific detection SVM, we predict a new bounding box for the detection using a class-specific bounding-box regressor.
We regress from features computed by the CNN.
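The article does not spell out what the regressor actually predicts. In the R-CNN paper, the targets are offsets of the ground-truth box relative to the proposal: the centre shift divided by the proposal's width and height, and the logarithm of the width and height ratios. The sketch below computes those targets and applies predicted offsets back to a proposal; it assumes (x1, y1, x2, y2) boxes, uses hypothetical function names, and leaves out the regression model itself.

```python
import math


def to_center(box):
    """Convert (x1, y1, x2, y2) to (centre x, centre y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h


def regression_targets(proposal, ground_truth):
    """Offsets (tx, ty, tw, th) that map the proposal onto the ground truth."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(ground_truth)
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))


def apply_deltas(proposal, deltas):
    """Apply predicted offsets to a proposal and return the refined box."""
    px, py, pw, ph = to_center(proposal)
    dx, dy, dw, dh = deltas
    cx, cy = px + pw * dx, py + ph * dy
    w, h = pw * math.exp(dw), ph * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


proposal = (10, 10, 50, 90)
ground_truth = (14, 8, 62, 96)
targets = regression_targets(proposal, ground_truth)
print(targets)
print(apply_deltas(proposal, targets))  # recovers the ground-truth box (up to rounding)
```

Dividing by the proposal's size makes the targets independent of scale, so the same regressor can be used for small and large proposals.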

Some Problems Of R-CNN

  • It takes around 47 seconds per test image, so it can't be used for real-time detection.
  • The region proposal stage (selective search) is a fixed algorithm with no learnable parameters, so it can't be optimized during training.
  • Because thousands of region proposals are extracted from a single input image, thousands of CNN forward propagations are required to perform object detection.
  • Training is expensive in space and time.
  • Training is a multi-stage pipeline.

References :