Scratch book: Object detection with RCNN, Fast-RCNN and Faster-RCNN

Leo Pham
3 min read · Feb 16, 2021

All of the following content is a summary of the Jonathan Hui blog post listed in the reference. For study purposes only.

R-CNN: Region-based Convolutional Neural Networks

The region proposal method used here is selective search. It starts by splitting the image into a relatively large number of small regions, then repeatedly merges the most similar regions until only a small number of larger segments remain. Each segment produced along the way represents a candidate region of interest.
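The grouping idea can be illustrated with a toy sketch. It is not the real selective search: the actual method only merges neighbouring regions and compares them by colour, texture, size and fill, while this version uses a single hypothetical mean-colour value per region and ignores adjacency. The function names and the `(box, mean_color, size)` layout are assumptions made for the example.

```python
def merge_boxes(a, b):
    """Smallest box (x1, y1, x2, y2) that encloses boxes a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def toy_selective_search(regions):
    """regions: list of (box, mean_color, size) tuples for the initial small segments.

    Greedily merges the two most similar regions until one is left and
    returns every box seen along the way as a candidate ROI.
    """
    regions = list(regions)
    proposals = [r[0] for r in regions]
    while len(regions) > 1:
        # Find the pair with the closest mean colour
        # (a stand-in for the real similarity measure).
        best = None
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                dist = abs(regions[i][1] - regions[j][1])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        (box_a, col_a, n_a), (box_b, col_b, n_b) = regions[i], regions[j]
        merged = (
            merge_boxes(box_a, box_b),
            (col_a * n_a + col_b * n_b) / (n_a + n_b),  # size-weighted mean colour
            n_a + n_b,
        )
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged[0])
    return proposals


# Three made-up starting segments: two similar dark ones and one bright one.
segments = [((0, 0, 2, 2), 0.10, 4), ((2, 0, 4, 2), 0.12, 4), ((0, 2, 4, 4), 0.90, 8)]
rois = toy_selective_search(segments)
```

The two dark segments are merged first, so their enclosing box becomes a proposal before the final box that covers everything.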

R-CNN makes use of a region proposal method to create about 2000 ROIs (regions of interest). The regions are warped into fixed-size images and fed into a CNN individually. The CNN is followed by fully connected layers that classify the object and refine the bounding box.

Use region proposals, CNN, affine layers to locate objects. - Ref 1
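A minimal NumPy sketch of the R-CNN flow described above, assuming ROIs are given as `(x1, y1, x2, y2)` pixel boxes. The names `warp_roi` and `rcnn_forward` are made up for this example, nearest-neighbour resizing stands in for whatever warping a real implementation uses, and `cnn` is a placeholder callable rather than a trained network.

```python
import numpy as np


def warp_roi(image, box, out_size=(224, 224)):
    """Crop box = (x1, y1, x2, y2) from an H x W x C image and resize the
    crop to out_size using nearest-neighbour sampling."""
    x1, y1, x2, y2 = box
    patch = image[y1:y2, x1:x2]
    h, w = patch.shape[:2]
    out_h, out_w = out_size
    # Map every output row/column back to a source row/column.
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return patch[rows][:, cols]


def rcnn_forward(image, rois, cnn, out_size=(224, 224)):
    """Run the CNN once per ROI, which is what makes R-CNN slow."""
    return [cnn(warp_roi(image, box, out_size)) for box in rois]


# Toy usage: a tiny image, two ROIs and a stand-in "CNN" that averages the patch.
image = np.arange(6 * 8 * 3, dtype=float).reshape(6, 8, 3)
rois = [(0, 0, 4, 4), (2, 1, 8, 6)]
placeholder_cnn = lambda patch: patch.mean()  # hypothetical, not a real network
features = rcnn_forward(image, rois, placeholder_cnn, out_size=(4, 4))
```

With about 2000 ROIs per image, `cnn` would be called about 2000 times, which is the cost Fast R-CNN sets out to remove.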

Fast RCNN

Instead of extracting features for each image patch from scratch, we use a feature extractor (a CNN) to extract features for the whole image first. We also use an external region proposal method, such as selective search, to create ROIs, which are later combined with the corresponding feature maps to form patches for object detection. We warp the patches to a fixed size using ROI pooling and feed them to fully connected layers for classification and localization (detecting the location of the object). By not repeating the feature extraction, Fast R-CNN cuts down the processing time significantly.

Fast R-CNN still takes region proposals computed on the original image, but the convolutional neural network no longer extracts features from each region separately. It applies the conv layers to the whole image once to get a single set of feature maps, and each ROI is then projected onto those feature maps. Where R-CNN resizes each ROI in image space, Fast R-CNN applies ROI pooling on the feature maps before feeding the result into the fully connected (FC) layers.

Apply region proposal on feature maps and form fixed size patches using ROI pooling. — Ref 1
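ROI pooling can be sketched in a few lines of NumPy. This is a simplified illustration, assuming the ROI is already expressed in feature-map coordinates as `(x1, y1, x2, y2)` and covers at least one cell; the function name `roi_max_pool` and the 2 x 2 output size are choices made for the example.

```python
import numpy as np


def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    """feature_map: H x W x C array; roi: (x1, y1, x2, y2) in feature-map cells.

    Splits the ROI into an out_h x out_w grid and keeps the maximum of each
    grid cell, so ROIs of any size give an output of the same shape.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    h, w, c = region.shape
    out_h, out_w = out_size
    # Bin edges; bins can differ in size when h or w is not divisible.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    pooled = np.empty((out_h, out_w, c), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # max(...) guarantees every bin holds at least one cell.
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[i, j] = cell.max(axis=(0, 1))
    return pooled


# One shared feature map, two ROIs of different sizes, same output shape.
fmap = np.arange(36, dtype=float).reshape(6, 6, 1)
pooled_a = roi_max_pool(fmap, (1, 1, 5, 5))
pooled_b = roi_max_pool(fmap, (0, 0, 6, 3))
```

Both `pooled_a` and `pooled_b` have shape `(2, 2, 1)`, which is what lets ROIs of any size feed the same fully connected layers.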

Faster R-CNN: why do we call it a two-stage detector?

Faster R-CNN adopts a design similar to Fast R-CNN, except that it replaces the external region proposal method with an internal deep network, and the ROIs are derived from the feature maps instead.

The external region proposal is replaced by an internal deep network.— Ref 1

We call it two-stage because it contains a sub-network, the region proposal network (RPN), that runs before the detection head. The RPN takes the feature maps produced by the first convolutional network as input and passes them through a few more conv layers. The output then goes into two separate fully connected (FC) heads: one predicts the bounding boxes, and the other classifies each box as containing an object or being background. The resulting proposals are the first stage; classifying and refining them, as in Fast R-CNN, is the second.
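The two heads of the RPN can be sketched with plain NumPy. This is a heavily simplified illustration under stated assumptions: the weights are random placeholders rather than trained values, the same small network is applied independently at every feature-map location, and each location predicts a single box. The names `rpn_heads` and `top_proposals` and all the shapes are made up for the example.

```python
import numpy as np


def rpn_heads(feature_map, w_shared, w_box, w_cls):
    """feature_map: H x W x C. Applies the same small network at every location.

    Returns box outputs (H x W x 4) and object probabilities (H x W).
    """
    hidden = np.maximum(feature_map @ w_shared, 0.0)  # shared layer + ReLU
    boxes = hidden @ w_box                            # head 1: 4 box values per location
    logits = hidden @ w_cls                           # head 2: background vs object logits
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return boxes, probs[..., 1]


def top_proposals(boxes, objectness, k):
    """Keep the k locations with the highest object probability."""
    flat_scores = objectness.ravel()
    order = np.argsort(flat_scores)[::-1][:k]
    return boxes.reshape(-1, 4)[order], flat_scores[order]


# Random placeholder inputs and weights, only to show the shapes involved.
rng = np.random.default_rng(0)
fmap = rng.normal(size=(5, 5, 16))
w_shared = rng.normal(size=(16, 8))
w_box = rng.normal(size=(8, 4))
w_cls = rng.normal(size=(8, 2))

boxes, objectness = rpn_heads(fmap, w_shared, w_box, w_cls)
proposals, scores = top_proposals(boxes, objectness, k=3)
```

The boxes kept by `top_proposals` play the role that the selective search ROIs played in Fast R-CNN: they are handed to ROI pooling and the detection head as the second stage.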

Reference:

  1. Jonathan Hui: https://jonathan-hui.medium.com/what-do-we-learn-from-region-based-object-detectors-faster-r-cnn-r-fcn-fpn-7e354377a7c9
