Scratch book: Object detection with RCNN, Fast-RCNN and Faster-RCNN
All of the following content is a summary of Jonathan Hui's blog post in the reference. For study purposes only.
RCNN — Region-based Convolutional Neural Networks
The region proposal method uses selective search: it starts by over-segmenting the image into a relatively large number of small regions, then greedily merges the most similar neighboring regions until a sufficiently small set of segments remains. Each of them represents a region of interest. (The SVM in R-CNN is used later, for classifying the proposed regions, not for merging them.)
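The grouping step above can be sketched as a greedy merge loop. This is a toy illustration, not the real selective-search implementation: segments are sets of pixel indices, the similarity function is a stand-in, and every intermediate group is kept as a proposal.

```python
def greedy_merge(segments, similarity, threshold=0.5):
    """Toy sketch of selective search's grouping step: repeatedly merge
    the most similar pair of segments until no pair is similar enough.
    All names here are illustrative assumptions, not the real algorithm."""
    segments = [set(s) for s in segments]
    # Every intermediate group counts as a region proposal.
    proposals = [frozenset(s) for s in segments]
    while len(segments) > 1:
        best, best_sim = None, threshold
        for i in range(len(segments)):
            for j in range(i + 1, len(segments)):
                sim = similarity(segments[i], segments[j])
                if sim > best_sim:
                    best, best_sim = (i, j), sim
        if best is None:  # no pair exceeds the threshold: stop merging
            break
        i, j = best
        merged = segments[i] | segments[j]
        segments = [s for k, s in enumerate(segments) if k not in (i, j)]
        segments.append(merged)
        proposals.append(frozenset(merged))
    return proposals

# Toy similarity on 1-D "pixels": 1.0 if the segments touch, else 0.0
def adjacency_sim(a, b):
    return 1.0 if any(abs(x - y) == 1 for x in a for y in b) else 0.0

props = greedy_merge([[0], [1], [5], [6]], adjacency_sim)
```

Here the four single-pixel segments merge into {0, 1} and {5, 6}, giving six proposals in total: the four originals plus the two merged groups.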
R-CNN makes use of a region proposal method to create about 2000 ROIs (regions of interest). The regions are warped into fixed-size images and fed into a CNN individually. Fully connected layers then classify the object and refine the bounding box.
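The per-region warping can be sketched as follows. The resize below is a crude nearest-neighbour stand-in for R-CNN's warp (the paper warps to 227x227; the 224 here and the ROI coordinates are illustrative). The key point is the list comprehension at the end: every ROI becomes its own CNN input, which is why R-CNN is slow.

```python
import numpy as np

def warp(patch, out_h=224, out_w=224):
    """Nearest-neighbour resize of an image patch to a fixed size
    (a simplified stand-in for R-CNN's anisotropic warp)."""
    h, w = patch.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return patch[rows][:, cols]

image = np.random.rand(480, 640, 3)
# (y1, x1, y2, x2) boxes; real R-CNN would have ~2000 of these
rois = [(10, 20, 200, 300), (50, 60, 150, 260)]

# One warped patch per ROI -> one CNN forward pass per ROI
patches = [warp(image[y1:y2, x1:x2]) for (y1, x1, y2, x2) in rois]
```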
Fast R-CNN

Instead of extracting features for each image patch from scratch, we use a feature extractor (a CNN) to extract features for the whole image first. We also use an external region proposal method, like selective search, to create ROIs, which are later combined with the corresponding feature maps to form patches for object detection. We warp the patches to a fixed size using ROI pooling and feed them to fully connected layers for classification and localization (detecting the location of the object). By not repeating the feature extraction, Fast R-CNN cuts the processing time down significantly.
Fast R-CNN still uses region proposals from the original image, but the convolutional network does not extract features from each region separately. It applies the conv layers to the whole image to get a single feature map, then projects the proposal ROIs onto that feature map. Where R-CNN resizes the image patches themselves, Fast R-CNN applies ROI pooling before feeding each region into the fully connected layers.
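ROI pooling can be sketched in a few lines: split the ROI's window on the feature map into a fixed grid and max-pool each cell, so every ROI, whatever its size, yields the same output shape. A minimal numpy sketch (single-channel feature map, integer ROI coordinates already projected onto the map):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=2):
    """Minimal ROI max-pooling sketch: divide the ROI window into an
    out_size x out_size grid and take the max of each cell."""
    y1, x1, y2, x2 = roi
    window = feature_map[y1:y2, x1:x2]
    h, w = window.shape
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = window[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(64, dtype=float).reshape(8, 8)
pooled = roi_pool(fmap, (0, 0, 5, 4), out_size=2)  # 5x4 window -> 2x2 output
```

However large or skewed the ROI is, the output is always out_size x out_size, which is what lets a single set of fully connected layers consume every region.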
Faster R-CNN — Why do we call it a two-stage detector?
Faster R-CNN adopts a similar design to Fast R-CNN, except that it replaces the external region proposal method with an internal deep network, and the ROIs are derived from the feature maps instead.
The reason we call it two-stage is that it has a sub-network inside, the region proposal network (RPN). The RPN takes the output feature maps from the first convolutional network as input. After feeding forward through additional conv layers, the output goes into two separate heads: one predicts the bounding boxes, the other classifies whether each box contains an object or is background.
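The two-headed structure can be sketched with numpy. Shapes and weights below are illustrative assumptions (in the real RPN the shared layer is a 3x3 conv followed by two sibling 1x1 convs, and the weights are learned); 1x1 convs are written here as channel-wise matrix multiplies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Backbone feature map: H x W grid with C channels; k anchors per location
C, H, W, k = 16, 5, 5, 9
fmap = rng.standard_normal((H, W, C))

# Shared intermediate layer plus two sibling heads (random weights here)
W_shared = rng.standard_normal((C, C))
W_cls = rng.standard_normal((C, 2 * k))   # object-vs-background score per anchor
W_reg = rng.standard_normal((C, 4 * k))   # 4 box deltas per anchor

hidden = np.maximum(fmap @ W_shared, 0)   # shared conv + ReLU (as a matmul)
cls_scores = hidden @ W_cls               # (H, W, 2k) objectness logits
box_deltas = hidden @ W_reg               # (H, W, 4k) bbox refinements
```

The classification head scores each anchor as object or background, and the regression head refines each anchor's box; the top-scoring refined boxes become the ROIs that the second stage (the Fast R-CNN head) classifies.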