Originally published in Towards Data Science on Medium
In this article, I want to give you an overview of the history of object detectors and explain how the
architectures evolved into current state-of-the-art detectors. Furthermore, I will go into detail about the inner
workings of Faster R-CNN, since it is very widely used and also part of the TensorFlow Object Detection API.
Brief History of Object Detectors
Object detection combines the tasks of object classification and localization. Current object detectors can be divided
into two categories: Networks separating the tasks of determining the location of objects and their classification,
where Faster R-CNN is one of the most famous ones, and networks which predict bounding boxes and class scores at once,
with the YOLO and SSD networks being famous architectures.
The first deep neural network for object detection was Overfeat [1]. The authors introduced a multi-scale sliding
window approach using CNNs and showed that object detection also improved image classification. They were shortly
followed by R-CNN: Regions with CNN features [2]. The authors proposed a model that used selective search for
generating region proposals by merging similar pixels into regions. Each region was fed into a CNN, which produced
a high-dimensional feature vector. This vector was then used for the final classification and bounding box
regression, as shown in Figure 1.
Figure 1: The standard architecture of an R-CNN network consisting of a region proposal method,
mostly selective search, followed by a CNN for each proposal. The final classification is done with an SVM and a
bounding box regressor.
R-CNN outperformed the Overfeat network by a large margin but was also very slow, because the proposal generation
using selective search was very time-intensive, as was the need to feed every single proposal through a CNN. A more
sophisticated approach, Fast R-CNN [3], also generated region proposals with selective search but fed the whole
image through the CNN only once. The region proposals were pooled directly on the feature map by ROI pooling, and
the pooled feature vectors were fed into a fully connected network for classification and regression, as depicted
in Figure 2.
Figure 2: The standard architecture of a Fast R-CNN network. The region proposals are generated
by selective search but pooled directly on the feature map, followed by multiple FC layers for final classification
and bounding box regression.
Faster R-CNN [4] addressed the selective-search bottleneck by proposing a novel region proposal network that was
fused with the Fast R-CNN architecture to drastically speed up the process; it will be explained in greater detail
in the next section. Another approach to detecting objects in images was R-FCN [5], the region-based fully
convolutional network, which used position-sensitive score maps instead of a per-region subnetwork.
The design of object detection networks was revolutionized by the YOLO [6] network. It follows a completely
different approach from the aforementioned models and is capable of predicting class scores and bounding boxes at
once. The proposed model divides the image into a grid, where each cell predicts a confidence score of an object
being present, together with the corresponding bounding box coordinates. This allows YOLO to make real-time
predictions. The authors also released two more versions, YOLO9000 [7] and YOLOv2, where the former was capable of
predicting over 9000 categories and the latter of processing larger images. Another network that predicts classes
and bounding boxes at once is the single shot detector, SSD [8]. It is comparable to YOLO but uses multiple aspect
ratios per grid cell and more convolutional layers to improve prediction.
Figure 3: The standard architecture of a Faster R-CNN network, where the region proposals are
generated using an RPN that works directly on the feature map. The last layers are fully connected for
classification and bounding box regression.
The major problem of Fast R-CNN and R-CNN was the time and resource-intensive generation of region proposals. Faster
R-CNN solved this by fusing Fast R-CNN with a region proposal network (RPN), as depicted in Figure 3. The RPN uses the
output of the CNN as input and generates region proposals, where each proposal consists of an objectness score, as
well as the object’s location. The RPN can be trained jointly with the detection network, speeding up the training and
inference time. Faster R-CNN is up to 34 times faster than Fast R-CNN. In the following paragraphs, each step of
Faster R-CNN is explained in greater detail.
The objective of Faster R-CNN is to detect objects as rectangular bounding boxes. These rectangles can be of
varying size and scale. When previous works tried to consider objects of varying size and scale, they either
created image pyramids, where multiple sizes of the image were considered, or pyramids of filters, where multiple
differently sized convolutional filters were applied [1, 9, 10]. These approaches worked on the input image,
whereas Faster R-CNN works on the feature map from the output of a CNN and thus creates a pyramid of anchors. An
anchor is a fixed bounding box that consists of a center point, a specific width and height, and references a
bounding box on the original image. A set of anchors, consisting of multiple anchors with different combinations
of sizes and scales, is generated for each position of a sliding window on the feature map. Figure 4 shows an
example, where a window of size 3x3 generates k anchors, where each anchor has the same center point in the
original image. This is possible because the convolution operation is translation invariant, and a position on the
feature map can be mapped back to a region in the image.
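The anchor generation described above can be sketched in a few lines of NumPy. The stride of 16 and the scales and ratios below are the values used in the Faster R-CNN paper (giving k = 9 anchors per position); the function itself is an illustrative sketch, not the reference implementation.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors per feature-map
    position, as (x1, y1, x2, y2) boxes in image coordinates."""
    # k base anchors centered at the origin
    base = []
    for scale in scales:
        for ratio in ratios:
            # keep the anchor area ~ scale**2 while varying the aspect ratio
            w = scale * np.sqrt(1.0 / ratio)
            h = scale * np.sqrt(ratio)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                      # shape (k, 4)

    # every feature-map cell maps back to a center point in the image
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    centers = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)

    # shift each base anchor to each center: (H*W, k, 4) -> (H*W*k, 4)
    return (centers + base).reshape(-1, 4)

anchors = generate_anchors(feat_h=38, feat_w=50)
print(anchors.shape)  # (17100, 4) = 38 * 50 positions * 9 anchors
```

Note that many of these anchors cross the image boundary; in practice those are discarded or clipped before training.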
Figure 4: Generation of k anchors per position by sliding a 3x3 window over the feature map.
Region Proposal Network
The RPN is a fully convolutional network that works directly on the feature map in order to generate region
proposals, as depicted in Figure 5. It takes the anchors as input, predicts an objectness score, and performs box
regression. The former is the likelihood of an anchor being an object rather than background, and the latter
corresponds to the offset from the anchor to the actual box. Therefore, for k anchors, the RPN predicts 2k scores
and 4k box regression values. The initial number of anchors can be much larger than for other region proposal
methods, since the RPN drastically reduces it by only considering regions with a high objectness score. Training
the RPN is not trivial. Since this is a supervised learning approach, each anchor has to be labeled as either
foreground or background. Therefore, every anchor has to be compared to every ground truth object by calculating
their intersection over union (IoU). An anchor is considered foreground and positive if its IoU with some ground
truth object is greater than 0.7. It is considered background and negative if its IoU with every ground truth box
is lower than 0.3. All anchors with an IoU between 0.3 and 0.7 are ignored. The distribution of positive and
negative proposals is very imbalanced, because there are far more negative than positive proposals per ground
truth box. Therefore, a minibatch with a fixed number of positives and negatives is sampled for the training
process. If there are not enough positive proposals, the batch is filled with the proposals that have the highest
IoU for the respective ground truth box. If there are still not enough positive proposals, the batch is padded
with negatives. The RPN employs a multitask loss to optimize both objectives simultaneously: a binary
cross-entropy loss for classification and a smooth L1 loss for bounding box regression.
Figure 5: Standard architecture of a region proposal network.
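The labeling rules above can be sketched as follows. Here `iou_matrix` holds the IoU between every anchor and every ground truth box, and the 0.3/0.7 thresholds are the ones from the paper; this is a simplified illustration that omits the minibatch sampling and the filling with highest-IoU anchors.

```python
import numpy as np

def label_anchors(iou_matrix, lo=0.3, hi=0.7):
    """Label each anchor as foreground (1), background (0) or ignored (-1),
    given the (num_anchors, num_gt) matrix of anchor/ground-truth IoUs."""
    labels = np.full(iou_matrix.shape[0], -1)   # default: ignored
    best_iou = iou_matrix.max(axis=1)           # best overlap per anchor
    labels[best_iou < lo] = 0                   # clearly background
    labels[best_iou > hi] = 1                   # clearly foreground
    return labels

# toy example: 4 anchors, 2 ground truth boxes
ious = np.array([[0.80, 0.10],   # overlaps one object well -> foreground
                 [0.10, 0.05],   # overlaps nothing -> background
                 [0.50, 0.40],   # ambiguous -> ignored
                 [0.20, 0.75]])  # -> foreground
print(label_anchors(ious))  # [ 1  0 -1  1]
```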
The regression loss is only calculated for positive anchors. Non-maximum suppression (NMS) is applied after
prediction to remove region proposals whose IoU with another, higher-scoring proposal exceeds a certain threshold.
After NMS, the top N proposals are selected as the final region proposals.
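As an illustration, greedy NMS over scored boxes could look like this. This is a pure-Python sketch (real implementations are vectorized), and the 0.7 threshold is just a commonly used value, not a fixed constant of the method.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_threshold=0.7):
    """Greedy NMS: keep the highest-scoring box, drop all boxes that
    overlap it by more than iou_threshold, then repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- the near-duplicate box 1 is suppressed
```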
Region of interest pooling
The next step after the RPN is to use the region proposals to predict object classes and locations. Instead of
following the approach of R-CNN, where each proposal is fed through a classification network, Faster R-CNN
extracts the features from the shared feature map. Since the classifier expects a fixed-size input, fixed-size
features are extracted for each region proposal from the feature map. In modern implementations, the feature map
is cropped by the region proposal and resized to a fixed size. Max pooling then extracts the most salient
features, leading to a fixed-size representation for each region proposal, which is fed into the final stage.
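ROI pooling in its classic form (binning plus max pooling, as introduced with Fast R-CNN, rather than the crop-and-resize variant) can be sketched like this; the 7x7 output size and the rounding to whole cells are simplifications for illustration.

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=7):
    """Classic ROI max pooling: split the ROI (x1, y1, x2, y2, in
    feature-map cells) into output_size x output_size bins and take
    the maximum activation inside each bin."""
    x1, y1, x2, y2 = roi
    pooled = np.zeros((output_size, output_size, feature_map.shape[2]))
    # bin edges along each axis, rounded to whole feature-map cells
    xs = np.linspace(x1, x2, output_size + 1).round().astype(int)
    ys = np.linspace(y1, y2, output_size + 1).round().astype(int)
    for i in range(output_size):
        for j in range(output_size):
            # guarantee each bin covers at least one cell
            y_slice = slice(ys[i], max(ys[i + 1], ys[i] + 1))
            x_slice = slice(xs[j], max(xs[j + 1], xs[j] + 1))
            pooled[i, j] = feature_map[y_slice, x_slice].max(axis=(0, 1))
    return pooled

feat = np.random.rand(38, 50, 512)                   # (H, W, C) feature map
print(roi_max_pool(feat, roi=(4, 6, 30, 20)).shape)  # (7, 7, 512)
```

Every proposal, regardless of its size on the feature map, ends up as the same 7x7xC tensor, which is what allows a single fully connected head to process all proposals.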
Classification and Regression
The last step in Faster R-CNN is the classification of the extracted features. The pooled features are flattened
and fed into two fully connected layers, which handle classification and regression. The classification layer
outputs N+1 predictions, one for each of the N classes plus one background class. The regression layer outputs 4N
predictions, where each group of four represents the regressed bounding box for one of the N classes. The training
of the R-CNN module is comparable to that of the RPN. A proposal is assigned to a ground truth box if their IoU is
greater than 0.5. Proposals with an IoU between 0.1 and 0.5 are treated as negatives, i.e., background, and
proposals with an IoU lower than 0.1 are ignored. Random sampling during training creates a mini-batch containing
25 percent foreground and 75 percent background proposals. The classification loss is a multiclass cross-entropy
loss using all proposals in the mini-batch, whereas the localization loss only uses positive proposals. In order
to remove duplicates, class-based NMS is applied. The final output is a list of all predicted objects with a
probability higher than a specific threshold.
There are many different ways to detect objects in images. This blog post traced the history from slow networks
using selective search for generating region proposals to more sophisticated networks, such as Faster R-CNN. If
you want to get started with object detection, I recommend the TensorFlow Object Detection API, which mainly
features the Faster R-CNN and SSD architectures. Thanks for reading!
References
- [1] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, Overfeat: Integrated recognition,
localization and detection using convolutional networks, CoRR, vol. abs/1312.6229, 2014.
- [2] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection
and semantic segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [3] R. B. Girshick, Fast R-CNN, 2015 IEEE International Conference on Computer Vision (ICCV).
- [4] S. Ren, K. He, R. B. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region
proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39.
- [5] J. Dai, Y. Li, K. He, and J. Sun, R-FCN: Object detection via region-based fully convolutional networks,
in NIPS, 2016.
- [6] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, You only look once: Unified, real-time object
detection, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788, 2016.
- [7] J. Redmon and A. Farhadi, YOLO9000: Better, faster, stronger, 2017 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 6517–6525, 2017.
- [8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg, SSD: Single shot multibox
detector, in ECCV, 2016.
- [9] D. A. Forsyth, Object detection with discriminatively trained part-based models, IEEE Computer, vol. 47,
pp. 6–7, 2014.
- [10] K. He, X. Zhang, S. Ren, and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual
recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, pp. 1904–1916.