Understanding Object Detection

From the history of object detection down to the inner workings of the famous Faster R-CNN

Originally published in Towards Data Science on Medium

In this article, I want to give you an overview of the history of object detectors and explain how the architectures evolved into current state-of-the-art detectors. Furthermore, I will go into detail about the inner workings of Faster R-CNN, since it is widely used and also part of the TensorFlow Object Detection API.

Brief History of Object Detectors

Object detection combines the tasks of object classification and localization. Current object detectors can be divided into two categories: networks that separate the tasks of localizing objects and classifying them, with Faster R-CNN being one of the most prominent examples, and networks that predict bounding boxes and class scores at once, such as YOLO and SSD.

The first deep neural network for object detection was OverFeat [1]. Its authors introduced a multi-scale sliding window approach using CNNs and showed that object detection also improved image classification. It was shortly followed by R-CNN: Regions with CNN features [2]. The authors proposed a model that used selective search to generate region proposals by merging similar pixels into regions. Each region was fed into a CNN, which produced a high-dimensional feature vector. This vector was then used for the final classification and bounding box regression, as shown in Figure 1.

Figure 1: The standard architecture of an R-CNN network, consisting of a region proposal method, mostly selective search, followed by a CNN for each proposal. The final classification is done with an SVM and a regressor.

R-CNN outperformed the OverFeat network by a large margin but was also very slow: proposal generation with selective search was time-intensive, and every single proposal had to be fed through a CNN. A more sophisticated approach, Fast R-CNN [3], also generated region proposals with selective search but fed the whole image through a CNN only once. The region proposals were pooled directly on the feature map by ROI pooling, and the pooled feature vectors were fed into a fully connected network for classification and regression, as depicted in Figure 2.

Figure 2: The standard architecture of a Fast R-CNN network. The region proposals are generated by selective search but pooled directly on the feature map, followed by multiple FC layers for final classification and bounding box regression.

Faster R-CNN [4] addressed this issue by fusing a novel region proposal network with the Fast R-CNN architecture, drastically speeding up the process; it is explained in greater detail in the next section. Another approach to detecting objects in images was R-FCN [5], the region-based fully convolutional network, which used position-sensitive score maps instead of a per-region subnetwork.

The design of object detection networks was revolutionized by the YOLO [6] network. It follows a completely different approach from the aforementioned models and is capable of predicting class scores and bounding boxes at once. The proposed model divided the image into a grid, where each cell predicted a confidence score for an object being present, together with the corresponding bounding box coordinates. This allowed YOLO to make real-time predictions. The authors also released two more versions, YOLO9000 and YOLOv2 [7], where the former was capable of predicting over 9000 categories and the latter of processing images of varying sizes. Another network that predicts classes and bounding boxes at once is the single shot detector, SSD [8]. It is comparable to YOLO but uses multiple aspect ratios per grid cell and additional convolutional layers to improve prediction.

Faster R-CNN

Figure 3: The standard architecture of a Faster R-CNN network, where the region proposals are generated by an RPN that works directly on the feature map. The last layers are fully connected and perform classification and bounding box regression.

The major problem of R-CNN and Fast R-CNN was the time- and resource-intensive generation of region proposals. Faster R-CNN solved this by fusing Fast R-CNN with a region proposal network (RPN), as depicted in Figure 3. The RPN takes the output of the CNN as input and generates region proposals, where each proposal consists of an objectness score as well as the object's location. The RPN can be trained jointly with the detection network, speeding up both training and inference. Faster R-CNN is up to 34 times faster than Fast R-CNN. In the following paragraphs, each step of Faster R-CNN is explained in greater detail.

Anchors

The objective of Faster R-CNN is to detect objects as rectangular bounding boxes, which can vary in size and scale. When previous works wanted to account for objects of varying size and scale, they either created image pyramids, where the image was processed at multiple sizes, or pyramids of filters, where convolutional filters of multiple sizes were applied [1, 9, 10]. These approaches work on the input image, whereas Faster R-CNN works on the feature map output by the CNN and thus creates a pyramid of anchors. An anchor is a fixed bounding box with a center point and a specific width and height, and it references a bounding box in the original image. A set of anchors, covering multiple combinations of scales and aspect ratios, is generated for each position of a sliding window on the feature map. Figure 4 shows an example where a window of size 3x3 generates k anchors, all sharing the same center point in the original image. This is possible because the convolution operation is translation invariant, so a position on the feature map can be mapped back to a region in the image.

Figure 4: Generation of k anchors per position by sliding a 3x3 window over the feature map.
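To make the anchor generation concrete, here is a minimal NumPy sketch using the defaults from the Faster R-CNN paper (a stride of 16 for a VGG-16 backbone, three scales, and three aspect ratios, so k = 9); the exact values are configuration choices rather than fixed parts of the method:

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors as (x1, y1, x2, y2) boxes."""
    # Base anchors centered at the origin, one per scale/ratio combination.
    # For ratio = h/w, choosing w = s/sqrt(r) and h = s*sqrt(r) keeps area s^2.
    base = []
    for scale in scales:
        for ratio in ratios:
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)  # (k, 4)

    # Each feature-map cell maps back to a stride-spaced center in the image.
    shift_x = (np.arange(feat_w) + 0.5) * stride
    shift_y = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)

    # Broadcast: every cell gets the same k base anchors, shifted to its center.
    return (shifts + base).reshape(-1, 4)

anchors = generate_anchors(feat_h=38, feat_w=50)  # ~600x800 input at stride 16
print(anchors.shape)  # (38 * 50 * 9, 4) = (17100, 4)
```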

Region Proposal Network

The RPN is a fully convolutional network that works directly on the feature map in order to generate region proposals, as depicted in Figure 5. It takes the anchors as input, predicts an objectness score, and performs box regression. The objectness score expresses how likely an anchor is to contain an object rather than background, and the box regression values are the offsets from the anchor to the actual box. Therefore, for k anchors, the RPN predicts 2k scores and 4k box regression values. Although the initial number of anchors is not necessarily smaller than that of other region proposal methods, the RPN reduces it drastically by only keeping regions with a high objectness score.

Training the RPN is not trivial. Since this is a supervised learning approach, each anchor has to be labeled as either foreground or background. Therefore, every anchor is compared to every ground truth object by calculating their intersection over union (IoU). An anchor is considered foreground (positive) if its IoU with some ground truth object is greater than 0.7, and background (negative) if its IoU with every ground truth box is lower than 0.3. All anchors with an IoU between 0.3 and 0.7 are ignored. The distribution of positive and negative anchors is very imbalanced, because there are far more negatives than positives per ground truth box. Therefore, a minibatch with a fixed ratio of positives and negatives is sampled for training. If there are not enough positive anchors, the batch is filled with the anchors that have the highest IoU with the respective ground truth box; if there are still not enough, it is padded with negatives.

The RPN employs a multitask loss to optimize both objectives simultaneously: a binary cross-entropy loss for classification and a smooth L1 loss for bounding box regression.

Figure 5: Standard architecture of a region proposal network.
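The prediction part of the RPN can be sketched in a few lines of PyTorch. This is an illustrative stand-in for Figure 5, not the original implementation; the 512 channels match the VGG-16 setup from the paper:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        # The 3x3 "sliding window" from Figure 4, realized as a convolution.
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # 1x1 convolutions act as per-position fully connected layers.
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)  # object vs. background
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)  # box offsets per anchor

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

head = RPNHead()
scores, offsets = head(torch.randn(1, 512, 38, 50))
print(scores.shape, offsets.shape)  # (1, 18, 38, 50) and (1, 36, 38, 50)
```

Because every layer is convolutional, the same head works for any input size, which is what makes the RPN fully convolutional.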

The regression loss is only calculated for positive anchors. After prediction, non-maximum suppression (NMS) removes every region proposal whose IoU with a higher-scoring proposal exceeds a certain threshold. After NMS, the top N proposals are selected as the final region proposals.
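The following plain-NumPy sketch shows IoU computation and greedy NMS as described above; the 0.7 threshold and the top-N limit of 300 are typical values from the paper, but both are tunable:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.7, top_n=300):
    """Keep the highest-scoring boxes, dropping overlapping lower-scoring ones."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0 and len(keep) < top_n:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        # Discard remaining boxes that overlap the kept box too strongly.
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```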

Region of Interest Pooling

The next step after the RPN is to use the region proposals to predict object classes and locations. Instead of feeding each proposal through its own classification network, as R-CNN does, Faster R-CNN extracts the features from the shared feature map. Since the classifier expects a fixed-size input, fixed-size features have to be extracted for each region proposal. In modern implementations, the feature map is cropped to the region proposal and resized to a fixed size; max pooling then extracts the most salient features, yielding a fixed-size feature map per proposal that is fed into the final stage.
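A simplified sketch of ROI max pooling illustrates the idea: the proposal is projected onto the feature map, cropped, and max-pooled into a fixed grid. The 7x7 output size and the stride of 16 are the usual VGG-16 values; real implementations handle the quantization more carefully:

```python
import numpy as np

def roi_max_pool(feature_map, proposal, output_size=7, stride=16):
    """feature_map: (C, H, W); proposal: (x1, y1, x2, y2) in image coordinates."""
    # Project the proposal from image coordinates onto the feature map.
    x1, y1, x2, y2 = [int(round(c / stride)) for c in proposal]
    crop = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    c, h, w = crop.shape

    # Split the crop into output_size x output_size bins; take the max per bin.
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    pooled = np.zeros((c, output_size, output_size), dtype=feature_map.dtype)
    for i in range(output_size):
        for j in range(output_size):
            # Guarantee each bin contains at least one element.
            bin_ = crop[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[:, i, j] = bin_.max(axis=(1, 2))
    return pooled
```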

Classification and Regression

The last step in Faster R-CNN is the classification of the extracted features. To this end, the pooled features are flattened and passed through two fully connected layers, followed by a classification and a regression layer. The classification layer outputs N+1 predictions, one for each of the N classes plus one background class. The regression layer outputs 4N predictions, the regressed bounding box coordinates for each of the N classes.

The training of the R-CNN module is comparable to that of the RPN. A proposal is assigned to a ground truth box if their IoU is greater than 0.5. Proposals with an IoU between 0.1 and 0.5 are treated as negatives, i.e. background, and proposals with an IoU lower than 0.1 are ignored. Random sampling during training creates a mini-batch containing 25 percent foreground and 75 percent background proposals. The classification loss is a multiclass cross-entropy loss over all proposals in the mini-batch, whereas the localization loss only uses positive proposals. In order to remove duplicate detections, class-based NMS is applied. The final output is a list of all predicted objects with a probability higher than a specific threshold, often 0.5.
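Putting the head together, here is a minimal PyTorch sketch of the classification and regression layers described above; the 4096-dimensional hidden layers and the 20 classes (as in PASCAL VOC) are illustrative defaults, not fixed parts of the method:

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, in_features=512 * 7 * 7, num_classes=20):
        super().__init__()
        # Two shared fully connected layers on the flattened pooled features.
        self.fc = nn.Sequential(
            nn.Linear(in_features, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.cls = nn.Linear(4096, num_classes + 1)  # N classes + background
        self.reg = nn.Linear(4096, 4 * num_classes)  # one box per class

    def forward(self, pooled):                        # (num_proposals, C, 7, 7)
        x = self.fc(pooled.flatten(start_dim=1))
        return self.cls(x), self.reg(x)

head = DetectionHead()
scores, boxes = head(torch.randn(64, 512, 7, 7))
print(scores.shape, boxes.shape)  # (64, 21) and (64, 80)
```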

Conclusion

There are many different ways to detect objects in images. This blog post traced the history from slow networks that use selective search to generate region proposals to more sophisticated networks such as Faster R-CNN. If you want to get started with object detection, I recommend the Object Detection API by TensorFlow, which mainly features the Faster R-CNN and SSD architectures. Thanks for reading!

Sources
  • [1] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, OverFeat: Integrated recognition, localization and detection using convolutional networks, CoRR, vol. abs/1312.6229, 2014.
  • [2] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587, 2014.
  • [3] R. B. Girshick, Fast R-CNN, 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448, 2015.
  • [4] S. Ren, K. He, R. B. Girshick, and J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, pp. 1137–1149, 2015.
  • [5] J. Dai, Y. Li, K. He, and J. Sun, R-FCN: Object detection via region-based fully convolutional networks, in NIPS, 2016.
  • [6] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi, You only look once: Unified, real-time object detection, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788, 2016.
  • [7] J. Redmon and A. Farhadi, YOLO9000: Better, faster, stronger, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6517–6525, 2017.
  • [8] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, and A. C. Berg, SSD: Single shot multibox detector, in ECCV, 2016.
  • [9] D. A. Forsyth, Object detection with discriminatively trained part-based models, IEEE Computer, vol. 47, pp. 6–7, 2014.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, pp. 1904–1916, 2014.

About the author

Frederik Mattwich

Co-Founder Design AI
Frederik Mattwich is Co-Founder and CTO of Design AI, a start-up focusing on agile AI development and use case identification through Design Thinking. He is an experienced AI engineer with a background in Computer Science from the Technical University of Munich, focusing on Artificial Intelligence and Robotics. Besides 5+ years of experience in AI projects, he has 7+ years of experience in developing scalable software and infrastructure.