machine-learning neural-network computer-vision conv-neural-network object-detection

Can we replace anchor boxes in object detection with multiple bounding box predictors?


Many popular, state-of-the-art object detection algorithms such as YOLO and SSD use the concept of anchor boxes. As far as I understand, for networks like YOLO v3, each output grid cell has multiple anchor boxes with different aspect ratios. For detection, the network predicts offsets for the anchor box with the highest overlap with the given object. Why is this used instead of having multiple bounding box predictors (each predicting x, y, w, h and c)?


Solution

  • No, anchor boxes cannot be simply replaced by multiple bounding box predictors.

    Your description contains a minor misunderstanding:

    "For detection, the network predicts offsets for the anchor box with the highest overlap with the given object."

    Selecting the anchor box with the highest overlap with a ground truth box happens only during training, as explained in Section 2.2 (Matching Strategy) of the SSD paper. Not only is the anchor box with the highest overlap selected as a match, but so is every anchor box whose IoU with a ground truth box is greater than 0.5.
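
    The matching step described above can be sketched in plain Python. This is a minimal illustration, not the SSD reference implementation: the function names `iou` and `match_anchors`, the corner format (x_min, y_min, x_max, y_max), and the choice to assign each above-threshold anchor to its best-overlapping ground truth box are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes, threshold=0.5):
    """Return {anchor_index: gt_index} for anchors treated as positives in training.

    Assumes `anchors` is non-empty.
    """
    matches = {}
    # Rule 2: any anchor whose IoU with a ground truth exceeds the threshold
    # is matched (here: to the ground truth it overlaps most).
    for ai, anchor in enumerate(anchors):
        best_gi, best_iou = None, threshold
        for gi, gt in enumerate(gt_boxes):
            overlap = iou(anchor, gt)
            if overlap > best_iou:
                best_gi, best_iou = gi, overlap
        if best_gi is not None:
            matches[ai] = best_gi
    # Rule 1: every ground truth gets its single best anchor, even if that
    # anchor's IoU is below the threshold. Applied last so it takes priority.
    for gi, gt in enumerate(gt_boxes):
        best_ai = max(range(len(anchors)), key=lambda ai: iou(anchors[ai], gt))
        matches[best_ai] = gi
    return matches
```

    Anchors that do not appear in the returned dictionary would be treated as negatives (background) during training.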

    At prediction time, the box predictor predicts the four offsets for every anchor box, together with confidences for all categories.
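
    As a rough sketch of what happens at prediction time, the snippet below turns per-anchor offsets and class confidences into boxes. It assumes a center-size parameterization in the style of SSD and Faster R-CNN (without variance scaling); YOLO uses a different encoding. The names `decode` and `detect` and the score threshold are hypothetical, and non-maximum suppression and background-class handling are left out.

```python
import math


def decode(anchor, offsets):
    """Apply predicted offsets (t_x, t_y, t_w, t_h) to an anchor (cx, cy, w, h)."""
    a_cx, a_cy, a_w, a_h = anchor
    t_x, t_y, t_w, t_h = offsets
    return (
        a_cx + t_x * a_w,      # shift the center, scaled by the anchor size
        a_cy + t_y * a_h,
        a_w * math.exp(t_w),   # rescale width and height multiplicatively
        a_h * math.exp(t_h),
    )


def detect(anchors, offsets, scores, score_threshold=0.5):
    """Decode every anchor and keep those whose best class score is high enough."""
    detections = []
    for anchor, anchor_offsets, class_scores in zip(anchors, offsets, scores):
        best_class = max(range(len(class_scores)), key=lambda c: class_scores[c])
        if class_scores[best_class] >= score_threshold:
            box = decode(anchor, anchor_offsets)
            detections.append((box, best_class, class_scores[best_class]))
    return detections
```

    Note that no overlap with a ground truth box is involved here: every anchor produces a candidate box, and the confidences decide which candidates survive.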

    This brings us to the question of why the network predicts offsets instead of the box attributes (x, y, w, h) directly.

    In short, this is related to scale. I agree with @viceriel's answer on this, but here is a vivid example.

    Suppose the following two images of the same size (the left one has a blue background) are fed to the predictor, and we want to get the bbox for the dog. The red bbox in each image represents an anchor box, and both are nearly perfect bboxes for the dog. If we predict offsets, the box predictor only needs to predict 0 for the four offsets in both cases. If you instead use multiple predictors that output the box attributes directly, the model has to give two different sets of values for w and h, while x and y are the same. This is essentially what @viceriel means by saying that predicting offsets presents a less difficult mapping for the predictor to learn.

    [Figure: two images of the same size, each with a red anchor box around the dog]

    This example also explains why anchor boxes can help improve a detector's performance.
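
    The dog example can be checked numerically. The sketch below assumes boxes in (cx, cy, w, h) format and a center-size offset encoding of the kind commonly used with anchor boxes; the function name `encode` is hypothetical and the box coordinates are made-up numbers standing in for the two images.

```python
import math


def encode(gt, anchor):
    """Offsets that map an anchor (cx, cy, w, h) onto a ground truth box."""
    g_cx, g_cy, g_w, g_h = gt
    a_cx, a_cy, a_w, a_h = anchor
    return (
        (g_cx - a_cx) / a_w,
        (g_cy - a_cy) / a_h,
        math.log(g_w / a_w),
        math.log(g_h / a_h),
    )


# Hypothetical boxes: same center, different scale (small dog vs. large dog).
small_dog = (100.0, 100.0, 40.0, 30.0)
large_dog = (100.0, 100.0, 160.0, 120.0)

# Each image has an anchor that already fits the dog perfectly.
small_anchor = small_dog
large_anchor = large_dog

# Offset targets are identical in both images.
print(encode(small_dog, small_anchor))  # (0.0, 0.0, 0.0, 0.0)
print(encode(large_dog, large_anchor))  # (0.0, 0.0, 0.0, 0.0)

# Direct (x, y, w, h) targets share x and y but differ in w and h.
print(small_dog[2:], large_dog[2:])     # (40.0, 30.0) (160.0, 120.0)
```

    With offsets, the regression target stays the same regardless of the object's scale; with direct prediction, the predictor has to produce values that vary over the whole range of object sizes.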