Tags: deep-learning, object-detection, object-detection-api

From what aspects should the performance of an object detector be measured?


I am on the hook to measure the prediction results of an object detector. I learned from some tutorials that when testing a trained object detector, the following information is provided for each object in a test image:

    <object>
    <name>date</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
        <xmin>451</xmin>
        <ymin>182</ymin>
        <xmax>695</xmax>
        <ymax>359</ymax>
    </bndbox>
</object>

However, it is still unclear to me 1) how this information is used by the object detector to measure accuracy, and 2) how the "loss" is computed in this case. Is it a strict comparison? For instance, suppose that for the object "date" I got the following output:

    <object>
    <name>date</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
        <xmin>461</xmin>  <---- different
        <ymin>182</ymin>
        <xmax>695</xmax>
        <ymax>359</ymax>
    </bndbox>
</object>

Should I then conclude that my object detector got something wrong? Or is some small delta tolerated, so that a bounding box with a slight drift is still acceptable, while a completely wrong "label" definitely counts as an error?

This is like a "black box" to me and it would be great if someone could shed some light on it. Thank you.


Solution

  • For the object detection task, the usual performance metric is mean average precision (mAP).

    1) The above information contains the detected object class as well as the bounding box; both are needed to compute mAP. There are good blog posts explaining how mAP is calculated. A key concept in the mAP calculation is Intersection over Union (IoU), which measures how much a detected bounding box overlaps a ground-truth box. Usually a detected bounding box must have an IoU above a threshold (e.g. 0.5) to count as correctly locating an object. Based on the IoU threshold, each detection can be labelled as a true positive (TP), true negative (TN), false positive (FP) or false negative (FN), from which further accuracy metrics are calculated.
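    As a rough, framework-independent illustration, the IoU between the ground-truth box and the slightly shifted prediction from the question can be computed like this (the coordinates are taken from the XML snippets above):

    ```python
    def iou(box_a, box_b):
        # Boxes are (xmin, ymin, xmax, ymax).
        ix_min = max(box_a[0], box_b[0])
        iy_min = max(box_a[1], box_b[1])
        ix_max = min(box_a[2], box_b[2])
        iy_max = min(box_a[3], box_b[3])
        iw = max(0.0, ix_max - ix_min)
        ih = max(0.0, iy_max - iy_min)
        inter = iw * ih
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    ground_truth = (451, 182, 695, 359)   # annotation from the question
    prediction   = (461, 182, 695, 359)   # xmin shifted by 10 px
    print(iou(ground_truth, prediction))  # ~0.959, well above 0.5 -> true positive
    ```

    So a 10-pixel drift on one edge still yields an IoU of about 0.96, and with a 0.5 threshold the detection would be counted as correct.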

    2) The loss in an object detection task consists of two parts: a classification loss and a bounding-box regression loss. The total loss is usually a weighted sum of the two, so the weights can be tuned to emphasize either classification or box regression.

    In the example you gave, the detector has classified the object correctly but the bounding box is not fully accurate, so the classification loss is 0 while the bounding-box regression loss is not. The model therefore knows the prediction is still imperfect and keeps learning to produce better boxes. If the label is completely wrong, only the classification loss contributes.
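    A minimal sketch of such a weighted two-part loss, assuming a smooth-L1 box loss (as used in Faster R-CNN-style detectors) and cross-entropy for classification; real detectors normalize and encode box coordinates rather than using raw pixels, and the weights and class names here are illustrative:

    ```python
    import math

    def smooth_l1(pred, target, beta=1.0):
        # Smooth L1 (Huber) loss summed over the four box coordinates.
        total = 0.0
        for p, t in zip(pred, target):
            d = abs(p - t)
            total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
        return total

    def cross_entropy(probs, true_class):
        # Classification loss: negative log-probability of the true class.
        return -math.log(probs[true_class])

    # Hypothetical prediction for the "date" example: correct class, xmin off by 10 px.
    cls_probs = {"date": 1.0}            # assume the detector is fully confident
    pred_box  = (461, 182, 695, 359)
    gt_box    = (451, 182, 695, 359)

    w_cls, w_box = 1.0, 1.0              # weights tuned per model
    loss = w_cls * cross_entropy(cls_probs, "date") + w_box * smooth_l1(pred_box, gt_box)
    # Classification loss is 0 (correct, confident class); all of the loss
    # comes from the box regression term.
    ```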

    The actual loss calculation is also related to IoU. An IoU threshold is predefined so the model can choose which predicted bounding boxes participate in the loss computation. This is needed because many predicted boxes usually cluster around the same object, so it is better to select one or a few of them rather than all of them.
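    The selection step can be sketched as follows. This is a simplification of the matching done in real detectors (which also handle multiple ground-truth boxes and background anchors); the function names and the 0.5 threshold are illustrative:

    ```python
    def iou(a, b):
        # Boxes are (xmin, ymin, xmax, ymax).
        iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union else 0.0

    def select_for_loss(pred_boxes, gt_box, pos_thresh=0.5):
        # Only predictions overlapping the ground truth enough contribute to the
        # box-regression loss; the rest are treated as background.
        return [b for b in pred_boxes if iou(b, gt_box) >= pos_thresh]

    gt = (451, 182, 695, 359)
    preds = [
        (461, 182, 695, 359),   # slight drift -> selected
        (455, 180, 700, 360),   # close       -> selected
        (100, 100, 200, 200),   # no overlap  -> background, ignored
    ]
    print(select_for_loss(preds, gt))  # keeps only the first two boxes
    ```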