python, tensorflow, deep-learning, keras, object-detection

Single-shot Multibox Detection: usage of variance when encoding data for training


I can't understand the concept of "variance" when implementing the Single Shot MultiBox Detector (SSD) in code.
I am reading this and this repository.

When training, the localization targets are delta-encoded offsets (Δcx, Δcy, Δw, Δh) of the ground-truth bounding box relative to the default box (anchor box, prior box).
The part I do not understand is why the encoding divides Δcx and Δcy by a "variance" of 0.1, and Δw and Δh by 0.2.
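For concreteness, this is roughly the encode/decode pair I am asking about. It is a minimal NumPy sketch with made-up prior and ground-truth values and the commonly used default variances [0.1, 0.1, 0.2, 0.2]; the function names are my own, not taken from either repository:

```python
import numpy as np

def encode(gt, prior, variances=(0.1, 0.1, 0.2, 0.2)):
    """Turn a ground-truth box (cx, cy, w, h) into regression targets
    relative to a prior box, divided by the so-called variances."""
    d_cx = (gt[0] - prior[0]) / prior[2] / variances[0]
    d_cy = (gt[1] - prior[1]) / prior[3] / variances[1]
    d_w  = np.log(gt[2] / prior[2]) / variances[2]
    d_h  = np.log(gt[3] / prior[3]) / variances[3]
    return np.array([d_cx, d_cy, d_w, d_h])

def decode(deltas, prior, variances=(0.1, 0.1, 0.2, 0.2)):
    """Invert the encoding: apply predicted deltas to a prior box."""
    cx = prior[0] + deltas[0] * variances[0] * prior[2]
    cy = prior[1] + deltas[1] * variances[1] * prior[3]
    w  = prior[2] * np.exp(deltas[2] * variances[2])
    h  = prior[3] * np.exp(deltas[3] * variances[3])
    return np.array([cx, cy, w, h])

prior = np.array([0.5, 0.5, 0.2, 0.2])    # hypothetical default box
gt    = np.array([0.52, 0.48, 0.25, 0.18])  # hypothetical ground truth
deltas = encode(gt, prior)
print(deltas)                 # targets the network is trained to regress
print(decode(deltas, prior))  # recovers the ground-truth box
```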

Why is this necessary? Or rather, what effect does it have on the training outcome?
I also looked into the original Caffe implementation, but I couldn't find much explanation there, other than that the variances are applied when encoding during training and applied again when decoding at inference time.
I do not have much of a math background, but any pointers to the relevant math theory are welcome.
Thanks in advance!


Solution

  • There is a thread discussing this in the original Caffe implementation, and another one here in one of the repositories I was working from.
    The author of the SSD paper says:

    You can think of it as approximating a gaussian distribution for adjusting the prior box. Or you can think of it as scaling the localization gradient. Variance is also used in original MultiBox and Fast(er) R-CNN.

    The author of the repository I was working from says:

    Probably the naming comes from the idea that the ground-truth bounding boxes are not always precise; in other words, they vary from image to image, probably for the same object in the same position, just because human labellers cannot repeat themselves perfectly. Thus, the encoded values are somewhat random values, and we want them to have unit variance; that is why we divide by some value. Why they are initialized to the values used in the code, I have no idea; probably some empirical estimation by the authors.
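To connect the two explanations above, here is a small, hypothetical numeric sketch (the values and the plain L2 loss are my own choices, not taken from either repository). Dividing an encoded offset by a variance of 0.1 multiplies the regression target by 10, which brings typically small offsets closer to unit scale and, for a simple squared loss, multiplies the gradient with respect to the prediction by the same factor:

```python
# Illustration of the "scaling the localization gradient" interpretation.
# All numbers below are hypothetical.

raw_delta = 0.03            # e.g. (gt_cx - prior_cx) / prior_w, typically small
variance = 0.1

target_without = raw_delta              # 0.03
target_with    = raw_delta / variance   # 0.3  -> closer to unit scale

# For an L2 loss 0.5 * (pred - target)**2, the gradient w.r.t. pred is
# (pred - target); rescaling the target space rescales that gradient too.
pred = 0.0
grad_without = pred - target_without    # -0.03
grad_with    = pred - target_with       # -0.3

print(target_without, target_with)
print(grad_without, grad_with)
```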