Tags: computer-vision, object-detection, image-recognition, bounding-box, faster-rcnn

What are scale-invariance and log-space translations of a bounding box?


In the R-CNN paper, the goal of bounding-box regression is to learn a transformation that maps a proposed bounding box P to a ground-truth box G. The transformation is parameterized in terms of four functions dx(P), dy(P), dw(P), dh(P).

  • The first two specify a scale-invariant translation of the center of P's bounding box, while

  • the second two specify log-space translations of the width and height of P's bounding box relative to the object proposal.

The same technique is used for bounding-box prediction in the Fast R-CNN paper too.

Question 1: Could anyone help me understand the relevance of both scale-invariance and log-space for the bounding box, and how these functions capture these two aspects?

Question 2: How is the scale-invariant translation mentioned above different from achieving scale-invariant object detection (explained below)?

I mean, in Fast R-CNN the author points out two ways to achieve scale invariance in object detection:

  • First, the brute-force approach: each image is processed at a pre-defined pixel size during both training and testing, and the network must learn scale-invariant object detection directly from the training data.

  • The second approach is using image pyramids.

Please feel free to cite the research papers so that I can read them for in-depth understanding.


Solution

  • The goal of these functions dx(P), dy(P), dw(P), dh(P) is to transform the proposal box into the ground-truth box. They are modeled as linear functions of the features pooled from the feature maps, and they contain learnable parameters (weights).

    The paper states that dx(P) and dy(P) specify a scale-invariant translation of the center of P's bounding box. Note the wording: they specify the translation, they are not the translation itself. The translation looks like this:

    Ĝx = Pw · dx(P) + Px
    Ĝy = Ph · dy(P) + Py

    To understand what scale-invariant means, start from why it is needed: proposal bboxes come in different sizes. In the pic below, the proposal bboxes for the batter and the pitcher are of different sizes, yet after ROI pooling both are represented as feature vectors of the same fixed shape (FIXED AND SAME SHAPE!!). When the regressor makes a prediction, it simply outputs the values dx(P) and dy(P); it cannot tell which proposal bbox the feature vector came from. When these values are applied back to the input image, the information carried by the proposal bbox (Px, Py, Pw, Ph) is already available, so the bbox center in the input image can be computed directly by the transformation! (Note that both proposals are of class person, so the same regressor is used; for different classes the regressors are different.)

    (figure: proposal bboxes of different sizes around the batter and the pitcher)
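    The scale-invariant effect of dx(P) and dy(P) can be sketched in a few lines of Python (apply_deltas is an illustrative helper, not code from either paper): the same predicted deltas shift a small proposal a little and a large proposal a lot, in proportion to the proposal's size.

    ```python
    import math

    def apply_deltas(box, dx, dy, dw, dh):
        """Map a proposal box (cx, cy, w, h) to a predicted box
        using the R-CNN parameterization."""
        px, py, pw, ph = box
        gx = pw * dx + px        # scale-invariant x translation
        gy = ph * dy + py        # scale-invariant y translation
        gw = pw * math.exp(dw)   # log-space width scaling
        gh = ph * math.exp(dh)   # log-space height scaling
        return gx, gy, gw, gh

    # Two proposals of different sizes, same predicted deltas:
    small = (100.0, 100.0, 20.0, 40.0)   # (center x, center y, w, h)
    large = (300.0, 100.0, 80.0, 160.0)

    print(apply_deltas(small, 0.5, 0.0, 0.0, 0.0))  # center shifts by 0.5 * 20 = 10 px
    print(apply_deltas(large, 0.5, 0.0, 0.0, 0.0))  # center shifts by 0.5 * 80 = 40 px
    ```

    The same dx = 0.5 moves each box half its own width, which is exactly what makes the prediction independent of the proposal's scale.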

    As for the latter two transformations:

    Ĝw = Pw · exp(dw(P))
    Ĝh = Ph · exp(dh(P))

    If you take the log of both sides, you will see that:

    dw(P) = log(Ĝw / Pw)
    dh(P) = log(Ĝh / Ph)


    dw(P) and dh(P) specify a log-space translation!
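    Going the other way, the regression targets the network is trained on can be sketched like this (regression_targets is a hypothetical helper; the parameterization follows the equations in the R-CNN paper):

    ```python
    import math

    def regression_targets(proposal, gt):
        """Compute R-CNN regression targets (tx, ty, tw, th) for a
        proposal and a ground-truth box, both given as (cx, cy, w, h)."""
        px, py, pw, ph = proposal
        gx, gy, gw, gh = gt
        tx = (gx - px) / pw      # scale-invariant center offsets
        ty = (gy - py) / ph
        tw = math.log(gw / pw)   # log-space size translations
        th = math.log(gh / ph)
        return tx, ty, tw, th

    proposal = (50.0, 50.0, 40.0, 40.0)
    gt       = (54.0, 52.0, 80.0, 20.0)
    tx, ty, tw, th = regression_targets(proposal, gt)
    # A ground truth twice as wide gives tw = log(2); one half as tall
    # gives th = -log(2), so doubling and halving are symmetric targets.
    print(tx, ty, tw, th)
    ```

    The log keeps the width/height targets symmetric around zero (scaling by k and by 1/k give targets of equal magnitude and opposite sign), which is friendlier for a regressor than raw size ratios.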

    As for the second question: bounding-box regression is only one part of the whole detection pipeline, and it is used only for refining box coordinates. Besides bbox regression, object detection also has to handle image classification, proposal generation, etc. Image pyramids, for example, are applied during proposal generation.

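    The pyramid idea can be roughly illustrated as follows (image_pyramid and its parameters are hypothetical, not from the papers): each level is a rescaled copy of the input, so generating proposals at every level lets the detector see objects at multiple scales.

    ```python
    def image_pyramid(width, height, scale=0.5, min_size=32):
        """Return the (width, height) of each pyramid level, repeatedly
        downscaling until the shorter side drops below min_size."""
        sizes = []
        w, h = width, height
        while min(w, h) >= min_size:
            sizes.append((w, h))
            w, h = int(w * scale), int(h * scale)
        return sizes

    # A 640x480 input produces five levels before the short side
    # falls under 32 px:
    print(image_pyramid(640, 480))
    ```

    Running the proposal generator on every level is the "image pyramid" route to scale invariance, as opposed to the brute-force single-scale approach where the network must learn it from data.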