
Deep Learning - Feature Pyramid Network - How to understand the downsampling notation?


I have a question about the notation of the downsampling process in the feature pyramid network (FPN) architecture. I'm not sure whether Stack Overflow is actually the best place for this question, so any hints about better venues are very welcome.

My question can best be illustrated with the following image from a presentation of one of the original authors of FPN:

Encoder of an FPN

Source: http://presentations.cocodataset.org/COCO17-Stuff-FAIR.pdf, Slide 11

The scale annotations of 1 and 1/4 make sense to me. Obviously, we start at full scale, and after one pooling step we have a scale of 1/4, because we downsized by a factor of 2 in both the x- and y-directions. But as far as I understand, following the same logic, at the next stage (i.e. after the next pooling) we should have a scale of 1/16, after the next step 1/64, etc. What am I missing?


Solution

  • After one pooling step you get a scale of 1/2, not 1/4: the scale refers to the change along each axis, not to the ratio of the areas. So why is there a factor of 1/4 at the beginning? As slide 11 states, the drawing refers to a ResNet/ResNeXt backbone. Looking at the ResNet architecture, the stem first applies a 7x7 convolution with stride 2, followed by a pooling layer with stride 2, so overall the first stage reduces each axis by a factor of 4. Each subsequent stage downsamples by only one stride-2 step, so the per-axis scale simply halves each time: 1/8, 1/16, 1/32.
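This can be checked with a small sketch. Assuming (as in a standard ResNet backbone) that the stem contributes two stride-2 reductions and each later stage one, the per-axis scale is just the product of the reciprocal strides:

```python
# Per-axis scale after each downsampling step (assumption: standard
# ResNet-style backbone as described above, not any specific library).
stem_strides = [2, 2]       # 7x7 conv stride 2, then pooling stride 2
stage_strides = [2, 2, 2]   # one stride-2 reduction per later stage

scale = 1.0
scales = []
for s in stem_strides + stage_strides:
    scale /= s              # scale changes per axis, not per area
    scales.append(scale)

print(scales)  # [0.5, 0.25, 0.125, 0.0625, 0.03125]
```

The values 1/4, 1/8, 1/16, 1/32 appear after the stem and each stage, matching the annotations on the slide; the area ratio would instead be the square of each value.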