deep-learning computer-vision pytorch retinanet

Can anyone explain the "non-sliding window" statement in the Feature Pyramid Networks for Object Detection paper?

The Feature Pyramid Networks for Object Detection paper adopts the RPN technique to build its detector, and the RPN uses a sliding window to classify. So why does Section 5.2 refer to "non-sliding window" detectors?

The full statement in the paper: "5.2. Object Detection with Fast/Faster R-CNN. Next we investigate FPN for region-based (non-sliding window) detectors."

In my understanding, FPN uses a sliding window in the detection task. This is also mentioned in https://medium.com/@jonathan_hui/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c, which states:

"FPN extracts feature maps and later feeds into a detector, says RPN, for object detection. RPN applies a sliding window over the feature maps to make predictions on the objectness (has an object or not) and the object boundary box at each location."

Thank you in advance.

Solution

A Feature Pyramid Network (FPN) is not an RPN.

FPN is just a better way to do feature extraction. It combines features from several stages of the backbone, which gives better features to the rest of the object detection pipeline. In particular, it brings in the higher-resolution feature maps from the early stages, which helps with detecting small and medium-sized objects.
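The merging described above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: each feature map is a single-channel 2D list, each stage is exactly half the size of the previous one, and the 1x1 lateral convolutions and 3x3 smoothing convolutions used in the paper are left out. The names `upsample2x`, `add_maps`, and `top_down_merge` are made up for this example.

```python
def upsample2x(fmap):
    """Nearest-neighbour upsampling of a 2D list by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def add_maps(a, b):
    """Element-wise sum of two 2D lists with the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_merge(stages):
    """Merge backbone stages ordered fine -> coarse into pyramid levels.

    The coarsest map is kept as is; each finer level is its own map plus
    the upsampled level above it (assumption: no learned convolutions).
    """
    merged = [stages[-1]]
    for lateral in reversed(stages[:-1]):
        merged.append(add_maps(lateral, upsample2x(merged[-1])))
    merged.reverse()  # back to fine -> coarse order
    return merged


# Toy backbone outputs: 4x4, 2x2 and 1x1 maps with constant values.
c2 = [[1] * 4 for _ in range(4)]
c3 = [[10] * 2 for _ in range(2)]
c4 = [[100]]

p2, p3, p4 = top_down_merge([c2, c3, c4])
print(p3)  # every cell is 10 + 100
print(p2[0])  # every cell is 1 + 10 + 100
```

Every level of the resulting pyramid carries information from the coarser, more semantic stages, and any head (an RPN, a Fast R-CNN classifier, or something else) can consume it.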

As the original paper states: "Our goal is to leverage a ConvNet’s pyramidal feature hierarchy, which has semantics from low to high levels, and build a feature pyramid with high-level semantics throughout. The resulting Feature Pyramid Network is general purpose and in this paper we focus on sliding window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN)"

So they evaluate FPN in a two-stage object detection pipeline. The first stage is the RPN, which they evaluate in Section 5.1; the second stage is the region classification stage, which they evaluate in Section 5.2.

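The detection heads of Fast R-CNN, Faster R-CNN, etc. are region-based, not sliding-window. They receive a fixed set of candidate regions from a proposal stage such as the RPN and classify only those, rather than scoring every location of the feature map. The RPN that produces the proposals is the sliding-window part, which is why the Medium quote you cite is also correct.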
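The distinction can be made concrete with a toy sketch in plain Python. It is a simplification, not the actual RPN or RoI pooling implementation: the feature map is a single-channel 2D list, boxes are integer `(x0, y0, x1, y1)` tuples with an exclusive end, and each box side is assumed to be at least `out_size` cells long. The function names are hypothetical.

```python
def sliding_window_scores(fmap, score_fn):
    """Sliding-window style: apply score_fn at every spatial location."""
    return [[score_fn(v) for v in row] for row in fmap]


def bin_edges(start, length, n):
    """Split [start, start + length) into n contiguous integer bins."""
    return [start + (k * length) // n for k in range(n + 1)]


def roi_max_pool(fmap, box, out_size=2):
    """Region-based style: max-pool only the cells inside one proposal box.

    Assumes each side of the box spans at least out_size cells.
    """
    x0, y0, x1, y1 = box
    ys = bin_edges(y0, y1 - y0, out_size)
    xs = bin_edges(x0, x1 - x0, out_size)
    return [
        [
            max(
                fmap[y][x]
                for y in range(ys[i], ys[i + 1])
                for x in range(xs[j], xs[j + 1])
            )
            for j in range(out_size)
        ]
        for i in range(out_size)
    ]


# A 4x4 toy feature map holding the values 0..15.
fmap = [[r * 4 + c for c in range(4)] for r in range(4)]

# Stage 1 (RPN-like): one prediction per location, 16 in total.
dense = sliding_window_scores(fmap, lambda v: v > 7)

# Stage 2 (Fast R-CNN-like): work is done only for the given proposals.
proposals = [(0, 0, 2, 2), (1, 1, 4, 4)]
pooled = [roi_max_pool(fmap, box) for box in proposals]

print(sum(len(row) for row in dense))  # number of dense predictions
print(len(pooled))                     # number of region predictions
```

The first function produces an output for every position regardless of content, while the second produces one fixed-size feature per proposal, which is what "region-based (non-sliding window)" refers to in Section 5.2.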

A good explanation of the differences is at https://medium.com/@jonathan_hui/what-do-we-learn-from-region-based-object-detectors-faster-r-cnn-r-fcn-fpn-7e354377a7c9.