I've been reading up a bit on different CNNs for object detection, and have found that most of the models I'm looking at are fully convolutional networks, like the latest YOLO versions and retinanet.
What are the benefits of FCNs over conventional CNNs with pooling, apart from FCNs having less different layers? I've read https://arxiv.org/pdf/1412.6806.pdf and as I read it the main interest of that paper was to simplify the networks structure. Is this the sole reason that modern detection/classification networks don't use pooling, or are there other benefits?
With FCNs we avoid the use of dense layers, which means less parameters and because of that we can make the network learn faster.
If you avoid pooling, your output will be of the same height/width of your input. But our goal is to reduce the size of the convolutions because it is much more computationally efficient. Also, with pooling we can go deeper, as we go through higher layers individual neurons “see” more of the input. In addition, it helps to propagate information across different scales.
Usually those networks consists of a down-sampling path to extract all the necessary features and an up-sampling path to reconstruct high-level features back to the original image dimensions.
There are some architecture like "The all convolutional net" by. Springenberg, that avoids in a sense pooling in favor of speed and simplicity. In this paper the author replaced all pooling operations with stride-2 convolutions and used a global average pooling at the output layer. The global averaging pooling operation reduce the dimension of the given input.