Max-pooling is useful in vision for two reasons:
By eliminating non-maximal values, it reduces computation for upper layers.
It provides a form of translation invariance. Imagine cascading a max-pooling layer with a convolutional layer. There are 8 directions in which one can translate the input image by a single pixel. If max-pooling is done over a 2x2 region, 3 out of these 8 possible configurations will produce exactly the same output at the convolutional layer. For max-pooling over a 3x3 window, this jumps to 5/8.
Since it provides additional robustness to position, max-pooling is a “smart” way of reducing the dimensionality of intermediate representations.
I can't understand, what does 8 directions
mean? And what does
"If max-pooling is done over a 2x2 region, 3 out of these 8 possible configurations will produce exactly the same output at the convolutional layer. For max-pooling over a 3x3 window, this jumps to 5/8."
mean?
There are 8 directions in which one can translate the input image by a single pixel.
They are considering 2 horizontal, 2 vertical and 4 diagonal 1-pixel shifts. That gives 8 in total.
If max-pooling is done over a 2x2 region, 3 out of these 8 possible configurations will produce exactly the same output at the convolutional layer. For max-pooling over a 3x3 window, this jumps to 5/8.
Imagine we are taking the maximum value in a 2x2 region of an image. The image is pre-convolved, though it doesn't matter for the purpose of this explanation.
No matter where exactly in a 2x2 region the maximum value resides, there will be 3 possible 1-pixel translations of the image that result in the maximum value remaining in that particular 2x2 region. Of course an even greater value may be brought from a neighbouring region, but that's beside the point. The point is you get some translation invariance.
With a 3x3 region it gets more complex, as the number of 1-pixel translations that keep the maximum value within the region depends on where exactly in the region that maximum value resides. The 5 translations they mention correspond to a location in the middle of an edge in a 3x3 pixel block. A corner location will give 3 translations, while the center one will give all 8.