I'm reading a book where a section introduces how kernels work in CNNs: https://freecontent.manning.com/deep-learning-for-image-like-data/.
Sliding a kernel over an image, and requiring that at each position the whole kernel lies completely within the image, yields an activation map with reduced dimensions. For example, if you have a 3 x 3 kernel, on all sides one pixel is knocked off in the resulting activation map; in the case of a 5 x 5 kernel, even two pixels.
What does it mean here for one or two pixels to be "knocked off"?
They mean that without extra padding, a 3x3 kernel will "lose" one pixel per side in the output. So if your input image is NxN, the output will be (N-2)x(N-2).
For example, with N=5 you can see that when the kernel "fits" into the lower-right corner, its center is still one pixel away from the image border in both the horizontal and vertical axes (x marks the kernel, X its center; b marks the output pixels, B the one produced at this kernel position; dots are the knocked-off border):
a a a a a        . . . . .
a a a a a        . b b b .
a a x x x  ===>  . b b b .
a a x X x        . b b B .
a a x x x        . . . . .

  5 x 5            3 x 3
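To make the size arithmetic concrete, here is a minimal NumPy sketch of this "valid" sliding (the helper name convolve_valid is just for illustration; like CNN layers, it actually computes a cross-correlation):

```python
import numpy as np

def convolve_valid(image, kernel):
    """Slide `kernel` over `image`, keeping only positions where the
    whole kernel lies inside the image ("valid" mode, no padding)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))  # (N-2)x(N-2) for a 3x3 kernel
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise product of the patch under the kernel, summed up.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.ones((5, 5))   # 5x5 input (the 'a' grid above)
kernel = np.ones((3, 3))  # 3x3 kernel (the 'x' grid above)
print(convolve_valid(image, kernel).shape)  # (3, 3): one pixel lost per side
```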
To avoid this issue, various padding strategies are used, e.g. "surrounding your picture" with 0s so that the size is preserved:
0 0 0 0 0 0 0        . . . . . . .
0 a a a a a 0        . b b b b b .
0 a a a a a 0        . b b b b b .
0 a a a a a 0  ===>  . b b b b b .
0 a a a x x x        . b b b b b .
0 a a a x X x        . b b b b B .
0 0 0 0 x x x        . . . . . . .

 5 x 5 + pad             5 x 5
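Reusing the sketch above, zero padding of width 1 (that is, (k-1)/2 for a 3x3 kernel) restores the 5x5 output size:

```python
# Pad width (k-1)/2 = 1 for a 3x3 kernel gives "same" output size.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)  # 7x7
print(convolve_valid(padded, kernel).shape)  # (5, 5): input size preserved
```

Deep learning frameworks expose the same choice directly, e.g. via the padding argument of torch.nn.Conv2d.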