In CNN, the filters are usually set as 3x3, 5x5 spatially. Can the sizes be comparable to the image size? One reason is for reducing the number of parameters to be learnt. Apart from this, is there any other key reasons? for example, people want to detect edges first?
You answer a point of the question. Another reason is that most of these useful features may be found in more than one place in an image. So, it makes sense to slide a single kernel all over the image in the hope of extracting that feature in different parts of the image using the same kernel. If you are using big kernel, the features could be interleaved and not concretely detected.
In addition to yourself answer, reduction in computational costs is a key point. Since we use the same kernel for different set of pixels in an image, the same weights are shared across these pixel sets as we convolve on them. And as the number of weights are less than a fully connected layer, we have lesser weights to back-propagate on.