Tags: image-processing, sift

Why extract SIFT features on patches instead of the whole image?


Some people extract SIFT features from patches of an image, such as "128-dimensional SIFT descriptors were computed over 16×16 pixel patches, sampled densely over a grid with a regular spacing of 8 pixels in both the horizontal and vertical directions".

Why don't they extract SIFT from original images directly? What's the advantage of extracting SIFT from patches of original images like this?

Thanks!


Solution

  • First I want to say that a SIFT feature IS a 128-dimensional descriptor. The 128 dimensions are calculated from a 16x16 neighborhood centered on the actual point of interest (the extremum obtained from the DoG). That part is well defined (see Lowe's papers for more information).

    The subjective part is why they would sample over a grid with a regular 8-pixel spacing. The only reasons I can think of come down to reducing computation time:

    1. It creates a known number of descriptors. If the image is MxN, then the number of descriptors is roughly (M/8) x (N/8). Running SIFT on a whole image may produce many descriptors clustered together, and the count is potentially unbounded. Since each descriptor is expensive to compute, reducing their number reduces the computation time. Even a small 100x100 image could have hundreds of descriptors; this method would reduce that to ~144.

    2. Finding keypoints is an intensive task in its own right. It involves checking every single voxel of the DoG pyramid for an extremum (a max or min) centered on that voxel, for each octave and every 3 scales of the DoG. If you can skip this step and just place a keypoint at every position of an 8-pixel grid, you eliminate the costly operation of going through the entire DoG across all octaves and scales.
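To make point 2 concrete, here is a minimal NumPy sketch (with hypothetical helper names, not code from any SIFT library) of the per-voxel extremum test that dense grid sampling lets you skip. A voxel is an extremum only if its value is strictly greater than, or strictly less than, all 26 neighbors in its 3x3x3 scale-space neighborhood:

```python
import numpy as np

def is_extremum(dog, s, y, x):
    """Check whether voxel (s, y, x) of a DoG stack is a local extremum
    of its 26-neighborhood (3 adjacent scales x 3x3 spatial window)."""
    cube = dog[s-1:s+2, y-1:y+2, x-1:x+2]        # 3x3x3 neighborhood
    neighbors = np.delete(cube.ravel(), 13)       # drop the center voxel
    center = dog[s, y, x]
    return bool(center > neighbors.max() or center < neighbors.min())

# A single spike in an otherwise flat DoG stack is detected as an extremum.
dog = np.zeros((3, 5, 5))
dog[1, 2, 2] = 5.0
print(is_extremum(dog, 1, 2, 2))  # True
```

Running this test at every interior voxel, for every octave and scale, is what makes keypoint detection expensive; a fixed grid replaces all of it with simple index arithmetic.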

    Again, these are only my opinions, but I hope they help you out a little.
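As a hypothetical sketch of the dense sampling from the question (names are illustrative, not from any library): generate the top-left corner of every 16x16 patch on a grid with 8-pixel stride, which fixes the descriptor count in advance. The count comes out slightly below the rough (M/8) x (N/8) estimate above because only patches that fit fully inside the image are kept:

```python
def dense_patch_grid(height, width, patch=16, stride=8):
    """Top-left corners of every full `patch`x`patch` window,
    sampled on a regular grid with `stride`-pixel spacing."""
    return [(y, x)
            for y in range(0, height - patch + 1, stride)
            for x in range(0, width - patch + 1, stride)]

corners = dense_patch_grid(100, 100)
print(len(corners))           # 11 * 11 = 121 patches fit fully inside
M, N = 100, 100
print((M // 8) * (N // 8))    # 144, the rough border-ignoring estimate
```

One 128-D SIFT descriptor would then be computed per patch, so the descriptor count depends only on image size, never on image content.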