image-processing computer-vision sift vlfeat

VLFeat: computation of number of octaves for SIFT

I am trying to go through and understand some of VLFeat code to see how they generate the SIFT feature points. One thing that has me baffled early on is how they compute the number of octaves in their SIFT computation.

So according to the documentation, if one provides a negative value for the initial number of octaves, it will compute the maximum which is given by log2(min(width, height)). The code for the corresponding bit is:

if (noctaves < 0) {
   noctaves = VL_MAX (floor (log2 (VL_MIN(width, height))) - o_min - 3, 1) ;
}

This code is in the function is in the vl_sift_new function. Here o_min is supposed to be the index of the first octave (I guess one does not need to start with the full resolution image). I am assuming this can be set to 0 in most use cases.

So, still I do not understand why they subtract 3 from this value. This seems very confusing. I am sure there is a good reason but I have not been able to figure it out.

Solution

The reason why they subtract by 3 is to ensure a minimum size of the patch you're looking at to get some appreciable output. In addition, when analyzing patches and extracting out features, depending on what algorithm you're looking at, there is a minimum size patch that the feature detection needs to get a good output and so subtracting by 3 ensures that this minimum patch size is met once you get to the lowest octave.

Let's take a numerical example. Let's say we have a 64 x 64 patch. We know that at each octave, the sizes of each dimension are divided by 2. Therefore, taking the log2 of the smallest of the rows and columns will theoretically give you the total number of possible octaves... as you have noticed in the above code. In our case, either the rows and columns are the minimum value, and taking the log2 of either the rows or columns gives us 7 octaves theoretically (log2(64) = 7). The octaves are arranged like so:

Octave |    Size
--------------------
   1   |  64 x 64
   2   |  32 x 32
   3   |  16 x 16
   4   |   8 x 8
   5   |   4 x 4
   6   |   2 x 2
   7   |   1 x 1

However, looking at octaves 5, 6 and 7 will probably not give you anything useful and so there's actually no point in analyzing those octaves. Therefore by subtracting by 3 from the total number of octaves, we will stop analyzing things at octave 4, and so the smallest patch to analyze is 8 x 8.

As such, this subtraction is commonly performed when looking at scale-spaces in images because this enforces that the last octave is of a good size to analyze features. The number 3 is arbitrary. I've seen people subtract by 4 and even 5. From all of the feature detection code that I have seen, 3 seems to be the most widely used number. So with what I said, it wouldn't really make much sense to look at an octave whose size is 1 x 1, right?