image-processing computer-vision object-detection

Histogram of Oriented Gradients

I have been reading the theory behind HOG descriptors for object (human) detection, but I have a question about an implementation detail that may seem minor.

Regarding the window that contains the blocks: should the window be moved over the image pixel by pixel, so that successive windows overlap, as illustrated here: [image]

or should the window be moved so that successive windows never overlap, as here: [image]

The illustrations that I have seen so far use the second approach. But with a detection window of size 64x128, sliding the window without overlap will often fail to cover the whole image. For example, if the image is 64x255, the last 127 rows of pixels will never be checked for an object. So the first approach seems more reasonable, although it costs more time and CPU.
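To make the arithmetic above concrete, here is a minimal sketch along one image axis. The helper names `window_origins` and `uncovered` are made up for illustration, and the sketch assumes only windows that fit entirely inside the image are evaluated (no padding).

```python
def window_origins(image_len, window_len, stride):
    """Start coordinates along one axis where a full window still fits."""
    if image_len < window_len:
        return []
    return list(range(0, image_len - window_len + 1, stride))


def uncovered(image_len, window_len, stride):
    """Number of trailing pixels along one axis that no window touches."""
    origins = window_origins(image_len, window_len, stride)
    if not origins:
        return image_len
    return image_len - (origins[-1] + window_len)


# Vertical axis of a 64x255 image scanned with a 128-pixel-tall window.
no_overlap_gap = uncovered(255, 128, 128)   # stride equal to the window height
per_pixel_gap = uncovered(255, 128, 1)      # window moved one pixel at a time
per_pixel_count = len(window_origins(255, 128, 1))

print(no_overlap_gap, per_pixel_gap, per_pixel_count)
```

Under these assumptions the non-overlapping scan leaves 127 rows unchecked, while the pixel-by-pixel scan covers every row at the cost of 128 window evaluations instead of one.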

Any ideas? Thank you in advance.

EDIT: I am trying to stick to the original paper by Dalal and Triggs. One paper that implements the algorithm and uses the second approach can be found here: http://www.cs.bilkent.edu.tr/~cansin/projects/cs554-vision/pedestrian-detection/pedestrian-detection-paper.pdf

Solution

EDIT: Sorry -- I misunderstood your question. (Also, the answer I provided to the wrong question was in error -- I've since adjusted that below for context.)

You're asking about using the HOG descriptor for detection, not generating the HOG descriptor.

In the implementation paper you reference above, it looks like they are overlapping the detection window. The window size is 64x128, while they use a horizontal stride of 32 pixels and a vertical stride of 64 pixels, so neighbouring windows overlap by half in each direction. They also mention that they tried smaller stride values, but this led to a higher false positive rate (in the context of their implementation).
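As a rough sketch of what such a scan looks like, the following generator lists the top-left corners of a 64x128 window moved with strides of 32 and 64 pixels. The function name `sliding_windows` and the 128x256 example image are assumptions for illustration, not taken from the paper, and windows that would extend past the image border are simply skipped.

```python
def sliding_windows(img_w, img_h, win_w=64, win_h=128, stride_x=32, stride_y=64):
    """Yield (x, y) top-left corners of every window that fits in the image."""
    for y in range(0, img_h - win_h + 1, stride_y):
        for x in range(0, img_w - win_w + 1, stride_x):
            yield x, y


# Hypothetical 128x256 image: 3 horizontal positions times 3 vertical positions.
corners = list(sliding_windows(128, 256))
print(len(corners))
print(corners)
```

Each corner would then be passed to the HOG descriptor and classifier. Setting both strides to the window size gives the non-overlapping scan, and setting them to 1 gives the pixel-by-pixel scan from the question.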

On top of that, they're using 3 scales of the input image: 1, 1/2, and 1/4. They don't mention any corresponding scaling of the detection window -- I'm not sure what effect that would have from a detection standpoint. It seems that this would implicitly create overlap as well.
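One way to reason about the fixed window on rescaled images: scanning a downscaled copy with the same 64x128 window is equivalent to scanning the original image with a proportionally larger window and stride. A minimal sketch of that mapping, where the helper name `footprint_in_original` is hypothetical and the stride values are the ones quoted above:

```python
def footprint_in_original(scale, win_w=64, win_h=128, stride_x=32, stride_y=64):
    """Window size and stride, expressed in pixels of the unscaled image."""
    return {
        "window": (round(win_w / scale), round(win_h / scale)),
        "stride": (round(stride_x / scale), round(stride_y / scale)),
    }


# The three scales mentioned in the implementation paper.
for scale in (1, 0.5, 0.25):
    print(scale, footprint_in_original(scale))
```

So the coarser scales look for larger objects, and the regions they examine overlap with those examined at the finer scales.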

Original answer (corrected):

Looking at the Dalal and Triggs paper (section 6.4), they mention both i) no block overlap and ii) half- and quarter-block overlap when generating the HOG descriptor. Based on their results, it sounds like greater overlap produced better detection performance (albeit at a greater resource/processing cost).
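To see where the extra cost of block overlap comes from, here is a small sketch that counts blocks and descriptor length for a 64x128 window. It assumes the commonly cited configuration of 8x8-pixel cells, 2x2-cell (16x16-pixel) blocks and 9 orientation bins; the function name `hog_size` is made up for illustration.

```python
def hog_size(win_w=64, win_h=128, block=16, block_stride=8,
             cells_per_block=4, bins=9):
    """Return (number of blocks, descriptor length) for one detection window."""
    blocks_x = (win_w - block) // block_stride + 1
    blocks_y = (win_h - block) // block_stride + 1
    n_blocks = blocks_x * blocks_y
    return n_blocks, n_blocks * cells_per_block * bins


no_overlap = hog_size(block_stride=16)    # blocks tile the window exactly
half_overlap = hog_size(block_stride=8)   # each block shifted by half its size
print(no_overlap, half_overlap)
```

Under these assumptions, half-block overlap gives 105 blocks and a 3780-dimensional descriptor, versus 32 blocks and 1152 dimensions with no overlap, so each cell contributes to several differently normalized blocks.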