matlab image-processing computer-vision matlab-cvst

Computing Histogram of Oriented Gradients on a single point

How does the HOG (Histogram of Oriented Gradients) work on a single point in an image? I am using the Computer Vision Toolbox's version of the HOG descriptor: https://www.mathworks.com/help/vision/ref/extracthogfeatures.html.

I want to calculate HOG at a specific point in an image. What are the best cell size and block size for calculating HOG at a single point?

Solution

Please note that there is no such thing as a HOG descriptor for a single point. The HOG descriptor is a dense descriptor: it is computed over a block of pixels surrounding the point of interest. The extractHOGFeatures function takes in an image and, optionally, the coordinates at which you would like to compute the HOG descriptor. These are the (x,y), or column and row, locations in the image. You specify them as an N x 2 matrix, with each row being the (x,y) coordinate of a point where you want the HOG descriptor to be computed.
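As a minimal sketch of that call, assuming the Computer Vision Toolbox is installed and that the sample image cameraman.tif (shipped with the Image Processing Toolbox) is on the path; the two point locations are arbitrary choices for illustration:

```matlab
% Load a grayscale test image (assumed available; any grayscale image works).
I = imread('cameraman.tif');

% Each row is one (x, y) location, i.e. (column, row), not (row, column).
points = [100 120;
          150  80];

% Extract one HOG descriptor per point using the default parameters.
features = extractHOGFeatures(I, points);
```

Note the (x,y) ordering: a pixel at row 120, column 100 is written as [100 120].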

Recall that we compute the HOG descriptor on a local patch of pixels. The default size of this local patch, or the cell size, is 8 x 8, as per the original pedestrian detection algorithm by Dalal and Triggs. Assuming that we ignore the sign of the orientation, the default number of bins in the histogram is 9, which corresponds to 20-degree increments spanning a total of 180 degrees. For each 8 x 8 patch, we have a 9-bin histogram. You also consider a composition of cells, which is called a block. Each block is a grid of cells, and the default in MATLAB is a 2 x 2 grid, making this a 16 x 16 pixel window.
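The arithmetic behind those defaults can be written out as follows. The variable names are only illustrative; they mirror the CellSize, BlockSize and NumBins parameters but are not passed to any function here:

```matlab
cellSize  = [8 8];   % pixels per cell
blockSize = [2 2];   % cells per block
numBins   = 9;       % unsigned orientation bins

blockPixels = cellSize .* blockSize;   % window covered by one block, in pixels
binWidth    = 180 / numBins;           % width of each orientation bin, in degrees
```

Here blockPixels evaluates to [16 16] and binWidth to 20.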

Therefore, each location that you specify serves as the centre of a 16 x 16 window. You then compute four HOG histograms - one for each cell within the block. As a final step, we concatenate all of the histograms together to make one long histogram, so 4 x 9 = 36 elements, and we further normalize this vector to compute the final descriptor representative of that coordinate.
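To make these steps concrete, here is a simplified sketch in base MATLAB. It is not what extractHOGFeatures does internally: it assumes hard binning (no interpolation between bins or cells), central-difference gradients and plain L2 normalization, so the values will differ from the toolbox output. The window win is a synthetic stand-in for a 16 x 16 grayscale patch.

```matlab
% Simplified single-block HOG (illustrative only).
win = double(magic(16));          % stand-in for a 16 x 16 grayscale window

cellSize = 8;
numBins  = 9;
binWidth = 180 / numBins;         % 20 degrees per bin

% Gradient magnitude and unsigned orientation in [0, 180).
[gx, gy] = gradient(win);
mag = hypot(gx, gy);
ang = mod(atan2d(gy, gx), 180);

% Hard-assign each pixel to one orientation bin (1..numBins).
bin = min(floor(ang / binWidth) + 1, numBins);

descriptor = zeros(1, 4 * numBins);
k = 0;
for c = 0:1                       % cell column within the block
    for r = 0:1                   % cell row within the block
        rows = r * cellSize + (1:cellSize);
        cols = c * cellSize + (1:cellSize);
        b = bin(rows, cols);
        m = mag(rows, cols);
        % Magnitude-weighted histogram for this cell.
        h = accumarray(b(:), m(:), [numBins 1])';
        descriptor(k * numBins + (1:numBins)) = h;
        k = k + 1;
    end
end

% Normalize the concatenated 36-element vector.
descriptor = descriptor / (norm(descriptor) + eps);
```

The result is a 1 x 36 vector with unit length, which matches the structure described above even though the exact numbers depend on the binning and normalization details.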

The output is a matrix with N rows, where each row is the descriptor at the centre location given by the corresponding row of the input location matrix. With the default parameters, this is an N x 36 matrix.
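One caveat worth checking in practice: the function has a second output, validPoints, which holds the input locations whose surrounding window lies fully inside the image. The sketch below assumes cameraman.tif is available and deliberately places one point near the corner, where a 16 x 16 window should not fit, so you can inspect how many rows come back:

```matlab
I = imread('cameraman.tif');

points = [128 128;   % well inside the image
            2   2];  % very close to the top-left corner

[features, validPoints] = extractHOGFeatures(I, points);

% The rows of features correspond to the rows of validPoints.
disp(size(features));
disp(size(validPoints));
```

If some of your points lie near the border, use validPoints rather than the original points matrix to match descriptors to locations.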

As for the optimal block size and cell size, that depends on your application. For the majority of use cases, however, and in OpenCV in particular, the default is an 8 x 8 cell size with a block size of 2 x 2, thus creating a 16 x 16 pixel patch. Dalal and Triggs experimented with different cell and block sizes and found that these two gave the best results for their use case of pedestrian detection.
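If you want to experiment, the cell and block sizes can be changed through the 'CellSize' and 'BlockSize' name-value arguments. The following is a rough sketch, again assuming cameraman.tif is available; the candidate cell sizes are arbitrary. The descriptor length should stay at 36, because it depends on the number of cells per block and the number of bins, while the pixel window around the point grows with the cell size:

```matlab
I = imread('cameraman.tif');
point = [128 128];                 % single (x, y) location near the image centre

cellSizes = [4 8 16];              % candidate cell sizes to compare
lens = zeros(size(cellSizes));

for i = 1:numel(cellSizes)
    cs = cellSizes(i);
    f = extractHOGFeatures(I, point, 'CellSize', [cs cs], 'BlockSize', [2 2]);
    lens(i) = numel(f);
    fprintf('CellSize %2d: window %2d x %2d px, %d features\n', ...
            cs, 2 * cs, 2 * cs, lens(i));
end
```

Which setting is best can only be judged against your own task, for example by comparing matching or classification accuracy on held-out data.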