machine-learning computer-vision computer-science yolo conv-neural-network

How to calculate the output size of a convolutional layer in YOLO?

This is the architecture of YOLO. I am trying to calculate the output size of each layer myself, but I can't get the size as described in the paper.

For example, the first conv layer takes a 448x448 input and uses a 7x7 filter with stride 2. But according to the equation W2 = (W1 − F + 2P)/S + 1 = (448 − 7 + 0)/2 + 1, I can't get an integer result, so the filter size seems unsuitable for the input size.

Can anyone explain this? Did I miss something or misunderstand the YOLO architecture?

Solution

As Hawx Won said, the input image is padded with 3 extra pixels on each side. Here is how that works in the source code.

For convolutional layers, if pad is enabled, the padding value of each layer is calculated by:

// In parser.c
if(pad) padding = size/2;

// In convolutional_layer.c
l.pad = padding;

Here, size is the filter width (7 for a 7x7 filter).

So, for the first layer: padding = size/2 = 7/2 = 3 (C integer division truncates 3.5 to 3).

Then the output of the first convolutional layer should be:

output_w = (input_w+2*pad-size)/stride+1 = (448+6-7)/2+1 = 224

output_h = (input_h+2*pad-size)/stride+1 = (448+6-7)/2+1 = 224
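To check other layers the same way, the arithmetic above can be wrapped in two small helpers. This is only a sketch: the names conv_padding and conv_out_size are made up for illustration and are not functions in the YOLO source, and returning 0 when pad is disabled is an assumption (it ignores any explicitly configured padding value). Both divisions are C integer divisions, as in the quoted snippets.

```c
#include <assert.h>

/* Hypothetical helper mirroring the parser.c rule quoted above:
   when pad is enabled, padding = size/2 with integer division (7/2 == 3).
   Assumption: padding is 0 when pad is disabled. */
int conv_padding(int pad_enabled, int size)
{
    return pad_enabled ? size / 2 : 0;
}

/* Hypothetical helper: output width (or height) of a convolutional layer,
   using the same formula as the worked example above.
   The division by stride is integer division, so it truncates. */
int conv_out_size(int input, int size, int stride, int pad)
{
    assert(stride > 0); /* a zero or negative stride makes no sense here */
    return (input + 2 * pad - size) / stride + 1;
}
```

For the first layer, conv_out_size(448, 7, 2, conv_padding(1, 7)) evaluates to 224, matching the paper. With no padding, conv_out_size(448, 7, 2, 0) evaluates to 221, because 441/2 truncates to 220, whereas the real-valued formula from the question gives 221.5.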