This is the architecture of YOLO. I am trying to calculate the output size of each layer myself, but I can't get the sizes given in the paper.
For example, the first conv layer takes a 448x448 input and applies a 7x7 filter with stride 2. According to the equation W2 = (W1 − F + 2P)/S + 1 with P = 0, that gives (448 − 7 + 0)/2 + 1 = 221.5, which is not an integer, so the filter size does not seem to fit the input size.
Can anyone explain this? Did I miss something, or have I misunderstood the YOLO architecture?
As Hawx Won said, the input is padded with 3 extra pixels on each side. Here is how that works in the source code.
For convolutional layers, if pad is enabled, the padding value of each layer is calculated by:
/* In parser.c */
if(pad) padding = size/2;
/* In convolutional_layer.c */
l.pad = padding;
where size is the side length of the (square) filter. This is integer division in C, so the result is truncated.
So, for the first layer: padding = size/2 = 7/2 = 3
Then the output of the first convolutional layer is:
output_w = (input_w+2*pad-size)/stride+1 = (448+6-7)/2+1 = 224
output_h = (input_h+2*pad-size)/stride+1 = (448+6-7)/2+1 = 224
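The same arithmetic can be wrapped in a small helper to check other layers. This is only a sketch: conv_padding and conv_out_size are hypothetical names, not darknet functions, and they assume the pad/size/stride behaviour quoted above, with C integer division throughout.

```c
/* Hypothetical helper (not part of darknet): padding used when pad is
   enabled, mirroring `if(pad) padding = size/2;` with integer division. */
int conv_padding(int size, int pad_enabled)
{
    return pad_enabled ? size / 2 : 0;
}

/* Hypothetical helper: output width (or height) of a convolutional layer,
   using output = (input + 2*pad - size)/stride + 1 from the answer.
   All operands are ints, so the division truncates. */
int conv_out_size(int input, int size, int stride, int pad_enabled)
{
    int pad = conv_padding(size, pad_enabled);
    return (input + 2 * pad - size) / stride + 1;
}
```

With these definitions, conv_out_size(448, 7, 2, 1) gives 224, matching the paper. With padding disabled, conv_out_size(448, 7, 2, 0) gives 221, because integer division truncates the 221.5 from the question.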