Tags: machine-learning, computer-vision, computer-science, yolo, conv-neural-network

How to calculate the output size of a convolutional layer in YOLO?


YOLO Architecture

This is the architecture of YOLO. I am trying to calculate the output size of each layer myself, but I can't reproduce the sizes given in the paper.

For example, the first conv layer takes a 448x448 input and uses a 7x7 filter with stride 2. But according to the equation W2 = (W1 − F + 2P)/S + 1 = (448 − 7 + 0)/2 + 1 = 221.5, the result is not an integer, so the filter size seems unsuitable for the input size.

Can anyone explain this? Did I miss something or misunderstand the YOLO architecture?


Solution

  • As Hawx Won said, the input is padded with 3 extra pixels on each side. Here is how that works in the source code.


    For convolutional layers, if pad is enabled, the padding of each layer is calculated as:

    /* In parser.c */
    if(pad) padding = size/2;

    /* In convolutional_layer.c */
    l.pad = padding;
    

    Here size is the side length of the square filter (7 for a 7x7 filter).


    So, for the first layer: padding = size/2 = 7/2 = 3 (integer division in C, so the result is truncated).

    The output of the first convolutional layer is then (again with integer division, 447/2 = 223):

    output_w = (input_w + 2*pad - size)/stride + 1 = (448 + 6 - 7)/2 + 1 = 223 + 1 = 224

    output_h = (input_h + 2*pad - size)/stride + 1 = (448 + 6 - 7)/2 + 1 = 223 + 1 = 224
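
    The arithmetic above can be sketched as two small C helpers. This is only an illustration, not darknet's own code: the function names conv_padding and conv_out_size are made up for this example, and it assumes square filters and the same padding on every side.

```c
#include <assert.h>

/* Hypothetical helper (not a darknet function): padding chosen the same way
   as the parser.c line quoted above, i.e. size/2 with integer division when
   pad is enabled, and 0 otherwise. */
int conv_padding(int size, int pad_enabled)
{
    return pad_enabled ? size / 2 : 0;
}

/* Hypothetical helper: output width (or height) of a convolution.
   All operands are ints, so the division truncates toward zero. */
int conv_out_size(int in, int size, int stride, int pad)
{
    assert(stride > 0);
    return (in + 2 * pad - size) / stride + 1;
}
```

    With these definitions, conv_out_size(448, 7, 2, conv_padding(7, 1)) evaluates to 224, matching the paper. With padding disabled, the same call gives 221, because integer division silently truncates the 221.5 from the question.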