tensorflow, deep-learning, computer-vision, conv-neural-network, mobilenet

MobileNet architecture


[Figure: MobileNet body architecture table, listing each layer's type/stride, filter shape, and input size]

The figure above shows the MobileNet architecture. In the very first row, the input size is given as 224x224x3, with a filter shape of 3x3x3x32 and a stride of 2. If we apply the formula out_size = ((input_size - filter_size + 2*padding) / stride) + 1 with padding = 0, we get out_size = ((224 - 3 + 2*0) / 2) + 1 = 111.5, but the second row lists the input size as 112x112x32. I'm new to these concepts; can anyone explain where I am going wrong?


Solution

  • You are not wrong: without padding, the output shape of the first 2D convolution layer would not match the architecture table.

    To reproduce the table, you must pad one side of the left-right dimension and one side of the top-bottom dimension (asymmetric padding). That gives an effective input of 225x225x3, which yields the expected output after a 2D convolution with stride 2 and a 3x3 kernel: ((225 - 3) / 2) + 1 = 112 (see the sketch at the end of this answer).

    With PyTorch, you can simply set padding=1 in

    torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=(3,3), stride=2, padding=1)
    

    In practice, PyTorch pads both sides of each spatial dimension by 1 (giving a 226x226 input), and the floor in its output-size formula discards the surplus row and column: floor((224 + 2*1 - 3) / 2) + 1 = 112. The result is a 112x112x32 output, which PyTorch returns in channels-first order as (32, 112, 112).
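
    For a quick sanity check, here is a minimal sketch of both routes (assuming PyTorch is installed; the layer names are illustrative and the weights are random, so only the output shapes are meaningful):

    import torch
    import torch.nn.functional as F

    # Dummy batch: 1 RGB image, 224x224, in PyTorch's channels-first (NCHW) layout.
    x = torch.randn(1, 3, 224, 224)

    # Route 1: pad one side of each spatial dimension (right and bottom),
    # giving the 225x225 effective input described above, then convolve
    # with no padding: ((225 - 3) / 2) + 1 = 112.
    x_asym = F.pad(x, (0, 1, 0, 1))  # pad order: (left, right, top, bottom)
    conv_nopad = torch.nn.Conv2d(in_channels=3, out_channels=32,
                                 kernel_size=(3, 3), stride=2, padding=0)
    print(conv_nopad(x_asym).shape)  # torch.Size([1, 32, 112, 112])

    # Route 2: symmetric padding=1; the floor in PyTorch's output-size
    # formula drops the surplus row/column, so the shape is identical.
    conv_pad1 = torch.nn.Conv2d(in_channels=3, out_channels=32,
                                kernel_size=(3, 3), stride=2, padding=1)
    print(conv_pad1(x).shape)        # torch.Size([1, 32, 112, 112])

    Both prints report a 112x112 spatial size with 32 channels, matching the second row of the architecture table.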