The figure above shows the MobileNet architecture. In the very first row the input size is given as 224x224x3, with a filter shape of 3x3x3x32 and a stride of 2. If I apply the formula out_size = ((input_size - filter_size + 2*padding)/stride) + 1 with padding = 0, I get out_size = (224 - 3 + 2(0))/2 + 1 = 111.5, but in the second row the input size is given as 112x112x32. I'm new to these concepts, can anyone explain where I am going wrong?
You are not wrong: without padding, the first 2D convolution would produce a 111x111x32 output (the formula's 111.5 gets floored to 111), not the 112x112x32 shown in the table.
To implement it, you must pad one side only of the left-right dimension and one side only of the top-bottom dimension. That gives an effective input of 225x225x3, and (225 - 3)/2 + 1 = 112, the output size you expect from a 3x3 convolution with stride 2.
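As a minimal sketch of that asymmetric padding (padding on the right and bottom here is an assumption, mirroring how TensorFlow's 'same' padding behaves for stride 2), you could do it manually in PyTorch with torch.nn.functional.pad:

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 3, 224, 224)  # one 224x224 RGB image, NCHW layout

    # Pad one column on the right and one row on the bottom only
    # (F.pad takes (left, right, top, bottom) for the last two dims):
    x_padded = F.pad(x, (0, 1, 0, 1))
    print(x_padded.shape)  # torch.Size([1, 3, 225, 225])

    # A 3x3 stride-2 convolution with no further padding now yields 112x112:
    conv = torch.nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=0)
    print(conv(x_padded).shape)  # torch.Size([1, 32, 112, 112])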
With PyTorch, you can simply set padding=1 in torch.nn.Conv2d(in_channels=3, out_channels=32, kernel_size=(3,3), stride=2, padding=1).
Note that padding=1 pads both sides of each spatial dimension (224 becomes 226), but the floor in the output-size formula discards the extra row and column, so you still get an output of shape (32, 112, 112) (PyTorch orders dimensions channels-first).
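To double-check the shape, here is a short runnable sketch (random weights, shape check only):

    import torch
    import torch.nn as nn

    # First MobileNet convolution: 3x3 kernel, 32 filters, stride 2, padding 1
    conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=(3, 3), stride=2, padding=1)

    x = torch.randn(1, 3, 224, 224)  # batch of one 224x224 RGB image
    print(conv(x).shape)  # torch.Size([1, 32, 112, 112])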