Tags: pytorch, object-detection, torchvision, single-shot-detector

What do Conv4_x and Conv8_x mean in SSD?


In the SSD paper, the base network is VGG16, and extra feature layers are appended to it so that feature maps are produced at different scales and aspect ratios.

My question is about the architecture shown in Fig. 2 of the SSD paper (shown below): the convolution layers carry notations like Conv5_3 and Conv4_3 in the base network, and Conv8_2, Conv9_2, Conv10_2 in the added feature layers.

What do the _2 and _3 suffixes mean in these convolution layer names?

I have seen the same notation used on the SSD model description page, where the base network is changed from VGG16 to ResNet50 and names like Conv5_x and Conv4_x appear.

What does this _x mean in the convolution layer notation?

(Note: the SSD model and the VGG16 model, up to the point where VGG16 serves as the base network in SSD, have the same layers, yet they report different output feature-map shapes; checked with torchinfo.summary(model, (1, 3, 300, 300)) for both the VGG16 and the SSD per-layer output feature maps.)


Solution

  • The paper alone is arguably not very helpful in describing the notation that you are interested in; however, the corresponding code repository adds a bit more information:

    If we look at ssd_pascal.py, for example, we can see where layers of such name are created (starting from line 23):

        out_layer = "conv6_1"
        ConvBNLayer(net, from_layer, out_layer, use_batchnorm, use_relu, 256, 1, 0, 1,
            lr_mult=lr_mult)
    
        from_layer = out_layer
        out_layer = "conv6_2"
        ConvBNLayer(net, from_layer, out_layer, use_batchnorm, use_relu, 512, 3, 1, 2,
            lr_mult=lr_mult)
    

    Now, we should also take a look at the definition of ConvBNLayer in model_libs.py (starting from line 30):

    def ConvBNLayer(net, from_layer, out_layer, use_bn, use_relu, num_output,
        kernel_size, pad, stride, dilation=1, use_scale=True, lr_mult=1,
        conv_prefix='', conv_postfix='', bn_prefix='', bn_postfix='_bn',
        scale_prefix='', scale_postfix='_scale', bias_prefix='', bias_postfix='_bias',
        **bn_params):
    

    Then we can piece this information together and add a bit of guesswork:

    • Technically, conv#_1 and conv#_2 (where # is a placeholder for the actual layer number) are always two convolutional network layers, created by calling ConvBNLayer, that follow right after one another in the network architecture (thus, the output of conv#_1 is the input of conv#_2). They differ in the number of output channels (e.g. 256 vs. 512 above), the kernel size (e.g. 1 vs. 3 above), the padding amount (e.g. 0 vs. 1 above), and the stride (e.g. 1 vs. 2 above).

    • Logically, the authors seem to consider the two layers as sublayers of the same network layer, hence the _1 and _2 suffixes. So, conv6_1 would be convolutional layer 6, sublayer 1, and conv6_2 would be convolutional layer 6, sublayer 2. This becomes clear when looking at the figure, for example at the illustration of Conv8_2 and what is written below it:

      • Illustrated is what the authors consider the output of the 8th layer of their network: a 10x10x512 feature-space representation of the input image.
      • "Conv: 1x1x256" means, we first have a sublayer (which should go by the name Conv8_1) with kernel size 1 and 256 output channels, which is followed by …
      • "Conv: 3x3x512-s2", i.e. a sublayer (Conv8_2) with kernel size 3, 512 output channels, and a stride of 2.
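      Put in PyTorch terms, the Conv8_1/Conv8_2 pair can be sketched as follows (a hedged illustration, not the reference implementation; the input shape 19x19x1024 is the Conv7/fc7 output shown in the figure):

      ```python
      import torch
      import torch.nn as nn

      # Conv8_1: 1x1 kernel, 256 output channels (channel reduction)
      conv8_1 = nn.Conv2d(1024, 256, kernel_size=1)
      # Conv8_2: 3x3 kernel, 512 output channels, stride 2 (halves spatial size)
      conv8_2 = nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1)

      x = torch.randn(1, 1024, 19, 19)   # Conv7 (fc7) output in SSD300
      y = conv8_2(torch.relu(conv8_1(x)))
      print(y.shape)                     # torch.Size([1, 512, 10, 10])
      ```

      The 1x1 sublayer only mixes channels, while the strided 3x3 sublayer produces the 10x10x512 feature map the figure attributes to layer 8.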

    Note that the names and numbers of the layers in the figure don't match those of ssd_pascal.py (the latter ends after conv9_2, which has a different stride than Conv9_2 in the figure), but the scheme should be the same, assuming that the authors worked with a certain consistency.
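    The same block/sublayer reading also applies to the VGG16 base network itself: conv4_3 is the third convolution in the fourth block, where blocks are separated by max-pooling layers. A small sketch over torchvision's VGG16 (assuming torchvision is available) recovers these names:

    ```python
    import torch.nn as nn
    import torchvision

    # Walk vgg16.features and assign paper-style conv<block>_<sublayer>
    # names: the block counter advances at every max-pool, the sublayer
    # counter counts convolutions within the current block.
    vgg = torchvision.models.vgg16()
    names = {}
    block, sub = 1, 0
    for idx, layer in enumerate(vgg.features):
        if isinstance(layer, nn.Conv2d):
            sub += 1
            names[f"conv{block}_{sub}"] = idx
        elif isinstance(layer, nn.MaxPool2d):
            block, sub = block + 1, 0
    print(names["conv4_3"], names["conv5_3"])  # indices of these layers in vgg.features
    ```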

    As to your final question: In the SSD description page, where they write e.g.

    • The conv5_x, avgpool, fc and softmax layers were removed from the original classification model.
    • All strides in conv4_x are set to 1x1.

    I assume that "x" simply serves as a placeholder for the suffixes _1, _2, and thus should be read as follows: conv5_1 and conv5_2 were removed, and all strides in conv4_1 and conv4_2 are set to 1x1.
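    For illustration, the second modification could be sketched like this on torchvision's ResNet50, assuming (as is conventional for ResNets) that conv4_x corresponds to resnet.layer3. This is a guess at the kind of change the description page refers to, not their actual code:

    ```python
    import torch.nn as nn
    import torchvision

    resnet = torchvision.models.resnet50()
    # conv4_x in ResNet-paper numbering is layer3 in torchvision.
    # Setting every stride in it to 1 preserves spatial resolution,
    # which gives the detector larger feature maps to predict on.
    for m in resnet.layer3.modules():  # also covers the downsample shortcut conv
        if isinstance(m, nn.Conv2d):
            m.stride = (1, 1)
    ```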