
SSD Inception v2: is the VGG16 feature extractor replaced by Inception v2?


In the original SSD paper they used a VGG16 network for feature extraction. I am using the SSD Inception v2 model from the TensorFlow model zoo, and I do not know how its architecture differs. This Stack Overflow post suggests that for other models like SSD MobileNet, the VGG16 feature extractor is replaced by the MobileNet feature extractor.

I thought the same would be the case with SSD Inception, but this paper has me confused. From here it seems that the Inception is added to the SSD part of the model, while the VGG16 feature extractor remains at the beginning of the architecture. Figure from the paper: Inception Single Shot MultiBox Detector for object detection.

What is the architecture of the SSD Inception v2 model?


Solution

In the TensorFlow Object Detection API, the ssd_inception_v2 model uses inception_v2 as the feature extractor; that is, the vgg16 part in the first figure (figure (a)) is replaced with inception_v2.
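
To see this wiring for yourself, you can build the model from its pipeline config and inspect the feature extractor. A minimal sketch, assuming the TF1-era Object Detection API is installed and the sample ssd_inception_v2_coco.config has been downloaded (the file path is illustrative):

    # Build the ssd_inception_v2 model from its pipeline config.
    from object_detection.builders import model_builder
    from object_detection.utils import config_util

    configs = config_util.get_configs_from_pipeline_file('ssd_inception_v2_coco.config')
    model = model_builder.build(model_config=configs['model'], is_training=False)

    # The feature extractor is a pluggable component, selected by the
    # feature_extractor { type: 'ssd_inception_v2' } field of the config;
    # _feature_extractor is a private attribute, used here only for inspection.
    print(type(model._feature_extractor))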

In SSD models, the feature maps taken from the feature extractor (i.e. vgg16, inception_v2, mobilenet) are further processed to produce extra feature layers of different resolutions. In figure (a) above, there are 6 output feature layers; the first two (38x38 and 19x19 in the figure) are taken directly from the feature extractor. How are the other 4 layers (10x10, 5x5, 3x3, 1x1) generated?

They are generated by extra convolutional operations (these conv operations are rather like very shallow feature extractors, aren't they?). The implementation is here, provided with good documentation, which says:

    Note that the current implementation only supports generating new layers 
    using convolutions of stride 2 (resulting in a spatial resolution reduction 
    by a factor of 2)
    

That is how each extra feature map shrinks by a factor of 2, and if you read the function multi_resolution_feature_maps, you will find slim.conv2d operations being used, which indicates that these extra layers are obtained with extra convolution layers (just one layer each!).
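
As a rough illustration of what multi_resolution_feature_maps does, here is a minimal sketch that generates extra maps with stride-2 convolutions. It is illustrative only: the real implementation uses slim.conv2d, inserts 1x1 bottleneck convolutions, and exposes depth and padding options.

    import tensorflow as tf

    def extra_feature_maps(base_map, depths=(512, 256, 256, 128)):
        """base_map: the last feature-extractor map, e.g. shape (N, 19, 19, C)."""
        maps = [base_map]
        x = base_map
        for depth in depths:
            # Each stride-2 SAME convolution halves the spatial size,
            # rounding up: 19 -> 10 -> 5 -> 3 -> 2. The 10/5/3/1 sizes in
            # the SSD paper's figure come from a slightly different
            # stride/padding arrangement in the last two layers.
            x = tf.keras.layers.Conv2D(depth, 3, strides=2, padding='same',
                                       activation='relu')(x)
            maps.append(x)
        return maps

    base = tf.keras.Input(shape=(19, 19, 1024))
    for fmap in extra_feature_maps(base):
        print(fmap.shape)  # (None, 19, 19, 1024), (None, 10, 10, 512), ...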

Now we can explain what is improved in the paper you linked. It proposes replacing these extra feature layers with inception blocks. There is no inception_v2 model involved, simply an inception block, and the paper reports improved classification accuracy from using it.
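
For intuition, an inception block is just a set of parallel convolution branches concatenated along the depth axis. Here is a hedged sketch of one that could stand in for a plain stride-2 extra layer; the branch widths are made up for illustration and are not the paper's configuration:

    import tensorflow as tf
    from tensorflow.keras import layers

    def inception_block(x, stride=2):
        # Parallel branches with different receptive fields.
        b1 = layers.Conv2D(64, 1, strides=stride, padding='same', activation='relu')(x)
        b2 = layers.Conv2D(64, 1, padding='same', activation='relu')(x)
        b2 = layers.Conv2D(96, 3, strides=stride, padding='same', activation='relu')(b2)
        b3 = layers.Conv2D(32, 1, padding='same', activation='relu')(x)
        b3 = layers.Conv2D(64, 3, strides=stride, padding='same', activation='relu')(b3)
        b3 = layers.Conv2D(64, 3, padding='same', activation='relu')(b3)
        pool = layers.MaxPool2D(3, strides=stride, padding='same')(x)
        # All branches end at the same spatial size, so they concatenate cleanly.
        return layers.Concatenate()([b1, b2, b3, pool])

    x = tf.keras.Input(shape=(10, 10, 512))
    print(inception_block(x).shape)  # (None, 5, 5, 64 + 96 + 64 + 512)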

So the answer to the question should now be clear: SSD models with vgg16, inception_v2 or mobilenet feature extractors all exist, but the "Inception" in that paper refers only to an inception block, not the Inception network.