python tensorflow google-colaboratory object-detection-api

How to solve "Variable is available in checkpoint, but has an incompatible shape with model variable"?

I'm trying to retrain an existing pretrained network from the TensorFlow Object Detection API: ssd_mobilenet_v2, pre-trained on the COCO dataset. I followed the steps in the tutorial that accompanies the Object Detection API.

The model starts training anyway, but the mAP is very low. I'm new to CNNs, so any help is appreciated.

When I start training, the warnings shown below appear, and I can't find a fix.

I'm running it in a Google Colaboratory notebook with this command:

# Training
!python object_detection/model_main.py \
--pipeline_config_path=${PIPELINE_CONFIG_PATH} \
--model_dir=${MODEL_DIR} \
--num_train_steps=${NUM_TRAIN_STEPS} \
--sample_1_of_n_eval_examples=$SAMPLE_1_OF_N_EVAL_EXAMPLES \
--alsologtostderr

These are the warnings I get:

WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_2_3x3_s2_512/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 256, 512]], model variable shape: [[3, 3, 256, 512]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_3_3x3_s2_256/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 128, 256]], model variable shape: [[3, 3, 128, 256]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_4_3x3_s2_256/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 128, 256]], model variable shape: [[3, 3, 128, 256]]. This variable will not be initialized from the checkpoint.
WARNING:root:Variable [FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_5_3x3_s2_128/weights] is available in checkpoint, but has an incompatible shape with model variable. Checkpoint shape: [[1, 1, 64, 128]], model variable shape: [[3, 3, 64, 128]]. This variable will not be initialized from the checkpoint.

After running for about 10 minutes, it prints this:

Accumulating evaluation results...
DONE (t=1.73s).
Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.002
Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.000
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.006
Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.040
Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.002
Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.026
Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.050

I haven't changed the *.ckpt files; I just downloaded the original pretrained ssd_mobilenet_v2_coco_2018_03_29 checkpoint and pointed to its files in the .config file.

I've been trying to figure this out for more than a day. Thank you for any help.

Solution

I recently ran into the same issue as Miroslav (exact same 4 warning messages). While @GPhilo is right that this warning message means that the checkpoint doesn't match the model, it appears that there was an issue generating this specific pre-trained checkpoint. Specifically, the ssd_mobilenet_v2_coco_2018_03_29.tar.gz checkpoint seems to have been generated using a pre-release version of the config file. Here is the link to the related issue on GitHub: https://github.com/tensorflow/models/issues/5315
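To confirm which variables disagree before training, the shapes stored in the checkpoint can be compared with the shapes the model expects. The following is only a minimal sketch: the helper names load_checkpoint_shapes and find_shape_mismatches are hypothetical, and it assumes you have already collected the model's variable shapes into a dict (for example, by copying them from the warning text). tf.train.list_variables is the standard TensorFlow call for listing the names and shapes in a checkpoint.

```python
def load_checkpoint_shapes(checkpoint_prefix):
    """Return {variable_name: shape} for the checkpoint at checkpoint_prefix.

    checkpoint_prefix is the path without the .index/.data suffix,
    e.g. ".../model.ckpt" (the exact path is an assumption).
    """
    # Imported lazily so the comparison helper below works without TensorFlow.
    import tensorflow as tf

    return {name: list(shape)
            for name, shape in tf.train.list_variables(checkpoint_prefix)}


def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for variables that
    exist in both dicts but whose shapes differ."""
    mismatches = {}
    for name, model_shape in model_shapes.items():
        ckpt_shape = checkpoint_shapes.get(name)
        if ckpt_shape is not None and list(ckpt_shape) != list(model_shape):
            mismatches[name] = (list(ckpt_shape), list(model_shape))
    return mismatches


# Example using one of the shapes reported in the warnings above.
name = "FeatureExtractor/MobilenetV2/layer_19_2_Conv2d_2_3x3_s2_512/weights"
example = find_shape_mismatches({name: [1, 1, 256, 512]},
                                {name: [3, 3, 256, 512]})
print(example)
```

Every variable reported here is left at its random initialization, exactly as the warning says, so a non-empty result means the config does not describe the same architecture the checkpoint was trained with.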

In the end, I switched from the ssd_mobilenet_v2_coco.config file in the tensorflow/models git repo to the pipeline.config file included with the pre-trained checkpoint. Besides the normal settings that need changing, you also need to remove the batch_norm_trainable flag. More info on this bug is here: https://github.com/tensorflow/models/issues/4066
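The flag removal described above can be done by hand in a text editor, but a small script is handy in a Colab notebook. This is a rough sketch, not part of the Object Detection API: the function name strip_batch_norm_trainable is made up for illustration, and it assumes the field sits on its own line in pipeline.config (as in typical protobuf text files). The other usual edits (num_classes, fine_tune_checkpoint, input paths, label map) still have to be made separately.

```python
import re


def strip_batch_norm_trainable(config_text):
    """Return config_text without any line that sets batch_norm_trainable.

    Assumes the field is written on its own line, e.g.
    "    batch_norm_trainable: true".
    """
    kept = [line for line in config_text.splitlines()
            if not re.match(r"\s*batch_norm_trainable\s*:", line)]
    return "\n".join(kept) + "\n"


# Small inline example; the surrounding fields are illustrative only.
sample = (
    "feature_extractor {\n"
    "  type: \"ssd_mobilenet_v2\"\n"
    "  batch_norm_trainable: true\n"
    "  use_depthwise: true\n"
    "}\n"
)
print(strip_batch_norm_trainable(sample))
```

To apply it to a real file, read pipeline.config into a string, pass it through the function, and write the result to the path you give to --pipeline_config_path.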

Note: My first attempt was to switch to the quantized version of MobileNet V2 SSD, but I didn't get the accuracy that I hoped for after re-training the model with my data set (not sure why).