A few months ago I successfully trained a custom object detector on the Stanford Dogs dataset using efficientdet_d0_512x512, restricted to only 2 classes of dogs. Without changing the code, I tried doing it again and the model now outputs really low confidence scores (<1%), even though the loss during training was low.
I then tried resuming training from the checkpoint generated by the initial run, but the loss starts out high, as if the checkpoint did not exist.
I also tried Faster R-CNN and got the same results. Here's the code: https://colab.research.google.com/drive/1fE3TYRyRrvKI2sVSQOPUaOzA9JItkuFL?usp=sharing
My guess is that the export step is not working and the trained weights are not being saved. Any ideas?
It does seem that your checkpoint cannot be loaded, as indicated by the many warnings you're getting:
WARNING:tensorflow:Unresolved object in checkpoint:
(root). .......
After some research I found this issue on the TensorFlow Object Detection API models repository: https://github.com/tensorflow/models/issues/8892#issuecomment-680207038.
Basically, it says you should change:
fine_tune_checkpoint_type: "detection"
to :
fine_tune_checkpoint_type: "fine_tune"
so that the loaded checkpoint is used for fine-tuning and a mismatch in the number of classes between your configuration file and the one you're starting from doesn't cause issues. It also suggests checking whether your model_dir (where the custom checkpoints and TensorBoard events are saved) is empty, for the same reason, but it looks like you're fine on that point in your Colab notebook.
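For reference, a minimal sketch of what the relevant part of train_config could look like after that change (the checkpoint path is just a placeholder, keep your own value):

train_config {
  fine_tune_checkpoint: "pre-trained-models/efficientdet_d0_coco17_tpu-32/checkpoint/ckpt-0"  # placeholder path
  fine_tune_checkpoint_type: "fine_tune"  # instead of "detection"
}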
On a different note, be careful with your learning rate: right now you're using a cosine_decay_learning_rate, which requires some warm-up steps, 2500 in your case. However you're training for only 800 steps, so the warm-up isn't even completed when you stop the training! If for some reason you want to keep the number of steps low, you should switch to a manual_step_learning_rate or exponential_decay_learning_rate. Otherwise, you should keep training for much longer.
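To make that concrete, here is roughly what the two options look like in the pipeline config. The numbers below are illustrative (taken from the standard EfficientDet D0 zoo config, except warmup_steps: 2500 which is what you currently have), so keep your own values:

optimizer {
  momentum_optimizer {
    learning_rate {
      cosine_decay_learning_rate {
        learning_rate_base: 8e-2
        total_steps: 300000
        warmup_learning_rate: .001
        warmup_steps: 2500  # larger than your 800 training steps, so warm-up never finishes
      }
    }
    momentum_optimizer_value: 0.9
  }
}

or, if you want to keep the run short, something like:

learning_rate {
  manual_step_learning_rate {
    initial_learning_rate: 0.01  # illustrative values
    schedule {
      step: 400
      learning_rate: 0.001
    }
  }
}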
EDIT: After further investigation, the problem might run a bit deeper according to this issue on GitHub: https://github.com/tensorflow/models/issues/9229. You might want to keep an eye on it to see where it goes.