I am modifying the DeepLab network. I added a node to the MobileNetV3 feature extractor's first layer that reuses the existing variables. Since no extra parameters are needed, I should theoretically be able to load the old checkpoint.
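(For context, here is a rough sketch of what I mean by reusing the variables; the scope and variable names are hypothetical placeholders, not the real MobileNetV3 code:)

import tensorflow as tf  # TF 1.x

# Hypothetical example: the scope/variable names are placeholders.
with tf.variable_scope("FeatureExtractor/Conv"):
    weights = tf.get_variable("weights", shape=[3, 3, 3, 16])

def extra_node(inputs):
    # The added node fetches the already-created filter instead of creating
    # a new one, so no extra parameters appear in the checkpoint.
    with tf.variable_scope("FeatureExtractor/Conv", reuse=True):
        shared = tf.get_variable("weights")
    return tf.nn.conv2d(inputs, shared, strides=[1, 2, 2, 1], padding="SAME")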
Here is the situation I don't understand:
When I start training in a new, empty folder and load the checkpoint like this:
python "${WORK_DIR}"/train.py \
#--didn't change other parameters \
--train_logdir="${EXP_DIR}/train" \
--fine_tune_batch_norm=true \
--tf_initial_checkpoint="init/deeplab/model.ckpt"
I get this error:
ValueError: Total size of new array must be unchanged for MobilenetV3/Conv/BatchNorm/gamma lh_shape: [(16,)], rh_shape: [(480,)]
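To see where the 16-vs-480 mismatch comes from, a minimal TF 1.x sketch (using the checkpoint path from the command above; the modified graph must already be built) can compare the shapes stored in the checkpoint with the shapes of the variables in the graph:

import tensorflow as tf  # TF 1.x

# Variable name -> shape as stored in the pre-trained checkpoint.
ckpt_shapes = dict(tf.train.list_variables("init/deeplab/model.ckpt"))

# Variable name -> shape as defined in the (modified) graph.
graph_shapes = {v.op.name: v.shape.as_list() for v in tf.global_variables()}

# Print every variable whose shape differs, e.g.
# MobilenetV3/Conv/BatchNorm/gamma: checkpoint (16,) vs graph (480,).
for name, ckpt_shape in ckpt_shapes.items():
    if name in graph_shapes and list(ckpt_shape) != graph_shapes[name]:
        print(name, "checkpoint:", ckpt_shape, "graph:", graph_shapes[name])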
BUT, if I start training in a new, empty folder and do not load any checkpoint:
python "${WORK_DIR}"/train.py \
#--didn't change other parameters \
--train_logdir="${EXP_DIR}/train" \
--fine_tune_batch_norm=false \
#--tf_initial_checkpoint="init/deeplab/model.ckpt" #i.e. no checkpoint
the training starts without any problem.
What confuses me even more is that, if in that same folder (which has already served as the train_logdir for the run without a loaded checkpoint) I start training again with the checkpoint, the training also starts without an error:
# same command as in the first code block, other parameters unchanged
python "${WORK_DIR}"/train.py \
  --train_logdir="${EXP_DIR}/train" \
  --fine_tune_batch_norm=true \
  --tf_initial_checkpoint="init/deeplab/model.ckpt"
How can this happen? Does --train_logdir somehow store the shapes of the batch normalization parameters from the previous training run?
I found the following code in train_utils.py (line 203):
if tf.train.latest_checkpoint(train_logdir):
  tf.logging.info('Ignoring initialization; other checkpoint exists')
  return None

tf.logging.info('Initializing model from path: %s', tf_initial_checkpoint)
It first checks for an existing checkpoint in train_logdir and only falls back to the checkpoint given by the --tf_initial_checkpoint flag if none is found.
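So before starting a run, one can check up front which checkpoint will actually be picked up; a small sketch (the paths are just the expanded values of my flags and may need adjusting):

import tensorflow as tf  # TF 1.x

train_logdir = "EXP_DIR/train"              # value of --train_logdir (adjust)
initial_ckpt = "init/deeplab/model.ckpt"    # value of --tf_initial_checkpoint

latest = tf.train.latest_checkpoint(train_logdir)
if latest:
    # train_utils.py will ignore --tf_initial_checkpoint in this case.
    print("Training will resume from", latest)
else:
    print("Training will initialize from", initial_ckpt)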
So when I started the training the second time, the network loaded the variables saved by the first run, which have nothing to do with my pre-trained checkpoint.
My experiments also showed that starting the training twice like this does not give results as good as correctly loading the pre-trained checkpoint.
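For completeness: if the shape change in the first layer's batch-norm variables is actually expected, one way to still make use of the pre-trained weights would be to restore only the shape-compatible variables. A rough TF 1.x sketch (assuming the modified graph has already been built; this is not how the DeepLab code itself initializes the model):

import tensorflow as tf  # TF 1.x

initial_ckpt = "init/deeplab/model.ckpt"

# Variable name -> shape as stored in the checkpoint.
ckpt_shapes = dict(tf.train.list_variables(initial_ckpt))

# Keep only the graph variables that exist in the checkpoint with a matching shape.
compatible = [
    v for v in tf.global_variables()
    if v.op.name in ckpt_shapes
    and list(ckpt_shapes[v.op.name]) == v.shape.as_list()
]

restorer = tf.train.Saver(var_list=compatible)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # initialize everything first
    restorer.restore(sess, initial_ckpt)         # then overwrite the compatible variables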