What does the "Test of Epoch [number]" mean in Mozilla DeepSpeech?
In the following example it says Test of Epoch 77263, even though, to my understanding, there should be just 1 epoch, since I passed --display_step 1 --limit_train 1 --limit_dev 1 --limit_test 1 --early_stop False --epoch 1 as arguments:
dernoncourt@ilcomp:~/asr/DeepSpeech$ ./DeepSpeech.py --train_files data/common-voice-v1/cv-valid-train.csv,data/common-voice-v1/cv-other-train.csv --dev_files data/common-voice-v1/cv-valid-dev.csv --test_files data/common-voice-v1/cv-valid-test.csv --decoder_library_path /asr/DeepSpeech/libctc_decoder_with_kenlm.so --fulltrace True --display_step 1 --limit_train 1 --limit_dev 1 --limit_test 1 --early_stop False --epoch 1
W Parameter --validation_step needs to be >0 for early stopping to work
I Test of Epoch 77263 - WER: 1.000000, loss: 60.50202560424805, mean edit distance: 0.894737
I --------------------------------------------------------------------------------
I WER: 1.000000, loss: 58.900837, mean edit distance: 0.894737
I - src: "how do you like her"
I - res: "i "
I --------------------------------------------------------------------------------
I WER: 1.000000, loss: 60.517113, mean edit distance: 0.894737
I - src: "how do you like her"
I - res: "i "
I --------------------------------------------------------------------------------
I WER: 1.000000, loss: 60.668221, mean edit distance: 0.894737
I - src: "how do you like her"
I - res: "i "
I --------------------------------------------------------------------------------
I WER: 1.000000, loss: 61.921925, mean edit distance: 0.894737
I - src: "how do you like her"
I - res: "i "
I --------------------------------------------------------------------------------
This is actually not a bug: the current epoch is computed from your current parameters and the global step count persisted in the checkpoint. Take a close look at this excerpt:
# Number of GPUs per worker - fixed for now by local reality or cluster setup
gpus_per_worker = len(available_devices)

# Number of batches processed per job per worker
batches_per_job = gpus_per_worker * max(1, FLAGS.iters_per_worker)

# Number of batches per global step
batches_per_step = gpus_per_worker * max(1, FLAGS.replicas_to_agg)

# Number of global steps per epoch - to be at least 1
steps_per_epoch = max(1, model_feeder.train.total_batches // batches_per_step)

# The start epoch of our training
self._epoch = step // steps_per_epoch
So what happens is that the set size used in your earlier training runs differs from your current (limited) set size; hence the strange epoch number.
Simplified example (ignoring batch size): if you once trained 5 epochs on a training set of 1000 samples, you accumulated 5000 "global steps" (persisted as a number in your snapshot). If you then change the command-line parameters to a set of size 1 (your --limit_* parameters), you will "suddenly" see epoch 5000 displayed, because 5000 global steps correspond to passing over a data set of size 1 5000 times, as sketched below.
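A minimal sketch of that arithmetic, assuming a single GPU and one batch per global step (the function and the numbers are illustrative, not taken from the DeepSpeech code or the log above):

def displayed_epoch(global_step, total_batches, batches_per_step=1):
    # Number of global steps per epoch - at least 1, as in the excerpt above
    steps_per_epoch = max(1, total_batches // batches_per_step)
    return global_step // steps_per_epoch

# Earlier training: 5 epochs over a 1000-sample set -> 5000 persisted global steps
global_step = 5 * 1000

# Same checkpoint, same step count, but the set is now limited to 1 sample:
print(displayed_epoch(global_step, total_batches=1000))  # 5    (original set size)
print(displayed_epoch(global_step, total_batches=1))     # 5000 (with --limit_* 1)

The same effect presumably explains the 77263 in the question: with --limit_train 1, steps_per_epoch drops to 1, so the displayed epoch simply equals whatever global step count was left in the checkpoint.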
Takeaway: use the --checkpoint_dir argument to avoid this kind of issue.
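For example, giving the limited run its own, empty checkpoint directory keeps the persisted global step count of a previous training from leaking into the epoch display (the directory path here is just a placeholder, the other flags mirror the command from the question):

./DeepSpeech.py --train_files data/common-voice-v1/cv-valid-train.csv --dev_files data/common-voice-v1/cv-valid-dev.csv --test_files data/common-voice-v1/cv-valid-test.csv --limit_train 1 --limit_dev 1 --limit_test 1 --epoch 1 --checkpoint_dir ~/asr/checkpoints/limited-run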