When I trained an SSD object detection model for 20K steps using the TensorFlow Object Detection API, I found that the training speed varied:
Training was fast for the first 10 minutes, completing around 500 steps (i.e. 0.83 steps/second). Then it slowed down, taking about 40 to 50 minutes to perform a single training step, evaluate the model on the evaluation dataset, and save a checkpoint to disk. So I interrupted the training after a few steps and resumed it by restoring from the checkpoint.
Every time, training was fast for the first 10 minutes and then slowed down sharply, as the figures show.
The training is implemented with TensorFlow's Estimator API, tf.estimator.train_and_evaluate().
Can anyone explain how it works? How does the Estimator control the training and evaluation periods? I do not want to evaluate the model after every step!
If you look at EvalSpec and TrainSpec, there is an argument throttle_secs, which is responsible for deciding when evaluation is run. Refer to this heated discussion, which has many details about Estimator methods! Controlling this argument is the way to control the train/eval cycle. Note that its default is 600 seconds, i.e. 10 minutes, which matches the pattern you describe.

In general, train_and_evaluate works by alternating between the training and evaluation operations. The training graph is created only once, but the evaluation graph is recreated every time an evaluation is due. That means each evaluation reloads the checkpoint that was just saved during training, which may be one reason this is taking so long. The InMemoryEvaluatorHook mentioned in that discussion may help you out, since it does not reload the checkpoint every time evaluation is called.
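For example, here is a minimal sketch of a throttled train_and_evaluate setup. The toy model_fn and input_fn are placeholders standing in for whatever the Object Detection API builds for you, and the timing values are just illustrative:

    import tensorflow as tf

    def model_fn(features, labels, mode):
        # Toy linear regressor standing in for the real SSD model_fn.
        preds = tf.layers.dense(features['x'], 1)
        loss = tf.losses.mean_squared_error(labels, preds)
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    def input_fn():
        ds = tf.data.Dataset.from_tensor_slices(
            ({'x': [[1.0], [2.0]]}, [[2.0], [4.0]]))
        return ds.repeat().batch(2)

    estimator = tf.estimator.Estimator(model_fn=model_fn)

    train_spec = tf.estimator.TrainSpec(input_fn=input_fn, max_steps=20000)
    eval_spec = tf.estimator.EvalSpec(
        input_fn=input_fn,
        steps=100,             # eval batches per evaluation run
        start_delay_secs=600,  # wait 10 minutes before the first evaluation
        throttle_secs=3600)    # then evaluate at most once per hour

    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

With throttle_secs=3600, an evaluation can start no sooner than an hour after the previous one finished, regardless of how many checkpoints were written in between.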
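And if the checkpoint reload is the bottleneck, something along these lines could replace the evaluation half entirely, reusing the estimator and input_fn from the sketch above. In TF 1.x the hook lives in tf.contrib.estimator, and every_n_iter here is just an illustrative value:

    # Evaluate in-process, sharing the training session's variables instead
    # of rebuilding the eval graph and reloading the checkpoint each time.
    evaluator = tf.contrib.estimator.InMemoryEvaluatorHook(
        estimator, input_fn, steps=100, every_n_iter=1000)

    estimator.train(input_fn, max_steps=20000, hooks=[evaluator])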