python tensorflow google-cloud-platform google-cloud-ml google-cloud-ml-engine

Google Cloud ML Engine "Skipping evaluation due to same checkpoint"

So I have an ML Engine package based on the census tutorial, and I am trying to run evaluation every N steps using the --min-eval-frequency flag. However, I keep getting this message in the Stackdriver logs: "Skipping evaluation due to same checkpoint...". As a result, evaluation only happens once per epoch (I guess because the checkpoint eventually changes by then). Are additional changes needed to update the checkpoints more frequently? Any idea why this wouldn't evaluate more frequently?

Solution

Checkpoints are written at a fixed frequency. If no new checkpoint has been written by the time the next evaluation is scheduled, you'll get the message "Skipping evaluation due to same checkpoint...". This is because evaluation needs to work off of frozen weights in a separate tf.Session, so that the weights don't change during evaluation, and the only way to communicate those weights between sessions is through a checkpoint. Evaluating the same checkpoint twice would just produce the same result, so it is skipped. So if you want to evaluate more often and you are getting that message, increase your checkpoint frequency. You can do this by adding a flag that populates tf.contrib.learn.RunConfig#save_checkpoints_steps.
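As a rough sketch of what such a flag could look like (not the census sample's actual code): the flag name --save-checkpoints-steps, both default values, and the helper names parse_args, evals_may_be_skipped and build_run_config are made up for illustration. I'm also assuming that the step-based and time-based checkpoint settings shouldn't both be set, so the time-based one is cleared explicitly.

```python
import argparse


def parse_args(argv=None):
    """Parse the evaluation and checkpoint flags."""
    parser = argparse.ArgumentParser()
    # Default value here is an assumption, not taken from the tutorial.
    parser.add_argument(
        '--min-eval-frequency', type=int, default=100,
        help='Attempt an evaluation every this many training steps.')
    # Hypothetical flag: name and default are assumptions for illustration.
    parser.add_argument(
        '--save-checkpoints-steps', type=int, default=100,
        help='Write a checkpoint every this many training steps.')
    # parse_known_args ignores any other flags the trainer receives.
    args, _ = parser.parse_known_args(argv)
    return args


def evals_may_be_skipped(args):
    """Rough check: if evaluation is attempted more often than checkpoints
    are written, some attempts will find no new checkpoint and be skipped."""
    return args.min_eval_frequency < args.save_checkpoints_steps


def build_run_config(args):
    """Build a RunConfig that checkpoints every N steps (needs TensorFlow 1.x)."""
    # Imported lazily because tf.contrib only exists in TensorFlow 1.x.
    import tensorflow as tf
    return tf.contrib.learn.RunConfig(
        save_checkpoints_steps=args.save_checkpoints_steps,
        # Assumption: clear the time-based setting so that only the
        # step-based one is in effect.
        save_checkpoints_secs=None)
```

The RunConfig returned by build_run_config would then replace whatever config your trainer currently builds. Setting the checkpoint interval no larger than the --min-eval-frequency value should give each scheduled evaluation a fresh checkpoint to work from.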