python tensorflow google-cloud-platform google-cloud-ml google-cloud-ml-engine

Google Cloud ML Engine "Skipping evaluation due to same checkpoint"


I have an ML Engine package based on the census tutorial, and I am trying to run evaluation every N steps using the --min-eval-frequency flag. However, I keep getting this message in the Stackdriver logs: "Skipping evaluation due to same checkpoint...". As a result, evaluation only happens once per epoch (I assume because that is when the checkpoint finally changes). Are additional changes needed to update the checkpoints more frequently? Any idea why this won't evaluate more frequently?


Solution

  • Checkpoints are written at a fixed frequency. If no new checkpoint has been written by the time the next evaluation is scheduled, you get the message "Skipping evaluation due to same checkpoint...". Evaluation has to run against frozen weights in a separate tf.Session so that the weights cannot change mid-evaluation, and the only way to pass those weights between sessions is through a checkpoint. So if you want to evaluate more often and you are seeing that message, increase your checkpoint frequency. You can do this by adding a flag that populates tf.contrib.learn.RunConfig#save_checkpoints_steps.
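
A minimal sketch of what such a flag could look like, assuming the package parses its arguments with argparse and runs on TensorFlow 1.x (where tf.contrib is available). The flag name --save-checkpoints-steps and the helper names parse_args and make_run_config are hypothetical, not part of the census sample; adapt them to wherever your package already builds its RunConfig.

```python
import argparse


def parse_args(argv=None):
    """Parse trainer flags. --save-checkpoints-steps is a hypothetical flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--save-checkpoints-steps",
        type=int,
        default=None,  # None leaves TensorFlow's default checkpoint schedule in place
        help="Write a checkpoint every N training steps.")
    return parser.parse_args(argv)


def make_run_config(args):
    """Build a RunConfig that checkpoints every args.save_checkpoints_steps steps."""
    # Imported lazily: tf.contrib only exists in TensorFlow 1.x.
    import tensorflow as tf
    return tf.contrib.learn.RunConfig(
        save_checkpoints_steps=args.save_checkpoints_steps)


# Usage (assumption: your trainer accepts a RunConfig, e.g. via learn_runner.run):
#   args = parse_args()
#   run_config = make_run_config(args)
```

If evaluation is attempted every --min-eval-frequency steps, choosing a --save-checkpoints-steps value no larger than that should mean each scheduled evaluation finds a fresh checkpoint, at the cost of more frequent checkpoint writes.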