I was running a convnet model using Keras with a TensorFlow backend on Google Cloud, using the TensorBoard callback to save a tfevents log of the training history. While monitoring the learning curve I noticed that, halfway through the training (while the learning curve was on a plateau), a new tfevents log was saved to disk, and TensorBoard's learning curve graph showed the training restarting from epoch #1 with val_loss reset as well.
This is really weird. Does anyone know what is going on here? Under what circumstances would Keras automatically restart the training and save a new tfevents log?
It turned out this issue only happened when I ran my code on Google Cloud, not on my local machine. The actual cause, as confirmed by Google engineers, was Google's cloud maintenance, not Keras. Google Compute Engine (GCE) instances would occasionally be shut down for maintenance without any warning or prior notification (and this behavior was not documented at the time of this answer). The maintenance shutdown caused the training instance to restart from scratch, which generated a new tfevents log and wiped out all previous progress.
The solution is to save checkpoints frequently, load the previous model if one exists, and resume training after the restart. Note that on GCE the checkpoints have to be saved to Google Cloud Storage (GCS), for example via a custom LambdaCallback in Keras; otherwise the checkpoints are lost along with the instance when it shuts down.
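As a rough illustration of that pattern, here is a minimal sketch assuming tf.keras (TF 2.x) and a hypothetical GCS path (`gs://my-bucket/...` is a placeholder, as are the toy model and data). It saves a local checkpoint every epoch, mirrors it to GCS with a LambdaCallback, and reloads the GCS copy at startup if one exists, so training can resume after an unexpected instance restart.

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras

LOCAL_CKPT = "checkpoint.h5"
GCS_CKPT = "gs://my-bucket/training/checkpoint.h5"  # hypothetical bucket/path

# Toy data standing in for the real training set.
x_train = np.random.rand(256, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(256, 1)).astype("float32")

# Resume from the last checkpoint on GCS if the instance was restarted mid-training.
if tf.io.gfile.exists(GCS_CKPT):
    tf.io.gfile.copy(GCS_CKPT, LOCAL_CKPT, overwrite=True)
    model = keras.models.load_model(LOCAL_CKPT)
else:
    model = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(20,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

# Save a local checkpoint every epoch, then mirror it to GCS so it survives a shutdown.
# Callbacks run in list order, so the checkpoint is written before it is uploaded.
checkpoint_cb = keras.callbacks.ModelCheckpoint(LOCAL_CKPT, save_freq="epoch")
upload_cb = keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: tf.io.gfile.copy(LOCAL_CKPT, GCS_CKPT, overwrite=True)
)

model.fit(x_train, y_train, epochs=10,
          callbacks=[checkpoint_cb, upload_cb])
```

One detail this sketch glosses over: after a restart, `fit` will count epochs from 1 again unless you also persist the last completed epoch and pass it back via `initial_epoch`, which is worth adding if you care about consistent epoch numbering in TensorBoard.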