tensorflow google-cloud-ml tensorflow-estimator

Error when storing checkpoints to Google Cloud bucket


I am running a TensorFlow model on Google Cloud ML Engine, and the checkpoint saver fails to save files to the bucket. I am using TensorFlow 1.4 and tf.estimator.Estimator, trained with tf.estimator.train_and_evaluate.

These are the log records, where gs://e-trial-central1/models/1530351907.8359423 is the model_dir argument passed to the estimator:

E  master-replica-0 Couldn't match files for checkpoint gs://e-trial-central1/models/1530351907.8359423/. 
I  master-replica-0 Create CheckpointSaverHook.  
I  master-replica-0 Restoring parameters from gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 
W  master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/. 

Things I've already tried, following suggestions from other posts:

  1. Saving to a regional bucket (us-central1) instead of a multi-regional one. This results in the same error.
  2. Using a simpler path, which doesn't include the '.' in the folder name. This results in the same error.
  3. Saving to a local path, rather than the bucket. This works! But I want the files on the bucket eventually.

In contrast to other posts, what's a bit weird here is that the checkpoint path itself looks corrupted: there is a '.' after the model dir instead of the usual TensorFlow pattern (model.ckpt). Also, when I look in the model dir in the bucket after the failure, there are actually files there (the TF events file, and the .index, .meta and .data... files), but the checkpoint files are not there.
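To make that observation concrete, here is a minimal sketch of how a listing of the model dir could be checked. It assumes the usual TF 1.x naming (a `checkpoint` state file plus `<prefix>.index`, `<prefix>.meta` and `<prefix>.data-NNNNN-of-NNNNN` parts) and that "the checkpoint files" above means the `checkpoint` state file. The function name and the sample listing are hypothetical, not taken from the actual bucket.

```python
import re

# Part files TF 1.x writes for one checkpoint prefix, e.g.
# model.ckpt-0.index, model.ckpt-0.meta, model.ckpt-0.data-00000-of-00001.
_PART = re.compile(r"^(?P<prefix>.+)\.(?P<kind>index|meta|data-\d{5}-of-\d{5})$")


def summarize_model_dir(filenames):
    """Group checkpoint part files by prefix and report whether the
    'checkpoint' state file is present in the listing."""
    prefixes = {}
    for name in filenames:
        match = _PART.match(name)
        if match:
            # 'data-00000-of-00001' -> 'data'
            kind = match.group("kind").split("-")[0]
            prefixes.setdefault(match.group("prefix"), set()).add(kind)
    # A prefix is treated as restorable here if it has index and data parts.
    complete = sorted(p for p, kinds in prefixes.items()
                      if {"index", "data"} <= kinds)
    return {
        "has_state_file": "checkpoint" in filenames,
        "complete_prefixes": complete,
    }


# Hypothetical listing resembling the situation described above.
listing = [
    "events.out.tfevents.0000000000.host",
    "model.ckpt-0.index",
    "model.ckpt-0.meta",
    "model.ckpt-0.data-00000-of-00001",
]
report = summarize_model_dir(listing)
print(report)
```

If the part files are present but the state file is not, lookups that go through the state file (such as tf.train.latest_checkpoint) have nothing to resolve, which would be consistent with the "Couldn't match files" messages, though the logs alone don't confirm that this is the cause.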

Any ideas what would cause this, or what to try next?

Would appreciate any help!


Solution

  • This was solved by upgrading to a more recent version of TensorFlow (1.8).
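
Since the fix depends on the TensorFlow version, one option is to fail fast when a job starts on an older runtime. The following is only a sketch: the helper names are hypothetical, and the 1.8 minimum is simply the version reported to work above, not a documented threshold.

```python
def parse_major_minor(version):
    """Turn a version string such as '1.8.0' into (1, 8)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def check_min_version(version, minimum=(1, 8)):
    """Raise if `version` is older than `minimum`; otherwise return True."""
    if parse_major_minor(version) < minimum:
        raise RuntimeError(
            "TensorFlow %s is older than %d.%d; checkpoint saving to the "
            "bucket failed on 1.4 in this report" % ((version,) + minimum)
        )
    return True


# In the trainer this could be called as check_min_version(tf.__version__)
# before building the estimator.
print(check_min_version("1.8.0"))
```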