I run a TensorFlow model with ML Engine on Google Cloud, and the checkpoint saver fails to save files to the bucket. I am using TensorFlow 1.4 and tf.estimator.Estimator with the method tf.estimator.train_and_evaluate.
These are the log records, where gs://e-trial-central1/models/1530351907.8359423 is the model_dir argument given to the estimator:
E master-replica-0 Couldn't match files for checkpoint gs://e-trial-central1/models/1530351907.8359423/.
I master-replica-0 Create CheckpointSaverHook.
I master-replica-0 Restoring parameters from gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
W master-replica-0 Not found: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://e-trial-central1/models/1530351907.8359423/.
Things I've already tried, following suggestions from other posts (here and here):
In contrast to other posts, what's a bit weird here is that the checkpoint path itself looks corrupted: there is a '.' after the model dir instead of the TensorFlow checkpoint pattern (model.ckpt).
Also, when I look in the model dir in the bucket after the failure, there are actually files there: the TF events file and the .index, .meta and .data... files, but the checkpoint files are not there.
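To make that observation concrete, here is a minimal sketch of how a directory listing could be checked for this state. It assumes that "checkpoint files" refers to the small state file named checkpoint that TensorFlow writes next to the model.ckpt-N.* files and that tf.train.latest_checkpoint reads. The function name summarize_model_dir and the example filenames are hypothetical, not taken from the actual bucket.

```python
import re

# Per-checkpoint files as TensorFlow 1.x names them, e.g.
# "model.ckpt-1000.index" or "model.ckpt-1000.data-00000-of-00001".
_CKPT_FILE = re.compile(r"^(model\.ckpt-\d+)\.(index|meta|data-\d+-of-\d+)$")


def summarize_model_dir(filenames):
    """Report whether the 'checkpoint' state file exists and which
    checkpoint prefixes have both an index file and data shards.

    `filenames` is a plain list of base names from the model dir
    (hypothetical input; how it is obtained is up to the caller).
    """
    has_state_file = "checkpoint" in filenames
    kinds_by_prefix = {}
    for name in filenames:
        match = _CKPT_FILE.match(name)
        if match:
            # "data-00000-of-00001" -> "data"; "index"/"meta" stay as-is.
            kind = match.group(2).split("-")[0]
            kinds_by_prefix.setdefault(match.group(1), set()).add(kind)
    complete = sorted(
        prefix
        for prefix, kinds in kinds_by_prefix.items()
        if {"index", "data"} <= kinds
    )
    return has_state_file, complete


# Hypothetical listing resembling the situation described above.
listing = [
    "events.out.tfevents.123.host",
    "graph.pbtxt",
    "model.ckpt-0.index",
    "model.ckpt-0.meta",
    "model.ckpt-0.data-00000-of-00001",
]
has_state_file, complete = summarize_model_dir(listing)
print(has_state_file, complete)
```

In TensorFlow 1.x the listing could be obtained with tf.gfile.ListDirectory(model_dir), which also works for gs:// paths. A result with complete prefixes but no state file would match what is described here: the variable files were written, but nothing records which one is the latest checkpoint.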
Any ideas what would cause this, or what to try next?
I would appreciate any help!
This was solved by moving to a more recent version of TensorFlow (1.8).
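If the training job can end up on an older runtime by accident, a small guard at the top of the trainer can make that visible early. This is only a sketch: the helper names parse_version and is_at_least are hypothetical, and the (1, 8) threshold is simply the version that worked in this case, not a documented minimum.

```python
import re


def parse_version(version):
    """Return (major, minor) as ints from a string such as '1.8.0' or '1.8.0-rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return tuple(int(part) for part in match.groups())


def is_at_least(version, minimum=(1, 8)):
    """Compare numerically, so that '1.10.0' counts as newer than '1.8.0'."""
    return parse_version(version) >= minimum


print(is_at_least("1.4.0"), is_at_least("1.8.0"))
```

In the trainer this could be called as is_at_least(tf.__version__) after importing tensorflow, logging a warning or raising an error when it returns False.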