
Tensorflow checkpoints saving data file


Good day everyone,

I am using TensorFlow for a machine learning problem and have trouble understanding checkpoints. Saving a checkpoint produces a meta, an index and a data file. But what do the numbers at the end of the data file name mean, for example model.ckpt.data-00000-of-00001? Why is it always 00000-of-00001?


Solution

  • A tf.train.Saver takes a parameter sharded when it is instantiated, and it defaults to False.

    sharded: If True, shard the checkpoints, one per device.

    When you call save(), the documentation describes the return value as follows:

    Returns: A string: path prefix used for the checkpoint files. If the saver is sharded, this string ends with: '-?????-of-nnnnn' where 'nnnnn' is the number of shards created. If the saver is empty, returns None.

    So if you set sharded=True and train on several devices (for example a cluster of GPUs, or simply a local machine where part of your model is on the CPU and another part on the GPU), you'll get data-00000-of-00002 and data-00001-of-00002. With the default sharded=False everything goes into a single shard, which is why you always see 00000-of-00001.
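To make the suffix concrete, here is a minimal sketch in plain Python that only reproduces the naming pattern described above. It does not call TensorFlow, and shard_filenames is a hypothetical helper written for illustration, not part of any TensorFlow API. It assumes the data files are named with the checkpoint prefix followed by .data-<shard index>-of-<shard count>, both zero-padded to five digits, as in the file names from the question.

```python
from typing import List


def shard_filenames(prefix: str, num_shards: int) -> List[str]:
    """Hypothetical helper: build data file names in the pattern
    <prefix>.data-<index>-of-<count>, zero-padded to five digits."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    return [
        f"{prefix}.data-{index:05d}-of-{num_shards:05d}"
        for index in range(num_shards)
    ]


# One shard: the case you see with the default sharded=False.
print(shard_filenames("model.ckpt", 1))
# ['model.ckpt.data-00000-of-00001']

# Two shards: the case described above for a model split over two devices.
print(shard_filenames("model.ckpt", 2))
# ['model.ckpt.data-00000-of-00002', 'model.ckpt.data-00001-of-00002']
```

The first number is the zero-based index of the shard and the second is the total number of shards, so a single-shard checkpoint can only ever be 00000-of-00001.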