When saving a TensorFlow checkpoint, besides the .index, .meta and checkpoint files, two ".data" files are saved at the same time: .data-00000-of-00002 and .data-00001-of-00002. The former is much smaller than the latter. My question is: why are two data files saved, and what are the differences between them?
According to tensorflow official page:
One or more shards (<prefix>-<global_step>.data-<shard_index>-of-<number_of_shards>) contain the model's weights, and the index file records which weights are stored in which shard. The number of shards depends on how many machines you are using for training.
Therefore, if you train a model on two machines, you'll have two shards with the suffixes .data-00000-of-00002 and .data-00001-of-00002.
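To make the naming scheme concrete, here is a minimal sketch that splits a shard file name into its prefix, shard index and shard count. The helper name `parse_shard_name`, the `ShardName` type and the example file name are hypothetical and exist only for illustration; the pattern is inferred from the suffixes shown above and is not taken from TensorFlow's own code.

```python
import re
from typing import NamedTuple, Optional


class ShardName(NamedTuple):
    """Parts of a checkpoint shard file name (hypothetical helper type)."""
    prefix: str        # e.g. "model.ckpt-1000"
    shard_index: int   # zero-based index of this shard
    num_shards: int    # total number of shards in the checkpoint


# Assumed pattern: <prefix>.data-<shard_index>-of-<number_of_shards>
_SHARD_RE = re.compile(r"^(?P<prefix>.+)\.data-(?P<index>\d+)-of-(?P<total>\d+)$")


def parse_shard_name(filename: str) -> Optional[ShardName]:
    """Return the parsed parts, or None if the name is not a .data shard."""
    match = _SHARD_RE.match(filename)
    if match is None:
        return None
    return ShardName(
        prefix=match.group("prefix"),
        shard_index=int(match.group("index")),
        num_shards=int(match.group("total")),
    )


if __name__ == "__main__":
    # Example file name made up for illustration.
    print(parse_shard_name("model.ckpt-1000.data-00001-of-00002"))
    print(parse_shard_name("model.ckpt-1000.index"))
```

Files such as .index or .meta do not match the pattern and yield None, so the helper can be used to pick out only the shard files when listing a checkpoint directory.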
You may want to check out this url too.
When instantiating tf.train.Saver, you can set the sharded argument (the default value is False). sharded=True instructs the Saver to shard checkpoints for each machine/device.