TensorFlow model: what is the difference between .data-00000-of-00002 and .data-00001-of-00002?


When saving a TensorFlow checkpoint, besides the .index, .meta and checkpoint files, two ".data" files are saved at the same time: .data-00000-of-00002 and .data-00001-of-00002. The former is much smaller than the latter. My question is: why are two data files saved, and what are the differences between them?


Solution

  • According to the official TensorFlow documentation: one or more shards (<prefix>-<global_step>.data-<shard_index>-of-<number_of_shards>) contain the model's weights, and the index file records which weights are stored in which shard. The number of shards depends on how many machines you are using for training.

    Therefore, if you train a model on two machines, you'll have two shards, with the suffixes .data-00000-of-00002 and .data-00001-of-00002.

    You may want to check out this url too.

    When instantiating tf.train.Saver, you can set the sharded argument (its default value is False). sharded=True instructs the Saver to shard checkpoints, one for each machine/device.
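
To make the naming pattern quoted above concrete, here is a minimal sketch in plain Python (no TensorFlow required) that splits a shard file name into its prefix, shard index and number of shards, and reports which shards of a checkpoint are missing from a file listing. The helpers parse_shard and missing_shards are hypothetical names made up for this illustration; they are not part of the TensorFlow API.

```python
import re
from typing import Dict, Iterable, List, NamedTuple, Optional

# Matches names such as "model.ckpt-1000.data-00000-of-00002".
# Assumption: the index and count are plain decimal digits, as in the
# file names from the question.
_SHARD_RE = re.compile(r"^(?P<prefix>.+)\.data-(?P<index>\d+)-of-(?P<total>\d+)$")


class Shard(NamedTuple):
    prefix: str  # e.g. "model.ckpt-1000"
    index: int   # zero-based shard index
    total: int   # number of shards the checkpoint was split into


def parse_shard(filename: str) -> Optional[Shard]:
    """Return the parts of a .data shard file name, or None for other files."""
    match = _SHARD_RE.match(filename)
    if match is None:
        return None
    return Shard(match.group("prefix"), int(match.group("index")), int(match.group("total")))


def missing_shards(filenames: Iterable[str]) -> Dict[str, List[int]]:
    """Map each checkpoint prefix to the sorted shard indices that are absent."""
    found: Dict[str, set] = {}
    totals: Dict[str, int] = {}
    for name in filenames:
        shard = parse_shard(name)
        if shard is None:
            continue  # .index, .meta and "checkpoint" files are ignored
        found.setdefault(shard.prefix, set()).add(shard.index)
        totals[shard.prefix] = shard.total
    return {p: sorted(set(range(totals[p])) - found[p]) for p in found}
```

For example, a directory that holds only model.ckpt-1000.data-00000-of-00002 next to its .index file would be reported as missing shard 1, which is a quick way to check that every shard was copied before restoring the checkpoint.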