
Running multiple models on distributed TensorFlow: training steps messed up


I'm trying to build a distributed TensorFlow framework template, but there are several problems that confuse me.

  1. When I use --sync_replas=True in the script, does it mean I'm using synchronous training as described in the docs?
  2. Why is the global step in worker_0.log and worker_1.log not successively incremented?
  3. Why does the global step not start at 0, but instead look like this:

1499169072.773628: Worker 0: training step 1 done (global step: 339)

  4. What's the relation between the training step and the global step?

  5. As you can see from the cluster-creation script, I created an independent cluster. Can I run multiple different models on this cluster at the same time?


Solution

    1. Probably, but it depends on the particular library you're using; a minimal synchronous-training sketch follows this list.
    2. During distributed training it's possible to have race conditions, so the increments and reads of the global step are not fully ordered. This is fine.
    3. This is probably because you're restoring from a checkpoint; see the checkpoint sketch below.
    4. Unclear; it depends on the library you're using.
    5. One model per cluster is much easier to manage. It's fine to create multiple TF clusters on the same set of machines, though; see the cluster-spec sketch below.
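
On point 1: if the script's --sync_replas flag corresponds to the usual sync_replicas option from the TensorFlow distributed examples, then synchronous training in graph-mode TensorFlow typically means wrapping the optimizer in tf.train.SyncReplicasOptimizer. A minimal sketch, assuming TF 1.x; NUM_WORKERS, IS_CHIEF, and the toy loss are placeholders for illustration only:

    import tensorflow as tf  # TF 1.x API, as in the original question

    # Hypothetical values; in a real job these would come from flags.
    NUM_WORKERS = 2
    IS_CHIEF = True  # e.g. FLAGS.task_index == 0

    # A toy variable and loss so the graph builds on its own.
    w = tf.Variable(0.0, name='w')
    loss = tf.square(w - 1.0)

    global_step = tf.train.get_or_create_global_step()

    opt = tf.train.GradientDescentOptimizer(0.1)
    # Wrapping the optimizer is what turns on synchronous training:
    # gradients from all replicas are aggregated before a single
    # global-step update.
    opt = tf.train.SyncReplicasOptimizer(
        opt,
        replicas_to_aggregate=NUM_WORKERS,
        total_num_replicas=NUM_WORKERS)

    train_op = opt.minimize(loss, global_step=global_step)

    # The hook sets up the synchronization queues; pass is_chief=True
    # only on the chief worker (usually worker 0).
    sync_hook = opt.make_session_run_hook(IS_CHIEF)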
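On point 3: a non-zero starting value is exactly what you see when training resumes from a checkpoint. tf.train.MonitoredTrainingSession restores the saved global step from checkpoint_dir, so the first local training step can report something like global step 339. A small self-contained illustration; the checkpoint path is hypothetical:

    import tensorflow as tf  # TF 1.x style, matching the question

    # A tiny graph whose only state is the global step.
    global_step = tf.train.get_or_create_global_step()
    increment = tf.assign_add(global_step, 1)

    # MonitoredTrainingSession restores any checkpoint found in
    # checkpoint_dir, so the global step resumes from its saved value
    # instead of 0 on the second and later runs.
    with tf.train.MonitoredTrainingSession(
            checkpoint_dir='/tmp/train_logs',   # hypothetical path
            save_checkpoint_secs=60) as sess:
        for _ in range(3):
            print('global step:', sess.run(increment))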
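On point 5: if you do want to reuse the same machines, one way is to give each model its own cluster by assigning different ports. A sketch with hypothetical host names and ports:

    import tensorflow as tf

    # Two independent cluster specs sharing the same hosts but different
    # ports, so each model runs in its own isolated cluster.
    cluster_model_a = tf.train.ClusterSpec({
        'ps':     ['machine1:2222'],
        'worker': ['machine1:2223', 'machine2:2223'],
    })
    cluster_model_b = tf.train.ClusterSpec({
        'ps':     ['machine1:3222'],
        'worker': ['machine1:3223', 'machine2:3223'],
    })

    # Each process then starts its own server against its own cluster, e.g.:
    # server = tf.train.Server(cluster_model_a, job_name='worker', task_index=0)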