Tags: tensorflow, distributed, imagenet

Distributed TensorFlow parameter server and workers


I was closely following the distributed ImageNet (Inception) training example for TensorFlow.

I am not able to understand how the data gets distributed when this example is run on 2 different workers. In theory, different workers should see different parts of the data. Also, what part of the code tells the parameters to be placed on the parameter server? In the multi-GPU example there is an explicit section pinning things to 'cpu:0'.


Solution

  • The different workers see different parts of the data by virtue of dequeuing mini-batches from a single queue of preprocessed images. To elaborate, in the distributed setup for training the ImageNet (Inception) model, the input images are preprocessed by multiple threads and the preprocessed images are stored in a single RandomShuffleQueue. You can look for tf.RandomShuffleQueue in this file to see how this is done. The workers are organized as 'Inception towers', and each tower dequeues its own mini-batch from that same queue, so each tower sees a different part of the input (see the first sketch below).
  • The picture here answers the second part of your question. Look for slim.variables.VariableDeviceChooser in this file: the logic there makes sure that Variable objects are assigned evenly across the workers that act as parameter servers (see the second sketch below). All the other workers, the ones doing the actual training, fetch the variables at the beginning of a step and push their updates back at the end of the step.
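For the first point, here is a minimal TF 1.x-style sketch of the shared-queue idea. The shapes, capacities, and the 'preprocessed_images' name are illustrative assumptions, not the actual Inception code:

    import tensorflow as tf

    BATCH_SIZE = 32
    IMAGE_SHAPE = [224, 224, 3]

    # A single queue of preprocessed images, filled by several preprocessing threads.
    images_queue = tf.RandomShuffleQueue(
        capacity=1000,
        min_after_dequeue=200,
        dtypes=[tf.float32],
        shapes=[IMAGE_SHAPE],
        name='preprocessed_images')

    # Each preprocessing thread repeatedly enqueues one preprocessed image.
    image = tf.placeholder(tf.float32, shape=IMAGE_SHAPE)
    enqueue_op = images_queue.enqueue([image])

    # Every Inception tower dequeues its own mini-batch from the same queue,
    # so different towers end up training on different (randomly shuffled)
    # parts of the input.
    tower_batch = images_queue.dequeue_many(BATCH_SIZE)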
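For the second point, a rough sketch of the placement logic: a callable handed to tf.device() that sends Variable ops round-robin to the ps tasks and leaves everything else on the worker. The class name, device strings, and shapes are illustrative assumptions, not the actual slim.variables.VariableDeviceChooser implementation:

    import tensorflow as tf

    class RoundRobinVariableChooser(object):
        """Assigns Variable ops evenly (round-robin) across ps tasks."""

        def __init__(self, num_ps_tasks):
            self._num_ps_tasks = num_ps_tasks
            self._next_task = 0

        def __call__(self, op):
            # Variables (both ref and resource flavours) go to a ps task...
            if op.type in ('Variable', 'VariableV2', 'VarHandleOp'):
                device = '/job:ps/task:%d/cpu:0' % self._next_task
                self._next_task = (self._next_task + 1) % self._num_ps_tasks
                return device
            # ...while the actual training computation stays on the worker.
            return '/job:worker/task:0/cpu:0'

    chooser = RoundRobinVariableChooser(num_ps_tasks=2)
    with tf.device(chooser):
        weights = tf.get_variable('weights', shape=[1024, 1000])  # lands on a ps task
        logits = tf.matmul(tf.zeros([32, 1024]), weights)         # runs on the worker

TensorFlow also ships tf.train.replica_device_setter, which implements essentially the same round-robin variable placement for the common case.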