
TensorFlow and running distributed training with Torque


I have written a neural network following the TensorFlow guide on distributed training: https://www.tensorflow.org/deploy/distributed

If the cluster I would like to run the training on uses Torque for job scheduling, how does that fit in with the way TensorFlow distributes training over the cluster?

Do I start the training on one node through Torque and let TensorFlow distribute it from there, or would that clash with how Torque works? Is Torque even necessary if TensorFlow can handle the distribution itself? How do I avoid clashes between the two?

Thanks in advance.


Solution

  • Torque and distributed TensorFlow are responsible for different tasks that are not directly related to each other. Torque distributes the resources of a cluster among multiple jobs; within one job, only the resources requested for that job are available. Distributed TensorFlow parallelizes a TensorFlow task across the resources available within one job.

    Normally you would use Torque to request all the resources needed for the TensorFlow task, and then use distributed TensorFlow to distribute that task over the resources Torque provided.

    If tf.train.ClusterSpec is initialized correctly with the resources made available by Torque, there will be no conflicts (see the sketch below).
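
    The sketch below shows one minimal way to wire the two together, assuming a TF 1.x setup as in the linked guide. It reads the host list that Torque writes to the PBS_NODEFILE environment variable (standard Torque behaviour) and builds a tf.train.ClusterSpec from it; the port number and the split into one parameter server plus workers are illustrative choices, not requirements.

    ```python
    # Sketch: derive a tf.train.ClusterSpec from the nodes Torque allocated.
    import os
    import socket

    import tensorflow as tf

    # PBS_NODEFILE lists one line per allocated slot; deduplicate to get
    # the distinct hosts Torque assigned to this job.
    with open(os.environ["PBS_NODEFILE"]) as f:
        hosts = sorted(set(line.strip() for line in f if line.strip()))

    port = 2222  # assumed to be free on every node
    ps_hosts = ["%s:%d" % (hosts[0], port)]                   # first node: parameter server
    worker_hosts = ["%s:%d" % (h, port) for h in hosts[1:]]   # remaining nodes: workers

    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

    # Each process infers its own role from its hostname. This assumes the
    # names in PBS_NODEFILE match what socket.gethostname() returns.
    me = "%s:%d" % (socket.gethostname(), port)
    if me in ps_hosts:
        server = tf.train.Server(cluster, job_name="ps", task_index=0)
        server.join()  # parameter servers block and serve variables
    else:
        server = tf.train.Server(cluster, job_name="worker",
                                 task_index=worker_hosts.index(me))
        # Build the model here with tf.device(tf.train.replica_device_setter(
        # cluster=cluster)) and run training sessions against server.target.
    ```

    You would launch one copy of this script per allocated node from within the Torque job (for example with pbsdsh or mpirun), so every process reads the same node file and derives its own role from its hostname.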