Tags: tensorflow, deep-learning, distributed-computing, tensorflow2.x

Can I use TensorFlow distributed training with heterogeneous machines?


I have two machines: machine 1 has GPUs and machine 2 only has a CPU. I want to know whether the two machines can take part in multi-worker training in TensorFlow, that is, whether during distributed training machine 1 can use its GPUs while machine 2 uses its CPU.

The TensorFlow version is 2.1.0.
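
For reference, here is a minimal sketch of the multi-worker setup I have in mind. The host addresses and the model are placeholders, and TF_CONFIG has to be set on each machine with that machine's own task index:

    import json
    import os

    import tensorflow as tf

    # Placeholder cluster spec: replace the hosts and ports with the real machines.
    os.environ['TF_CONFIG'] = json.dumps({
        'cluster': {'worker': ['machine1:12345', 'machine2:12345']},
        'task': {'type': 'worker', 'index': 0},  # use index 1 on machine 2
    })

    # The strategy has to be created before the model and its variables.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer='adam',
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=['accuracy'],
        )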


Solution

  • The answer is no. When I ran distributed deep learning following this tutorial:

    https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras

    the following error occurred:

    tensorflow.python.framework.errors_impl.InternalError: Collective Op CollectiveBcastSend: Broadcast(1) is assigned to device /job:worker/replica:0/task:0/device:GPU:0 with type GPU and group_key 1 but that group has type CPU [Op:CollectiveBcastSend]

    This happens because, as the error message indicates, the collective ops require every worker in the group to use the same device type. After I forced machine 1 to use the CPU as well:

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # hide the GPUs from TensorFlow
    

    training ran successfully using the CPUs of both machines; see the sketch of the worker script below.
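
    For completeness, a minimal sketch of how the top of the worker script on machine 1 looks after the fix. The key point is that the environment variable is set before TensorFlow is imported, so that no GPU devices are registered on this worker:

    import os

    # Hide the GPUs from this worker *before* importing TensorFlow, so both
    # workers register only CPU devices and the collective group types match.
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

    import tensorflow as tf

    # Sanity check: this should print an empty list on this worker.
    print(tf.config.experimental.list_physical_devices('GPU'))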