Tags: tensorflow, deep-learning, distributed-computing, tensorflow2.x

Can I use TensorFlow distributed training with heterogeneous machines?


I have two machines: machine 1 has GPUs and machine 2 only has a CPU. I want to know whether the two machines can take part in multi-worker training in TensorFlow, that is, whether during distributed training machine 1 can use its GPUs while machine 2 uses its CPU.

The TensorFlow version is 2.1.0.
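
For reference, here is a minimal sketch of the multi-worker setup I have in mind. The host addresses and the model are placeholders, and TF_CONFIG has to be set on each machine with that machine's own task index:

    import json
    import os

    import tensorflow as tf

    # Placeholder cluster spec: replace the hosts and ports with the real machines.
    os.environ['TF_CONFIG'] = json.dumps({
        'cluster': {'worker': ['machine1:12345', 'machine2:12345']},
        'task': {'type': 'worker', 'index': 0},  # use index 1 on machine 2
    })

    # The strategy has to be created before the model and its variables.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu', input_shape=(784,)),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer='adam',
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=['accuracy'],
        )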


Solution

  • The answer is no. When I ran distributed deep learning following this tutorial:

    https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras

    the following error occurred:

    tensorflow.python.framework.errors_impl.InternalError: Collective Op CollectiveBcastSend: Broadcast(1) is assigned to device /job:worker/replica:0/task:0/device:GPU:0 with type GPU and group_key 1 but that group has type CPU [Op:CollectiveBcastSend]

    This happens because, as the error message indicates, the collective ops require every worker in the group to use the same device type. After I forced machine 1 to use the CPU as well:

    import os
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # hide the GPUs from TensorFlow
    

    training ran successfully using the CPUs of both machines; see the sketch of the worker script below.
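
    For completeness, a minimal sketch of how the top of the worker script on machine 1 looks after the fix. The key point is that the environment variable is set before TensorFlow is imported, so that no GPU devices are registered on this worker:

    import os

    # Hide the GPUs from this worker *before* importing TensorFlow, so both
    # workers register only CPU devices and the collective group types match.
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

    import tensorflow as tf

    # Sanity check: this should print an empty list on this worker.
    print(tf.config.experimental.list_physical_devices('GPU'))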