I'm trying to learn distributed TensorFlow. I tried out a piece of code as explained here:
with tf.device("/cpu:0"):
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
with tf.device("/cpu:1"):
y = tf.nn.softmax(tf.matmul(x, W) + b)
loss = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
I get the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device for operation 'MatMul': Operation was explicitly assigned to /device:CPU:1 but available devices are [ /job:localhost/replica:0/task:0/cpu:0 ]. Make sure the device specification refers to a valid device.
     [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/device:CPU:1"](Placeholder, Variable/read)]]
So TensorFlow does not recognize CPU:1.
I'm running on a Red Hat server with 40 CPUs (cat /proc/cpuinfo | grep processor | wc -l returns 40).
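To confirm what TensorFlow actually sees, one can list the local devices (device_lib lives in tensorflow.python.client in TF 1.x); on this machine it prints only the single /cpu:0 device from the error message:

from tensorflow.python.client import device_lib

# Prints every device registered with TensorFlow; despite the 40 physical
# CPUs, only /job:localhost/replica:0/task:0/cpu:0 shows up by default.
print(device_lib.list_local_devices())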
Any ideas?
Following the link in the comments, it turns out the session has to be configured to expose more than one CPU device:
config = tf.ConfigProto(device_count={"CPU": 8})
with tf.Session(config=config) as sess:
    ...
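With that setting TensorFlow creates eight CPU devices (/cpu:0 through /cpu:7), all backed by the same physical cores, so the explicit placement succeeds. A minimal end-to-end sketch under TF 1.x, reusing the MNIST-style shapes from the question:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])
y_ = tf.placeholder(tf.float32, [None, 10])

with tf.device("/cpu:0"):
    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))

with tf.device("/cpu:1"):
    y = tf.nn.softmax(tf.matmul(x, W) + b)
    loss = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))

# Ask the session to create 8 CPU devices; without this, only /cpu:0 exists
# and the placement above raises InvalidArgumentError.
config = tf.ConfigProto(device_count={"CPU": 8})
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # Dummy batch, just to demonstrate that the graph now places and runs.
    feed = {x: np.zeros((4, 784), np.float32),
            y_: np.zeros((4, 10), np.float32)}
    print(sess.run(loss, feed_dict=feed))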
It's kind of shocking that I missed something so basic, and that no one could pinpoint an error that seems so obvious. I'm not sure whether the problem is with me or with the TensorFlow code samples and documentation. Since it's Google, I'll have to say it's me.