I have extensively studied other answers on TensorFlow and I just cannot seem to get it to use multiple cores on my CPU.
According to htop, the following program only uses a single CPU core:
import tensorflow as tf

n_cpus = 20

sess = tf.Session(config=tf.ConfigProto(
    device_count={ "CPU": n_cpus },
    inter_op_parallelism_threads=n_cpus,
    intra_op_parallelism_threads=1,
))

size = 100000

A = tf.ones([size, size], name="A")
B = tf.ones([size, size], name="B")
C = tf.ones([size, size], name="C")

with tf.device("/cpu:0"):
    x = tf.matmul(A, B)
with tf.device("/cpu:1"):
    y = tf.matmul(A, C)

sess.run([x, y])

# run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
# run_metadata = tf.RunMetadata()
# sess.run([x, y], options=run_options, run_metadata=run_metadata)

# for device in run_metadata.step_stats.dev_stats:
#     device_name = device.device
#     print(device.device)
#     for node in device.node_stats:
#         print(" ", node.node_name)
However, when I uncomment the lines at the bottom and change size so that the computation actually finishes in a reasonable amount of time, I see that TensorFlow seems to think it's using at least 2 CPU devices:
/job:localhost/replica:0/task:0/device:CPU:0
  _SOURCE
  MatMul
  _retval_MatMul_0_0
  _retval_MatMul_1_0_1
/job:localhost/replica:0/task:0/device:CPU:1
  _SOURCE
  MatMul_1
Fundamentally, what I want to do here is execute different ops on different cores in parallel. I don't want to split a single op over multiple cores, though I know that happens to work in this contrived example. Both device_count and inter_op_parallelism_threads sound like what I want, but neither seems to actually result in using multiple cores. I've tried all combinations I can think of, including setting one or the other to 1 in case they conflict with each other, and nothing seems to work.
I can also confirm with taskset that I'm not doing anything strange with my CPU affinity:
$ taskset -p $$
pid 21395's current affinity mask: ffffffffff
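As a quick sanity check on that mask, the hexadecimal value can be decoded into the list of CPUs the process is allowed to run on. The sketch below is only an illustration; the helper name cpus_in_affinity_mask is hypothetical and not part of taskset or TensorFlow.

```python
def cpus_in_affinity_mask(mask_hex):
    """Return the sorted CPU indices whose bits are set in a hex affinity mask."""
    mask = int(mask_hex, 16)
    # Bit i of the mask corresponds to CPU i.
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


# Mask reported by `taskset -p $$` above.
cpus = cpus_in_affinity_mask("ffffffffff")
print(len(cpus), "CPUs allowed:", cpus[0], "to", cpus[-1])
# prints: 40 CPUs allowed: 0 to 39
```

A mask of ten f digits has 40 bits set, so the shell (and any Python process started from it) may be scheduled on CPUs 0 through 39. On Linux, os.sched_getaffinity(0) should report the same set from inside Python.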
What exactly do I have to do to this code to get it to use multiple CPU cores?
Note:
- I have already tried both device_count and inter_op_parallelism_threads.
- I have also experimented with the tf.device calls, and it doesn't seem to make any difference to my CPU utilization.
- I'm using TensorFlow 1.10.0 installed from conda.
After some back and forth on the TensorFlow issue (linked below), we determined that the program was being "optimized" by a constant folding pass, because the inputs were all trivial. It turns out that this constant folding pass runs sequentially. Therefore, if you want to observe parallel execution, make the inputs non-trivial so that constant folding won't apply to them. The method suggested in the issue was to use tf.placeholder, and I have written an example program that makes use of this here:
https://gist.github.com/elliottslaughter/750a27c832782f4daec8686281027de8
See the original issue for sample output from the program: https://github.com/tensorflow/tensorflow/issues/22619
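For readers who just want the shape of the fix, here is a minimal sketch of the placeholder approach. It is not the contents of the gist above. It assumes NumPy is available, omits the device_count and tf.device settings from the question, and uses a hypothetical helper name (run_two_matmuls). The tf1 alias is also an assumption: on TensorFlow 1.10 it is simply tf, while newer releases expose the same names under tf.compat.v1.

```python
import numpy as np
import tensorflow as tf

# Assumption: use tf.compat.v1 when it exists (newer releases); on
# TensorFlow 1.10 these names live directly on `tf`.
tf1 = getattr(tf.compat, "v1", tf)


def run_two_matmuls(size=2000, n_threads=2):
    """Run two independent matmuls fed through placeholders; return both results."""
    graph = tf.Graph()
    with graph.as_default():
        # Placeholders have no value at graph-construction time, so the
        # constant folding pass cannot precompute the products.
        a = tf1.placeholder(tf.float32, shape=[size, size], name="A")
        b = tf1.placeholder(tf.float32, shape=[size, size], name="B")
        c = tf1.placeholder(tf.float32, shape=[size, size], name="C")
        x = tf.matmul(a, b)
        y = tf.matmul(a, c)

    config = tf1.ConfigProto(
        inter_op_parallelism_threads=n_threads,  # independent ops may run concurrently
        intra_op_parallelism_threads=1,          # keep each op on a single thread
    )
    ones = np.ones([size, size], dtype=np.float32)
    with tf1.Session(graph=graph, config=config) as sess:
        # The actual values are supplied only at run time.
        return sess.run([x, y], feed_dict={a: ones, b: ones, c: ones})


if __name__ == "__main__":
    x_val, y_val = run_two_matmuls()
    print(x_val.shape, y_val.shape)
```

With a size large enough that each matmul takes a noticeable amount of time, htop should then be able to show more than one busy core, since the two products are computed at run time rather than folded sequentially beforehand.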