I am working on a machine learning system using the C++ API of PyTorch (libtorch).
One thing I have been working on recently is researching the performance, CPU utilization and GPU usage of libtorch. Through my research I understand that Torch uses two kinds of parallelization on CPUs:

- inter-op parallelization
- intra-op parallelization

My main questions are:

- what is the difference between these two, and
- how can I utilize inter-op parallelism?

I know that I can specify the number of threads used for intra-op parallelism (which, from my understanding, is performed using the OpenMP backend) with the `torch::set_num_threads()` function. As I monitor the performance of my models, I can clearly see that they use the number of threads I specify with this function, and I can see a clear performance difference when changing the number of intra-op threads.
There is also another function, `torch::set_num_interop_threads()`, but it seems that no matter how many inter-op threads I specify, I never see any difference in performance.
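For reference, this is roughly how I set both thread counts (a simplified sketch of my setup; the matmul is just a stand-in workload):

```cpp
#include <torch/torch.h>

int main() {
  // Intra-op threads: parallelism inside a single operator
  // (e.g. one large matmul), backed by the OpenMP / native thread pool.
  torch::set_num_threads(4);

  // Inter-op threads: meant for running independent operators concurrently.
  // Set early, before any inter-op work has been scheduled.
  torch::set_num_interop_threads(4);

  // A single large matmul only exercises the intra-op pool, which matches
  // what I observe: only set_num_threads() changes my timings.
  auto a = torch::rand({4096, 4096});
  auto b = torch::rand({4096, 4096});
  auto c = torch::mm(a, b);
}
```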
Now, I have read this PyTorch documentation article, but it is still unclear to me how to utilize the inter-op thread pool.
The docs say:
> PyTorch uses a single thread pool for the inter-op parallelism, this thread pool is shared by all inference tasks that are forked within the application process.
I have two questions about this part:

- Do I need to create new threads myself to utilize the interop threads, or does torch do it somehow for me internally?
- If I need to create new threads myself, how do I do it in C++, so that I create a new thread from the interop thread pool?

In the Python example they use the `fork` function from the `torch.jit` module, but I can't find anything similar in the C++ API.
> difference between these two

As one can see in this picture:
- intra-op - parallelization done for a single operation (like `matmul` or any other "per-tensor" operation)
- inter-op - you have multiple operations and their calculations can be intertwined

inter-op "example" (see the code sketch after the lists below):
- `op1` starts and returns a "Future" object (an object we can query for the result once the operation finishes)
- `op2` starts immediately after (as `op1` is non-blocking right now)
- `op2` finishes
- we wait on `op1` for its result (hopefully it has finished already, or is at least closer to finishing)
- we combine the `op1` and `op2` results (or do whatever else we'd like with them)

Due to the above:

- intra-op works without any additions (as it's handled by PyTorch) and should improve performance
- inter-op is user driven (the model's architecture, `forward` especially), hence the architecture must be created with inter-op in mind!
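To make the flow concrete, here is a rough sketch of that pattern in plain C++. Note that `std::async` runs on ordinary C++ threads, not on libtorch's inter-op thread pool, so it only mirrors the shape of the pattern; the `torch::mm` calls and tensor sizes are just stand-ins:

```cpp
#include <torch/torch.h>
#include <future>

// Future-based flow from the list above: launch op1, run op2 meanwhile,
// then wait for op1 and combine the results.
torch::Tensor run_both(const torch::Tensor& x, const torch::Tensor& y) {
  // "op1": launched asynchronously, returns a future we can query later
  std::future<torch::Tensor> op1 =
      std::async(std::launch::async, [&x] { return torch::mm(x, x); });

  // "op2": runs immediately on the current thread while op1 is in flight
  torch::Tensor op2 = torch::mm(y, y);

  // wait for op1 (hopefully already finished) and combine both results
  return op2 + op1.get();
}

int main() {
  torch::NoGradGuard no_grad;  // inference only
  auto x = torch::rand({1024, 1024});
  auto y = torch::rand({1024, 1024});
  auto z = run_both(x, y);
}
```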
> how can I utilize inter-op parallelism
Unless you architected your models with inter-op in mind (using, for example, Futures; see the first code snippet in the link you posted), you won't see any performance improvements.
Most probably:

- your model is exported to torchscript and only inference is done in C++
- the inter-op code is written in Python, e.g. using `torch.jit.fork` and `torch.jit.wait`
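To make that concrete, here is a minimal sketch of the C++ inference side, assuming the fork/wait structure was already baked into a module scripted and saved from Python (the file name `forked_model.pt` and the input shape are hypothetical):

```cpp
#include <torch/script.h>

#include <vector>

int main() {
  // Module scripted and saved from Python; any torch.jit.fork / torch.jit.wait
  // calls live inside its TorchScript graph.
  torch::jit::script::Module module = torch::jit::load("forked_model.pt");

  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::rand({1, 3, 224, 224}));

  // forward() executes the scripted graph; forked branches inside it are
  // scheduled by libtorch internally, which is where the inter-op pool
  // (and torch::set_num_interop_threads) should come into play.
  auto output = module.forward(inputs).toTensor();
}
```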
> do I need to create new threads myself to utilize the interop threads, or does torch do it somehow for me internally?
Not sure if it's possible in C++ currently; I can't find any `torch::jit::fork` or related functionality.
> If I need to create new threads myself, how do I do it in C++, so that I create a new thread from the interop thread pool?
Unlikely, as the C++ API's goal is to mimic Python's API as closely as possible. You might have to dig a little deeper into the source code related to it and/or post a feature request on their GitHub repo if needed.