I am working on a machine learning system using the C++ API of PyTorch (libtorch).
One thing I have been working on recently is researching the performance, CPU utilization and GPU usage of libtorch. Through my research I understand that Torch uses two kinds of parallelization on CPUs:

- inter-op parallelization
- intra-op parallelization

My main questions are:

- what is the difference between these two, and
- how can I utilize inter-op parallelism?

I know that I can specify the number of threads used for intra-op parallelism (which, from my understanding, is performed using the OpenMP backend) with the `torch::set_num_threads()` function. As I monitor the performance of my models, I can clearly see that they use the number of threads I specify with this function, and I can see a clear performance difference when changing the number of intra-op threads.
There is also another function, `torch::set_num_interop_threads()`, but it seems that no matter how many inter-op threads I specify, I never see any difference in performance.
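For reference, this is roughly how I set both thread counts (a simplified sketch of my setup; the matmul is just a stand-in workload):

```cpp
#include <torch/torch.h>

int main() {
  // Intra-op threads: parallelism inside a single operator
  // (e.g. one large matmul), backed by the OpenMP / native thread pool.
  torch::set_num_threads(4);

  // Inter-op threads: meant for running independent operators concurrently.
  // Set early, before any inter-op work has been scheduled.
  torch::set_num_interop_threads(4);

  // A single large matmul only exercises the intra-op pool, which matches
  // what I observe: only set_num_threads() changes my timings.
  auto a = torch::rand({4096, 4096});
  auto b = torch::rand({4096, 4096});
  auto c = torch::mm(a, b);
}
```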
Now, I have read this PyTorch documentation article, but it is still unclear to me how to utilize the inter-op thread pool.
The docs say:
> PyTorch uses a single thread pool for the inter-op parallelism, this thread pool is shared by all inference tasks that are forked within the application process.
I have two questions about this part:

- Do I need to create new threads myself to utilize the interop threads, or does torch do it somehow for me internally?
- If I need to create new threads myself, how do I do it in C++, so that I create a new thread from the interop thread pool?

In the Python example they use the `fork` function from the `torch.jit` module, but I can't find anything similar in the C++ API.
> difference between these two

As one can see in this picture:
- intra-op - parallelization done for a single operation (like `matmul` or any other "per-tensor" operation)
- inter-op - you have multiple operations and their calculations can be intertwined

inter-op "example" (see the code sketch after the lists below):
- `op1` starts and returns a "Future" object (an object we can query for the result once the operation finishes)
- `op2` starts immediately after (as `op1` is non-blocking right now)
- `op2` finishes
- we wait on `op1` for its result (hopefully it has finished already, or is at least closer to finishing)
- we combine the `op1` and `op2` results (or do whatever else we'd like with them)

Due to the above:

- intra-op works without any additions (as it's handled by PyTorch) and should improve performance
- inter-op is user driven (the model's architecture, `forward` especially), hence the architecture must be created with inter-op in mind!
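To make the flow concrete, here is a rough sketch of that pattern in plain C++. Note that `std::async` runs on ordinary C++ threads, not on libtorch's inter-op thread pool, so it only mirrors the shape of the pattern; the `torch::mm` calls and tensor sizes are just stand-ins:

```cpp
#include <torch/torch.h>
#include <future>

// Future-based flow from the list above: launch op1, run op2 meanwhile,
// then wait for op1 and combine the results.
torch::Tensor run_both(const torch::Tensor& x, const torch::Tensor& y) {
  // "op1": launched asynchronously, returns a future we can query later
  std::future<torch::Tensor> op1 =
      std::async(std::launch::async, [&x] { return torch::mm(x, x); });

  // "op2": runs immediately on the current thread while op1 is in flight
  torch::Tensor op2 = torch::mm(y, y);

  // wait for op1 (hopefully already finished) and combine both results
  return op2 + op1.get();
}

int main() {
  torch::NoGradGuard no_grad;  // inference only
  auto x = torch::rand({1024, 1024});
  auto y = torch::rand({1024, 1024});
  auto z = run_both(x, y);
}
```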
> how can I utilize inter-op parallelism
Unless you architected your models with inter-op in mind (using, for example, Futures; see the first code snippet in the link you posted), you won't see any performance improvements.
Most probably:

- your model is exported to torchscript and only inference is done in C++
- the inter-op code is written in Python, e.g. using `torch.jit.fork` and `torch.jit.wait`
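To make that concrete, here is a minimal sketch of the C++ inference side, assuming the fork/wait structure was already baked into a module scripted and saved from Python (the file name `forked_model.pt` and the input shape are hypothetical):

```cpp
#include <torch/script.h>

#include <vector>

int main() {
  // Module scripted and saved from Python; any torch.jit.fork / torch.jit.wait
  // calls live inside its TorchScript graph.
  torch::jit::script::Module module = torch::jit::load("forked_model.pt");

  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::rand({1, 3, 224, 224}));

  // forward() executes the scripted graph; forked branches inside it are
  // scheduled by libtorch internally, which is where the inter-op pool
  // (and torch::set_num_interop_threads) should come into play.
  auto output = module.forward(inputs).toTensor();
}
```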
> do I need to create new threads myself to utilize the interop threads, or does torch do it somehow for me internally?
Not sure if it's possible in C++ currently; I can't find any `torch::jit::fork` or related functionality.
> If I need to create new threads myself, how do I do it in C++, so that I create a new thread from the interop thread pool?
Unlikely, as the C++ API's goal is to mimic Python's API as closely as possible. You might have to dig a little deeper into the source code related to it and/or post a feature request on their GitHub repo if needed.