python, c++, multithreading, pytorch, libtorch

PyTorch C++ (libtorch), using inter-op parallelism


I am working on a machine learning system using the C++ API of PyTorch (libtorch).

One thing that I have been working on recently is researching the performance, CPU utilization, and GPU usage of libtorch. Through my research, I understand that Torch utilizes two kinds of parallelization on CPUs:

  • inter-op parallelization
  • intra-op parallelization

My main questions are:

  • difference between these two
  • how can I utilize inter-op parallelism

I know that I can specify the number of threads used for intra-op parallelism (which, from my understanding, is performed using the OpenMP backend) with the torch::set_num_threads() function. As I monitor the performance of my models, I can clearly see that libtorch uses the number of threads I specify with this function, and I see a clear performance difference when I change the number of intra-op threads.

There is also another function, torch::set_num_interop_threads(), but it seems that no matter how many inter-op threads I specify, I never see any difference in performance.
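For reference, the Python API exposes the same two knobs (torch.set_num_threads / torch.set_num_interop_threads), which the C++ calls above mirror; a minimal sketch of how the two pools are configured and inspected there:

```python
import torch

# Size of the inter-op thread pool; must be set before any inter-op
# parallel work has started (it can only be set once per process).
torch.set_num_interop_threads(4)

# Number of threads used for intra-op parallelism (e.g. the OpenMP backend).
torch.set_num_threads(4)

print(torch.get_num_interop_threads(), torch.get_num_threads())
print(torch.__config__.parallel_info())  # summary of the threading configuration
```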

Now, I have read this PyTorch documentation article, but it is still unclear to me how to utilize the inter-op thread pool.

The docs say:

PyTorch uses a single thread pool for the inter-op parallelism, this thread pool is shared by all inference tasks that are forked within the application process.

I have two questions about this part:

  • do I need to create new threads myself to utilize the interop threads, or does torch do it somehow for me internally?
  • If I need to create new threads myself, how do I do it in C++, so that I create a new thread from the inter-op thread pool?

In the Python example they use a fork function from the torch.jit module, but I can't find anything similar in the C++ API.


Solution

  • Questions

    difference between these two

    [image: intra_inter — diagram of intra-op vs. inter-op parallelism]

    As one can see in this picture:

    • intra-op - parallelization done within a single operation (like matmul or any other "per-tensor" op)
    • inter-op - you have multiple operations, and their computations can be overlapped

    inter-op "example":

    • op1 starts and returns a "Future" object (an object we can query for the result once the operation finishes)
    • op2 starts immediately after (as op1 is non-blocking)
    • op2 finishes
    • we query op1 for its result (hopefully already finished, or at least closer to finishing)
    • we add the op1 and op2 results together (or do whatever else we'd like with them)

    Due to the above:

    • intra-op works without any additions (as it is handled by PyTorch) and should improve performance out of the box
    • inter-op is user-driven (the model's architecture, especially forward), hence the architecture must be created with inter-op in mind!

    how can I utilize inter-op parallelism

    Unless you architected your models with inter-op in mind (using, for example, Futures; see the first code snippet in the link you posted), you won't see any performance improvements.

    Most probably:

    • Your models are written in Python, converted to TorchScript, and only inference is done in C++
    • You should write (or refactor existing) inter-op code in Python, e.g. using torch.jit.fork and torch.jit.wait (see the sketch below)
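    For illustration, a minimal sketch of that pattern (mirroring the fork/wait snippet from the docs article linked in the question; heavy_op is just a toy stand-in for an independent branch of the model):

```python
import torch

def heavy_op(x):
    # toy stand-in for an expensive, independent branch of the model
    return torch.mm(x, x)

@torch.jit.script
def forward_with_fork(x):
    # launch heavy_op asynchronously; under the JIT this is scheduled
    # on the inter-op thread pool
    fut = torch.jit.fork(heavy_op, x)
    # this work overlaps with the forked task (intra-op threads still
    # apply inside each individual op)
    y = torch.mm(x, x.t())
    # block until the forked result is ready, then combine
    return y + torch.jit.wait(fut)

out = forward_with_fork(torch.randn(512, 512))
```

    If I read the docs correctly, once such code is scripted and saved, loading it in C++ with torch::jit::load should let the forked branches run on libtorch's inter-op pool, which is where set_num_interop_threads becomes relevant.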

    do I need to create new threads myself to utilize the interop threads, or does torch do it somehow for me internally?

    Not sure if it's currently possible in C++; I can't find any torch::jit::fork or related functionality.

    If I need to create new threads myself, how do I do it in C++, so that I create a new thread from the inter-op thread pool?

    Unlikely, as the C++ API's goal is to mimic Python's API as closely as possible. You might have to dig a little deeper into the related source code and/or post a feature request on their GitHub repo if needed.