I have a newbie question about using multiple host threads with ArrayFire for Python. We currently have a highly parallel CPU-only code, parallelized using Open MPI and mpi4py. Each CPU thread performs large matrix multiplications, often with multiple threads multiplying simultaneously. We would like to improve performance by performing matrix multiplications on a single GPU using ArrayFire.
I am trying to figure out whether we can have multiple CPU host threads send matrix multiplication jobs to the GPU, and have the GPU perform these multiplications simultaneously. Or, must each CPU host thread wait until the GPU is idle to send a multiplication job to the GPU?
I'm having trouble finding an answer because I am not well-versed in the language of GPU computing. My impression is that certain GPUs support concurrent kernel execution, but I've been unable to determine whether our GPU (Radeon Vega 10) does.
Any general tips or resources on how to do things like this with ArrayFire for Python would be appreciated.
Matrix multiplications are very fast on GPUs. It is generally a good decision to switch to GPUs for doing matrix math. I shall answer your questions in order. Note that most of what I say here is applicable to both AMD and NVIDIA GPUs.
Yes, you can launch multiple host threads, each of which can enqueue its own kernels without waiting for prior jobs to finish. Kernel launches are asynchronous in nature, so enqueuing a kernel on the device won't block host execution; the kernels are simply queued for future execution on the GPU. Now, the question is whether all those kernels will execute concurrently - that depends entirely on the resources required by a single kernel instance. If the GPU can accommodate two kernel executions at the same time, it will do so automatically for you. The kinds of resources required by a kernel launch that determine this are the number of blocks launched, shared memory, constant memory, etc.
Concurrent kernel execution depends entirely on how many resources a single kernel instance needs. Also, each kernel instance has to be launched on a separate queue (an OpenCL command queue, in your case), because all kernels enqueued on the same queue execute in order.
To use ArrayFire efficiently, I would advise you to go through the tutorials listed here. To set up a multi-threaded solution where each kernel is launched on a separate queue, you will probably want to concentrate on the following two sections of the tutorials, especially the second one.
Most of the examples in the documentation are in C++, but the general principles apply to the Python wrapper as well. If you have questions specific to the Python wrapper, you can post them here.