
Why do multiple processes accessing a single GPU increase performance?


I am using PyOpenCL in combination with Python 3.7.

When calling the same kernel from multiple processes, each with its own context pointing to the same GPU device, I get performance improvements that scale almost linearly with the number of processes.

I can imagine that executing processes in parallel makes some overlapping transfers possible, where a kernel of process A is executed while process B sends data to the graphics card. But that alone should not be responsible for such a boost in performance.

Below you find a code example, in which I implemented a dummy application that decodes some data.

When setting n_processes=1 I get around 12 Mbit/sec, while when setting n_processes=4 I get 45 Mbit/sec.

I am using a single AMD Radeon VII graphics card.

Does anyone have a good explanation for this phenomenon?

Update: I profiled the script using CodeXL. It looks like a lot of time is wasted between kernel executions, and multiple processes are able to make use of that idle time.

[CodeXL timeline: kernel executions on the GPU separated by large idle gaps]
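(For reference, the idle gaps can also be measured directly from Python with OpenCL event profiling instead of CodeXL; the snippet below is only a sketch and reuses the context, program and buffers from the full script further down.)

    # Sketch: time each kernel and the idle gap before the next launch using
    # OpenCL event profiling (ctx, prg and the buffers come from the script below).
    queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    events = []
    for i in range(10):
        evt = prg.test(queue, input_buffer.shape, None,
                       input_buffer.data,
                       permutation_buffer.data,
                       output_buffer.data)
        events.append(evt)
    queue.finish()

    for prev, cur in zip(events, events[1:]):
        exec_us = (prev.profile.end - prev.profile.start) / 1e3
        gap_us = (cur.profile.start - prev.profile.end) / 1e3
        print('kernel {:.1f} us, gap before next launch {:.1f} us'.format(exec_us, gap_us))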

import logging
import multiprocessing as mp
import pyopencl as cl
import pyopencl.array as cl_array
from mako.template import Template
import numpy as np
import time

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(process)d %(levelname)-8s [%(filename)s:%(lineno)d] %(message)s')

kernelsource = """
float boxplus(float a,float b)
{
float boxp=log((1+exp(a+b))/(exp(a)+exp(b)));
return boxp;
}

void kernel test(global const float* in,
                global const int* permutation_vector,
                global float* out)
{
 int gid = get_global_id(0);
 int p = gid; // permutation index
 float t = 0.0;
 for(int k=1; k<10;k++){
    p = permutation_vector[p];
    t= boxplus(in[p],in[gid]);
 }
 out[gid] = t;
}
"""


class MyProcess(mp.Process):
    def __init__(self, q):
        super().__init__()
        self.q = q

    def run(self) -> None:
        platform = cl.get_platforms()
        my_gpu_devices = [platform[0].get_devices(device_type=cl.device_type.GPU)[0]]
        ctx = cl.Context(devices=my_gpu_devices)
        queue = cl.CommandQueue(ctx)
        tpl = Template(kernelsource)
        rendered_tp = tpl.render()
        prg = cl.Program(ctx, str(rendered_tp)).build()

        size = 100000  # shape of random input array
        dtype = np.float64
        output_buffer = cl_array.empty(queue, size, dtype=dtype)
        input_buffer = cl_array.empty(queue, size, dtype=dtype)

        permutation = np.random.permutation(size)
        permutation_buffer = cl_array.to_device(queue, permutation.astype(np.int32))  # kernel expects 32-bit ints

        def decode(data_in):
            input_buffer.set(data_in)
            for i in range(10):
                prg.test(queue, input_buffer.shape, None,
                         input_buffer.data,
                         permutation_buffer.data,
                         output_buffer.data)
            queue.finish()
            return output_buffer.get()

        counter = 1
        while True:
            data_in = np.random.normal(size=size).astype(dtype)
            data_out = decode(data_in)
            if counter % 100 == 0:
                self.q.put(size * 100)
                counter = 1
            else:
                counter += 1


def run_test_multi_cpu_single_gpu():
    q = mp.Queue()
    n_processes = 4
    for i in range(n_processes):
        MyProcess(q).start()
    t0 = time.time()
    symbols_sum = q.get()
    i = 0
    while True:
        i += 1
        print('{} Mbit/sec'.format(1 / 1e6 * symbols_sum / (time.time() - t0 + 1e-15)))
        symbols = q.get()
        symbols_sum += symbols

if __name__ == '__main__':
    run_test_multi_cpu_single_gpu()

Solution

  • The kernel loop does too little work; the kernel's runtime must be almost comparable to the kernel launch overhead, and the kernel launch overhead is in turn comparable to the overhead of a Python function call.

     for(int k=1; k<10;k++){
        p = permutation_vector[p];
        t= boxplus(in[p],in[gid]);
     }
    

    This latency is probably hidden behind another process's kernel launch latency, and that launch latency is probably hidden behind a third process's Python function call overhead. The GPU could also take far more work: the for loop runs only 10 iterations over O(N) data, while even low-end GPUs only get saturated with at least thousands of iterations or O(N*N) complexity (a minimal sketch of raising the per-launch work is at the end of this answer).

    Also, the buffer reads/writes and the compute overlap, as you said.

    So if the kernel takes up all the time in that profiling window, is there no capacity left on the graphics card?

    The GPU can also overlap multiple compute workloads if it has the capability and if each workload is small enough to leave some in-flight threads for the others. The number of in-flight threads can be as high as 40 per shader: 40 * 3840 = 153600 instructions issued/pipelined per cycle (or per a few cycles), or let's say 3.46 TFLOPS.

    At 3.46 TFLOPS, even with 1000 FLOP spent per 64-bit data element, the card can process about 3.46 billion elements per second, i.e. stream 64-bit data at roughly 28 GB/s (the arithmetic is spelled out at the end of this answer). That is without pipelining anything in the kernel (read element 1, compute, write the result, read element 2). But it does pipeline: just after the first element's compute starts, the next batch of items is mapped onto the same shaders and starts loading new data, so the card can consume hundreds of GB/s, which is more than PCI-e bandwidth.

    Also, the CPU can't pre-process/post-process at that rate. So the buffer copies and the CPU are the bottlenecks, and with multiple processes those bottlenecks are hidden behind each other.
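    To make the throughput estimate above concrete, here is the arithmetic as a small sketch (the 40 in-flight threads per shader and the 1000 FLOP per element are rough assumptions, not measurements):

        shaders = 3840                    # Radeon VII stream processors
        waves_per_shader = 40             # assumed in-flight work-items per shader
        in_flight = shaders * waves_per_shader           # 153600
        fp64_flops = 3.46e12              # peak FP64 rate of the card
        flop_per_element = 1000           # assumed arithmetic per 64-bit element
        elements_per_s = fp64_flops / flop_per_element   # ~3.46e9 elements/s
        gbytes_per_s = elements_per_s * 8 / 1e9          # ~27.7 GB/s of 64-bit data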
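    On the first point, here is a minimal sketch of giving each launch more work, assuming the loop bound can simply be raised: the Mako template already used in the script makes it easy to turn the iteration count into a render-time parameter (the n_iterations name is made up for this sketch, it is not in the original code):

        # Sketch only: one launch now does thousands of iterations instead of 10,
        # so the kernel runtime dominates the launch overhead.
        # Reuses ctx, cl and Template from the script in the question.
        kernelsource = """
        float boxplus(float a, float b)
        {
            return log((1 + exp(a + b)) / (exp(a) + exp(b)));
        }

        void kernel test(global const float* in,
                         global const int* permutation_vector,
                         global float* out)
        {
            int gid = get_global_id(0);
            int p = gid;
            float t = 0.0f;
            for (int k = 0; k < ${n_iterations}; k++) {
                p = permutation_vector[p];
                t = boxplus(in[p], in[gid]);
            }
            out[gid] = t;
        }
        """

        rendered_tp = Template(kernelsource).render(n_iterations=10000)
        prg = cl.Program(ctx, str(rendered_tp)).build()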