I'm exploring how to perform calculations efficiently on GPUs using CuPy.
In my particular application the execution time reported by timeit depends on the number of runs (of course). However, it does not grow linearly: the slope is small at first and then becomes much larger. See for yourself: (plot: bi-linear increase of execution time with the number of executions)
My question is: Why is that?
I am not very experienced with GPU calculations or numeric internals. I just thought it would be an interesting question to ask.
Here is the code I used to measure the times:
from timeit import timeit

import cupy as cp

n = 401
s = 100
p = 100

x = cp.linspace(-5, 5, n, dtype=cp.float32)[:, cp.newaxis].repeat(s, 1)
sig = cp.random.uniform(.2, .4, (s, p), dtype=cp.float32)
a = cp.random.uniform(1, 2, (s, p), dtype=cp.float32)
c = cp.random.uniform(-3, 3, (s, p), dtype=cp.float32)

def cp_g(x, a, c, s):
    # sum of Gaussians over the last axis
    return cp.sum(cp.multiply(cp.exp(-cp.square((x[..., cp.newaxis] - c) / s)),
                              a * s / cp.sqrt(cp.float32(cp.pi))),
                  axis=-1)

for i in cp.arange(10, 1000, 10):
    print(int(i), timeit('y = cp_g(x, a, c, sig)', globals=globals(), number=int(i)))
P.S.: In case it matters, the hardware I use is a GeForce 1660 Super, CUDA 10.2, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)].
This behavior is easily understood!
timeit only blocks until the CPU thread returns, but CuPy launches its kernels asynchronously, so the GPU time is not included in the measurement. As long as the launch queue is not full, each call returns almost immediately (the small slope); once the queue fills up, kernel launches start to block until the GPU catches up (the large slope). To measure the actual GPU time, one needs to add a synchronization call to the timeit statement:
timeit('y = cp_g(x, a, c, sig); cp.cuda.Device().synchronize()', ...)
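For completeness, here is a minimal sketch of the corrected measurement loop, assuming the same cp_g, x, a, c and sig as in the question; the synchronize() call makes the host wait until all queued kernels have finished, so the measured time now includes the GPU work:

from timeit import timeit
import cupy as cp

# warm-up call so one-time compilation/allocation overhead is not measured
y = cp_g(x, a, c, sig)
cp.cuda.Device().synchronize()

for i in range(10, 1000, 10):
    t = timeit('y = cp_g(x, a, c, sig); cp.cuda.Device().synchronize()',
               globals=globals(), number=i)
    print(i, t / i)  # average time per call, now including GPU execution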
Thank you Leo Fang for your comment!
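As a further sketch (my own addition, not part of the original answer), CUDA events can be used to time the GPU work directly, which avoids most host-side overhead:

import cupy as cp

start = cp.cuda.Event()
end = cp.cuda.Event()

start.record()           # enqueue the start marker on the current stream
for _ in range(100):
    y = cp_g(x, a, c, sig)
end.record()             # enqueue the end marker after all kernel launches
end.synchronize()        # wait until the GPU has reached the end marker

# elapsed GPU time in milliseconds for the 100 calls
print(cp.cuda.get_elapsed_time(start, end))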