Numba CUDA computation seems to be slower than a sequential run. Did I make any obvious mistakes?

There are several threads covering similar topics, but unfortunately, these seem to be too complicated for me, so I would like to ask a similar question, hoping that someone will have a look at my code specifically to tell me if I got something wrong.

I am learning Numba CUDA right now, starting with the simple examples one can find on the net. I started with this tutorial:

https://github.com/ContinuumIO/gtc2017-numba/blob/master/4%20-%20Writing%20CUDA%20Kernels.ipynb

which shows how to add two arrays in parallel. The system configuration they used to measure the times is not given. To replicate the code, I use a GeForce GTX 1080 Ti and an Intel Core i7 8700K CPU.

I basically copied the addition script from the tutorial, but also added sequential code for comparison:

from numba import cuda
import numpy as np
import time
import math

@cuda.jit
def addition_kernel(x, y, out):

    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    block_size = cuda.blockDim.x  
    grid_size = cuda.gridDim.x  

    start = tx+ ty * block_size
    stride = block_size * grid_size
    for i in range(start, x.shape[0], stride):
        out[i] = y[i] + x[i]

def add(n, x, y):
    for i in range(n):
        y[i] = y[i] + x[i]


if __name__ =="__main__":
    print(cuda.gpus[0])
    print("")
    n = 100000
    x = np.arange(n).astype(np.float32)
    y = 2 * x
    out = np.empty_like(x)
    x_device = cuda.to_device(x)
    y_device = cuda.to_device(y)
    out_device = cuda.device_array_like(x)



    # Set the number of threads in a block

    threadsperblock = 128

    # Calculate the number of thread blocks in the grid
    blockspergrid = 30#math.ceil(n[0] / threadsperblock)
    # Now start the kernel
    start = time.process_time()
    cuda.synchronize()
    addition_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)
    cuda.synchronize()
    end = time.process_time()
    out_global_mem = out_device.copy_to_host()
    print("parallel time: ", end - start)

    start = time.process_time()
    add(n,x,y)
    end = time.process_time()
    print("sequential time: ", end-start)

The parallel time is on average around 0.14 seconds, while the code without GPU kernel takes only 0.02 seconds.

This seems quite strange to me. Is there anything I did wrong? Or is this problem not a good example for parallelism? (I do not think so, since the iterations of the for loop can run in parallel.)

What is odd is that I hardly notice a difference if I do not use the to_device() functions. As far as I understand, these should be important, as they avoid the communication between CPU and GPU after each iteration.

Solution

addition_kernel is compiled at runtime, the first time it is called, which is right in the middle of your measured time! Compiling a kernel is a fairly expensive operation. You can force the compilation to be done eagerly (i.e. when the function is defined) by providing the type signature to Numba.
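One way to make this one-time cost visible is to time the first call separately from later calls. The following is only a generic sketch, not Numba API: `time_first_and_later` and `fake_jit_call` are hypothetical names, and the `time.sleep` merely stands in for the compilation that happens on the first call.

```python
import time

def time_first_and_later(fn, repeats=5):
    """Return (first_call_seconds, best_later_call_seconds) for fn()."""
    t0 = time.perf_counter()
    fn()  # the first call pays any one-time cost (e.g. JIT compilation)
    first = time.perf_counter() - t0

    later = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        later.append(time.perf_counter() - t0)
    return first, min(later)

# Hypothetical stand-in for a lazily compiled function.
_compiled = False

def fake_jit_call():
    global _compiled
    if not _compiled:
        time.sleep(0.05)  # simulated one-time compilation cost
        _compiled = True
    return sum(range(1000))

first, later = time_first_and_later(fake_jit_call)
print(f"first call: {first:.4f} s, later calls: {later:.6f} s")
```

With the code from the question, `fn` would be a small function that launches `addition_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)` and then calls `cuda.synchronize()`.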

Note that the arrays are a bit too small for you to see a big improvement on a GPU. Moreover, the comparison with the CPU version is not really fair: you should also use Numba for the CPU implementation, or at least NumPy (but not an interpreted pure-CPython loop).
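To get a feeling for how much the interpreted loop costs on the CPU side, it can be compared with a vectorized NumPy addition. This is a minimal sketch that reuses the array setup from the question; `add_loop`, `y_loop` and `y_vec` are names chosen here for illustration, and the timings will depend on the machine.

```python
import time
import numpy as np

def add_loop(n, x, y):
    # Same interpreted loop as in the question: y is updated in place.
    for i in range(n):
        y[i] = y[i] + x[i]

n = 100000
x = np.arange(n).astype(np.float32)
y_loop = 2 * x
y_vec = 2 * x

t0 = time.perf_counter()
add_loop(n, x, y_loop)
t_loop = time.perf_counter() - t0

t0 = time.perf_counter()
np.add(y_vec, x, out=y_vec)  # vectorized, in-place equivalent of the loop
t_vec = time.perf_counter() - t0

print(f"pure Python loop: {t_loop:.6f} s")
print(f"NumPy add:        {t_vec:.6f} s")
```

Both versions produce the same array, so only the timings differ.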

Here is an example:

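import numba as nb
from numba import cuda

# An explicit signature makes Numba compile the kernel eagerly,
# when the function is defined, rather than on the first call.
@cuda.jit('void(float32[::1], float32[::1], float32[::1])')
def addition_kernel(x, y, out):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x  # block index within the grid
    block_size = cuda.blockDim.x
    grid_size = cuda.gridDim.x

    # Grid-stride loop: each thread handles start, start + stride, ...
    start = tx + ty * block_size
    stride = block_size * grid_size
    for i in range(start, x.shape[0], stride):
        out[i] = y[i] + x[i]

# CPU version, also compiled eagerly by Numba for a fairer comparison.
@nb.njit('void(int64, float32[::1], float32[::1])')
def add(n, x, y):
    for i in range(n):
        y[i] = y[i] + x[i]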
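A further point that may matter when timing these functions: the question uses time.process_time(), which counts only the CPU time of the process and does not include time spent sleeping. If the host thread blocks while waiting for the GPU, part of that wait may not be counted, so a wall-clock timer such as time.perf_counter() is probably the safer choice. The sketch below illustrates the difference with a plain time.sleep; `measure_sleep` is a hypothetical helper, and how cuda.synchronize() actually waits is not something this example shows.

```python
import time

def measure_sleep(seconds):
    """Measure one sleep with both the CPU-time and the wall-clock timer."""
    cpu0 = time.process_time()
    wall0 = time.perf_counter()
    time.sleep(seconds)  # the process is idle here
    cpu = time.process_time() - cpu0
    wall = time.perf_counter() - wall0
    return cpu, wall

cpu, wall = measure_sleep(0.1)
print(f"process_time: {cpu:.4f} s")   # close to zero: sleeping uses no CPU
print(f"perf_counter: {wall:.4f} s")  # roughly the full sleep duration
```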