I am currently working on accelerating some numerical processing by offloading it to the GPU. I have some demonstration code below (the actual code will be more complicated). I am taking a NumPy array and counting how many values fall within a range.
Hardware-wise, I am running an AMD 3600X (6 cores / 12 threads) and an RTX 2060 Super (2176 CUDA cores).
Sample code:
import time
import numpy as np
from numba import cuda
from numba import jit
width = 1024
height = 1024
size = width * height
print(f'Number of records {size}')
array_of_random = np.random.rand(size)
output_array = np.zeros(size, dtype=bool)
device_array = cuda.to_device(array_of_random)
device_output_array = cuda.device_array_like(output_array)
def count_array_standard(array, pivot_point, local_output_array):
    for i in range(array.shape[0]):
        if (pivot_point - 0.05) < array[i] < (pivot_point + 0.05):
            local_output_array[i] = True
        else:
            local_output_array[i] = False

@jit('(f8,b1[:])')
def count_array_jit(pivot_point, local_output_array):
    global array_of_random
    for i in range(len(array_of_random)):
        if (pivot_point - 0.05) < array_of_random[i] < (pivot_point + 0.05):
            local_output_array[i] = True
        else:
            local_output_array[i] = False

@cuda.jit()
def count_array_cuda(local_device_array, pivot_point, local_device_output_array):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    bw = cuda.blockDim.x
    pos = tx + ty * bw
    for i in range(pos, pos + bw):
        if i < local_device_output_array.size:
            if (pivot_point - 0.05) < local_device_array[i] < (pivot_point + 0.05):
                local_device_output_array[i] = True
            else:
                local_device_output_array[i] = False
print("")
print("Standard")
for x in range(3):
    start = time.perf_counter()
    count_array_standard(array_of_random, 0.5, output_array)
    result = np.sum(output_array)
    print(f'Run: {x} Result: {result} Time: {time.perf_counter() - start}')
print("")
print("Jit")
for x in range(3):
    start = time.perf_counter()
    count_array_jit(0.5, output_array)
    result = np.sum(output_array)
    print(f'Run: {x} Result: {result} Time: {time.perf_counter() - start}')
print("")
print("Cuda Jit")
threads_per_block = 16
blocks_per_grid = (array_of_random.size + (threads_per_block - 1)) // threads_per_block
for x in range(3):
    start = time.perf_counter()
    count_array_cuda[blocks_per_grid, threads_per_block](device_array, .5, device_output_array)
    result = np.sum(device_output_array.copy_to_host())
    print(f'Run: {x} Result: {result} Time: {time.perf_counter() - start}')
This gives me the following results:
Number of records 1048576
Standard
Run: 0 Result: 104778 Time: 0.35327580000000003
Run: 1 Result: 104778 Time: 0.3521047999999999
Run: 2 Result: 104778 Time: 0.35452510000000004
Jit
Run: 0 Result: 104778 Time: 0.0020474000000001435
Run: 1 Result: 104778 Time: 0.001856599999999986
Run: 2 Result: 104778 Time: 0.0018399000000000054
Cuda Jit
Run: 0 Result: 104778 Time: 0.10867309999999986
Run: 1 Result: 104778 Time: 0.0023599000000000814
Run: 2 Result: 104778 Time: 0.002314700000000114
Both Numba's basic jit and the cuda jit are faster than the standard code. While I do expect the initial run of each jitted version to take a little longer, the subsequent runs are still faster with the plain jit than with CUDA. I am also seeing the best CUDA results with only around 16 threads per block, where I was expecting to need a much higher thread count.
As I am new to CUDA coding, I am wondering if I have missed something basic. Any guidance gratefully received.
There are two issues that I see:
1. The amount of work you are doing per data item in your input array is too small to be interesting on the GPU.
2. The organization of threads you have chosen, coupled with the for-loop in the cuda.jit kernel, is doing redundant work: each thread loops over a full block-width of elements starting at its own index, so neighbouring threads process largely overlapping ranges and most elements get written many times.
To address item 1, you would probably need to do more work per item than just comparing it against the limits and writing the result of the comparison. Alternatively, if you are really motivated by this benchmarking exercise, you can separate out the data movement and time the kernel by itself, to see what just the computation costs.
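For example, one way to time only the kernel (excluding host/device transfers and the first-call compilation) is to use CUDA events through Numba. This is just a minimal sketch reusing the names from the code above, separate from the timed example below:

start_evt = cuda.event(timing=True)
stop_evt = cuda.event(timing=True)

# warm-up launch so that JIT compilation is not included in the measurement
count_array_cuda[blocks_per_grid, threads_per_block](device_array, 0.5, device_output_array)
cuda.synchronize()

# record events around the kernel launch and read back the GPU-side elapsed time
start_evt.record()
count_array_cuda[blocks_per_grid, threads_per_block](device_array, 0.5, device_output_array)
stop_evt.record()
stop_evt.synchronize()
print('kernel time (ms):', cuda.event_elapsed_time(start_evt, stop_evt))

The example below takes the simpler route of calling cuda.synchronize() before reading a host-side timer, which measures much the same thing once the kernel has already been compiled.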
For a simple approach to address item 2, I would get rid of the for-loop in the cuda.jit kernel and have each thread handle one element of the input array. Here is an example that does that (converted to Python 2.x because that is the machine setup I had handy to play with numba):
$ cat t58.py
import time
import numpy as np
from numba import cuda
from numba import jit
width = 1024
height = 1024
size = width * height
print("Number of records")
print(size)
array_of_random = np.random.rand(size)
output_array = np.zeros(size, dtype=bool)
device_array = cuda.to_device(array_of_random)
device_output_array = cuda.device_array_like(output_array)
def count_array_standard(array, pivot_point, local_output_array):
    for i in range(array.shape[0]):
        if (pivot_point - 0.05) < array[i] < (pivot_point + 0.05):
            local_output_array[i] = True
        else:
            local_output_array[i] = False

@jit('(f8,b1[:])')
def count_array_jit(pivot_point, local_output_array):
    global array_of_random
    for i in range(len(array_of_random)):
        if (pivot_point - 0.05) < array_of_random[i] < (pivot_point + 0.05):
            local_output_array[i] = True
        else:
            local_output_array[i] = False

@cuda.jit()
def count_array_cuda(local_device_array, pivot_point, local_device_output_array):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    bw = cuda.blockDim.x
    i = tx + ty * bw
    if i < local_device_output_array.size:
        if (pivot_point - 0.05) < local_device_array[i] < (pivot_point + 0.05):
            local_device_output_array[i] = True
        else:
            local_device_output_array[i] = False
print("")
print("Standard")
for x in range(3):
    start = time.clock()
    count_array_standard(array_of_random, 0.5, output_array)
    result = np.sum(output_array)
    print(x)
    print(result)
    print(time.clock() - start)
print("")
print("Jit")
for x in range(3):
    start = time.clock()
    count_array_jit(0.5, output_array)
    result = np.sum(output_array)
    print(x)
    print(result)
    print(time.clock() - start)
print("")
print("Cuda Jit")
threads_per_block = 128
blocks_per_grid = (array_of_random.size + (threads_per_block - 1)) // threads_per_block
for x in range(3):
    start = time.clock()
    count_array_cuda[blocks_per_grid, threads_per_block](device_array, .5, device_output_array)
    cuda.synchronize()
    stop = time.clock()
    result = np.sum(device_output_array.copy_to_host())
    print(x)
    print(result)
    print(stop - start)
$ python t58.py
Number of records
1048576
Standard
0
104891
0.53704
1
104891
0.528287
2
104891
0.515948
Jit
0
104891
0.002993
1
104891
0.002635
2
104891
0.002595
Cuda Jit
0
104891
0.146518
1
104891
0.000832
2
104891
0.000813
$
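As an aside, if you did want to keep a loop inside the kernel (for example so that a fixed-size grid can cover an arbitrarily large array), the usual pattern is a grid-stride loop, where each thread starts at its global index and steps by the total number of threads in the grid, rather than the overlapping per-thread ranges in the original kernel. A rough sketch of that variant (not the code timed above):

@cuda.jit
def count_array_cuda_strided(local_device_array, pivot_point, local_device_output_array):
    start = cuda.grid(1)       # absolute index of this thread across the whole grid
    stride = cuda.gridsize(1)  # total number of threads in the grid
    for i in range(start, local_device_output_array.size, stride):
        if (pivot_point - 0.05) < local_device_array[i] < (pivot_point + 0.05):
            local_device_output_array[i] = True
        else:
            local_device_output_array[i] = False

With that pattern each element is visited exactly once, no matter how many blocks you launch.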