I am currently working on accelerating some numerical processing by offloading it to the GPU. I have some demonstration code below (the actual code will be more complicated). I am taking a NumPy array and counting how many values fall within a range.
Hardware-wise, I am running an AMD 3600X (6 cores / 12 threads) and an RTX 2060 Super (2176 CUDA cores).
Sample code:
import time
import numpy as np
from numba import cuda
from numba import jit
width = 1024
height = 1024
size = width * height
print(f'Number of records {size}')
array_of_random = np.random.rand(size)
output_array = np.zeros(size, dtype=bool)
device_array = cuda.to_device(array_of_random)
device_output_array = cuda.device_array_like(output_array)
def count_array_standard(array, pivot_point, local_output_array):
    for i in range(array.shape[0]):
        if (pivot_point - 0.05) < array[i] < (pivot_point + 0.05):
            local_output_array[i] = True
        else:
            local_output_array[i] = False

@jit('(f8,b1[:])')
def count_array_jit(pivot_point, local_output_array):
    global array_of_random
    for i in range(len(array_of_random)):
        if (pivot_point - 0.05) < array_of_random[i] < (pivot_point + 0.05):
            local_output_array[i] = True
        else:
            local_output_array[i] = False

@cuda.jit()
def count_array_cuda(local_device_array, pivot_point, local_device_output_array):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    bw = cuda.blockDim.x
    pos = tx + ty * bw
    for i in range(pos, pos + bw):
        if i < local_device_output_array.size:
            if (pivot_point - 0.05) < local_device_array[i] < (pivot_point + 0.05):
                local_device_output_array[i] = True
            else:
                local_device_output_array[i] = False
print("")
print("Standard")
for x in range(3):
    start = time.perf_counter()
    count_array_standard(array_of_random, 0.5, output_array)
    result = np.sum(output_array)
    print(f'Run: {x} Result: {result} Time: {time.perf_counter() - start}')
print("")
print("Jit")
for x in range(3):
    start = time.perf_counter()
    count_array_jit(0.5, output_array)
    result = np.sum(output_array)
    print(f'Run: {x} Result: {result} Time: {time.perf_counter() - start}')
print("")
print("Cuda Jit")
threads_per_block = 16
blocks_per_grid = (array_of_random.size + (threads_per_block - 1)) // threads_per_block
for x in range(3):
    start = time.perf_counter()
    count_array_cuda[blocks_per_grid, threads_per_block](device_array, .5, device_output_array)
    result = np.sum(device_output_array.copy_to_host())
    print(f'Run: {x} Result: {result} Time: {time.perf_counter() - start}')
This gives me the following results:
Number of records 1048576
Standard
Run: 0 Result: 104778 Time: 0.35327580000000003
Run: 1 Result: 104778 Time: 0.3521047999999999
Run: 2 Result: 104778 Time: 0.35452510000000004
Jit
Run: 0 Result: 104778 Time: 0.0020474000000001435
Run: 1 Result: 104778 Time: 0.001856599999999986
Run: 2 Result: 104778 Time: 0.0018399000000000054
Cuda Jit
Run: 0 Result: 104778 Time: 0.10867309999999986
Run: 1 Result: 104778 Time: 0.0023599000000000814
Run: 2 Result: 104778 Time: 0.002314700000000114
Both Numba's basic jit and the cuda jit are faster than the standard code. While I do expect the initial run of each jitted version to take a little longer, the subsequent runs are still faster with the plain jit than with CUDA. I am also seeing the best CUDA results with only around 16 threads per block, where I was expecting to need a much higher thread count.
As I am new to CUDA coding, I am wondering if I have missed something basic. Any guidance gratefully received.
There are two issues that I see:
1. The amount of work you are doing per data item in your input array is too small to be interesting on the GPU.
2. The organization of threads you have chosen, coupled with the for-loop in the cuda.jit kernel, is doing redundant work: each thread loops over a full block-width of elements starting at its own index, so neighbouring threads process largely overlapping ranges and most elements get written many times.
To address item 1, you would probably need to do more work per item than just comparing it against the limits and writing the result of the comparison. Alternatively, if you are really motivated by this benchmarking exercise, you can separate out the data movement and time the kernel by itself, to see what just the computation costs.
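For example, one way to time only the kernel (excluding host/device transfers and the first-call compilation) is to use CUDA events through Numba. This is just a minimal sketch reusing the names from the code above, separate from the timed example below:

start_evt = cuda.event(timing=True)
stop_evt = cuda.event(timing=True)

# warm-up launch so that JIT compilation is not included in the measurement
count_array_cuda[blocks_per_grid, threads_per_block](device_array, 0.5, device_output_array)
cuda.synchronize()

# record events around the kernel launch and read back the GPU-side elapsed time
start_evt.record()
count_array_cuda[blocks_per_grid, threads_per_block](device_array, 0.5, device_output_array)
stop_evt.record()
stop_evt.synchronize()
print('kernel time (ms):', cuda.event_elapsed_time(start_evt, stop_evt))

The example below takes the simpler route of calling cuda.synchronize() before reading a host-side timer, which measures much the same thing once the kernel has already been compiled.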
For a simple approach to address item 2, I would get rid of the for-loop in the cuda.jit kernel and have each thread handle one element of the input array. Here is an example that does that (converted to Python 2.x because that is the machine setup I had handy to play with numba):
$ cat t58.py
import time
import numpy as np
from numba import cuda
from numba import jit
width = 1024
height = 1024
size = width * height
print("Number of records")
print(size)
array_of_random = np.random.rand(size)
output_array = np.zeros(size, dtype=bool)
device_array = cuda.to_device(array_of_random)
device_output_array = cuda.device_array_like(output_array)
def count_array_standard(array, pivot_point, local_output_array):
    for i in range(array.shape[0]):
        if (pivot_point - 0.05) < array[i] < (pivot_point + 0.05):
            local_output_array[i] = True
        else:
            local_output_array[i] = False

@jit('(f8,b1[:])')
def count_array_jit(pivot_point, local_output_array):
    global array_of_random
    for i in range(len(array_of_random)):
        if (pivot_point - 0.05) < array_of_random[i] < (pivot_point + 0.05):
            local_output_array[i] = True
        else:
            local_output_array[i] = False

@cuda.jit()
def count_array_cuda(local_device_array, pivot_point, local_device_output_array):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    bw = cuda.blockDim.x
    i = tx + ty * bw
    if i < local_device_output_array.size:
        if (pivot_point - 0.05) < local_device_array[i] < (pivot_point + 0.05):
            local_device_output_array[i] = True
        else:
            local_device_output_array[i] = False
print("")
print("Standard")
for x in range(3):
    start = time.clock()
    count_array_standard(array_of_random, 0.5, output_array)
    result = np.sum(output_array)
    print(x)
    print(result)
    print(time.clock() - start)
print("")
print("Jit")
for x in range(3):
    start = time.clock()
    count_array_jit(0.5, output_array)
    result = np.sum(output_array)
    print(x)
    print(result)
    print(time.clock() - start)
print("")
print("Cuda Jit")
threads_per_block = 128
blocks_per_grid = (array_of_random.size + (threads_per_block - 1)) // threads_per_block
for x in range(3):
    start = time.clock()
    count_array_cuda[blocks_per_grid, threads_per_block](device_array, .5, device_output_array)
    cuda.synchronize()
    stop = time.clock()
    result = np.sum(device_output_array.copy_to_host())
    print(x)
    print(result)
    print(stop - start)
$ python t58.py
Number of records
1048576
Standard
0
104891
0.53704
1
104891
0.528287
2
104891
0.515948
Jit
0
104891
0.002993
1
104891
0.002635
2
104891
0.002595
Cuda Jit
0
104891
0.146518
1
104891
0.000832
2
104891
0.000813
$
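As an aside, if you did want to keep a loop inside the kernel (for example so that a fixed-size grid can cover an arbitrarily large array), the usual pattern is a grid-stride loop, where each thread starts at its global index and steps by the total number of threads in the grid, rather than the overlapping per-thread ranges in the original kernel. A rough sketch of that variant (not the code timed above):

@cuda.jit
def count_array_cuda_strided(local_device_array, pivot_point, local_device_output_array):
    start = cuda.grid(1)       # absolute index of this thread across the whole grid
    stride = cuda.gridsize(1)  # total number of threads in the grid
    for i in range(start, local_device_output_array.size, stride):
        if (pivot_point - 0.05) < local_device_array[i] < (pivot_point + 0.05):
            local_device_output_array[i] = True
        else:
            local_device_output_array[i] = False

With that pattern each element is visited exactly once, no matter how many blocks you launch.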