There are several threads covering similar topics, but unfortunately, these seem to be too complicated for me, so I would like to ask a similar question, hoping that someone will have a look at my code specifically to tell me if I got something wrong.
I am learning Numba CUDA right now, starting with the simple examples one can find online. I started with this tutorial:
https://github.com/ContinuumIO/gtc2017-numba/blob/master/4%20-%20Writing%20CUDA%20Kernels.ipynb
which shows how to add arrays in parallel. The system configuration they used to measure the timings is not given. To replicate the code, I use a GeForce GTX 1080 Ti and an Intel Core i7 8700K CPU.
I basically copied the addition script from the tutorial, but also added sequential code for comparison:
from numba import cuda
import numpy as np
import time
import math

@cuda.jit
def addition_kernel(x, y, out):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    block_size = cuda.blockDim.x
    grid_size = cuda.gridDim.x
    start = tx + ty * block_size
    stride = block_size * grid_size
    for i in range(start, x.shape[0], stride):
        out[i] = y[i] + x[i]

def add(n, x, y):
    for i in range(n):
        y[i] = y[i] + x[i]

if __name__ == "__main__":
    print(cuda.gpus[0])
    print("")
    n = 100000
    x = np.arange(n).astype(np.float32)
    y = 2 * x
    out = np.empty_like(x)
    x_device = cuda.to_device(x)
    y_device = cuda.to_device(y)
    out_device = cuda.device_array_like(x)
    # Set the number of threads in a block
    threadsperblock = 128
    # Calculate the number of thread blocks in the grid
    blockspergrid = 30  # math.ceil(n[0] / threadsperblock)
    # Now start the kernel
    start = time.process_time()
    cuda.synchronize()
    addition_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device)
    cuda.synchronize()
    end = time.process_time()
    out_global_mem = out_device.copy_to_host()
    print("parallel time: ", end - start)
    start = time.process_time()
    add(n, x, y)
    end = time.process_time()
    print("sequential time: ", end - start)
The parallel time is on average around 0.14 seconds, while the code without GPU kernel takes only 0.02 seconds.
This seems quite strange to me. Did I do anything wrong? Or is this problem not a good example for parallelism? (I would not think so, since the iterations of the for loop are independent and can run in parallel.)
What is odd is that I hardly notice a difference if I do not use the to_device() functions. As far as I understood, these should be important, as they avoid the communication between CPU and GPU after each iteration.
addition_kernel is compiled at runtime when it is called for the first time, so in the middle of your measured time! Compiling a kernel is a fairly expensive operation. You can force the compilation to be done eagerly (i.e. when the function is defined) by providing an explicit type signature to Numba.
Note that the arrays are a bit too small for you to see a big improvement on a GPU. Moreover, the comparison with the CPU version is not really fair: you should also use Numba for the CPU implementation, or at least NumPy (but not an interpreted pure-CPython loop).
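To illustrate the point about the CPU baseline, here is a minimal sketch that compares the interpreted loop from the question with a vectorized NumPy addition on the same data. The names `add_loop`, `y_loop` and `y_vec` are mine, and the timings will of course depend on the machine, so none are quoted here.

```python
import time
import numpy as np

def add_loop(n, x, y):
    # Interpreted element-by-element loop, as in the question
    for i in range(n):
        y[i] = y[i] + x[i]

n = 100000
x = np.arange(n).astype(np.float32)
y_loop = 2 * x  # input for the interpreted loop
y_vec = 2 * x   # separate copy for the vectorized version

start = time.perf_counter()
add_loop(n, x, y_loop)
loop_time = time.perf_counter() - start

start = time.perf_counter()
y_vec += x  # vectorized in-place addition, executed in compiled code
vec_time = time.perf_counter() - start

print("interpreted loop:", loop_time)
print("NumPy vectorized:", vec_time)
```

Both versions produce the same array, so the vectorized timing is the more meaningful CPU number to hold a GPU kernel against.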
Here is an example with explicit signatures for both the kernel and the CPU function:
import numba as nb
from numba import cuda

# The explicit signature makes Numba compile the kernel at definition time
@cuda.jit('void(float32[::1], float32[::1], float32[::1])')
def addition_kernel(x, y, out):
    tx = cuda.threadIdx.x
    ty = cuda.blockIdx.x
    block_size = cuda.blockDim.x
    grid_size = cuda.gridDim.x
    start = tx + ty * block_size
    stride = block_size * grid_size
    for i in range(start, x.shape[0], stride):
        out[i] = y[i] + x[i]

# Compiled CPU version, also compiled eagerly thanks to the signature
@nb.njit('void(int64, float32[::1], float32[::1])')
def add(n, x, y):
    for i in range(n):
        y[i] = y[i] + x[i]
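If you prefer not to write signatures, another common way to keep one-time costs out of the measurement is to run the function once before timing it. Below is a minimal sketch of that idea using only the standard library. `time_call` is a hypothetical helper of mine, and `lazy_add` is just a stand-in that simulates a function whose first call is slow; it is not Numba code. The sketch uses `time.perf_counter()`, which measures wall-clock time, whereas `time.process_time()` only counts the CPU time of the process.

```python
import time

def time_call(fn, *args, warmup=1, repeats=5, sync=None):
    """Return the best wall-clock time in seconds of fn(*args) over `repeats` runs.

    The `warmup` calls run first and are not timed, so one-time costs are excluded.
    `sync`, if given, is called before the clock is read (useful for asynchronous work).
    """
    for _ in range(warmup):
        fn(*args)
    if sync is not None:
        sync()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        if sync is not None:
            sync()
        best = min(best, time.perf_counter() - start)
    return best

# Stand-in for a lazily compiled function: only the first call pays a one-time cost
_compiled = {}

def lazy_add(a, b):
    if "add" not in _compiled:
        time.sleep(0.05)  # simulated compilation delay (arbitrary value)
        _compiled["add"] = lambda p, q: p + q
    return _compiled["add"](a, b)

start = time.perf_counter()
lazy_add(1, 2)  # first call includes the simulated compilation
first_call = time.perf_counter() - start

steady = time_call(lazy_add, 1, 2, warmup=0)  # already warmed up by the call above

print(f"first call: {first_call:.4f} s, steady state: {steady:.6f} s")
```

With the kernel from the question, the same helper could be used along the lines of `time_call(lambda: addition_kernel[blockspergrid, threadsperblock](x_device, y_device, out_device), sync=cuda.synchronize)`. The `sync` argument matters there because kernel launches are asynchronous.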