I chose matrix multiplication so that a decent amount of time is spent on the GPU, which makes it easier to fill up the launch queue.
from numba import cuda, float64
import numpy as np
import math

@cuda.jit
def matrix_multiplication(A, B, C):
    i, k = cuda.grid(2)
    m, n = A.shape
    _, p = B.shape
    if i < m and k < p:
        sum_of_products = 0
        for j in range(n):
            sum_of_products += A[i, j] * B[j, k]
        C[i, k] = sum_of_products
m = 1000
n = 1000
p = 1000
A = np.random.randn(m, n)
B = np.random.randn(n, p)
A_gpu = cuda.to_device(A)
B_gpu = cuda.to_device(B)
C_gpu = cuda.device_array((m, p))
threads_per_block = (16, 16)
blocks_per_grid = (math.ceil(C_gpu.shape[0] / threads_per_block[0]),
                   math.ceil(C_gpu.shape[1] / threads_per_block[1]))
%%timeit -n1 cuda.synchronize()
matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
The slowest run took 975.32 times longer than the fastest. This could mean that an intermediate result is being cached.
34.5 ms ± 83.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
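The big spread is most likely from the first timed run, which also JIT-compiles the kernel (Numba compiles a @cuda.jit kernel lazily on its first launch for a given set of argument types). A minimal warm-up sketch, assuming the arrays and launch configuration defined above, that can be run before the %%timeit cell so compilation is excluded from the timing:

# Warm-up: the first launch triggers JIT compilation of matrix_multiplication.
matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
cuda.synchronize()  # block until the warm-up kernel has actually finished

Re-running the %%timeit cell after this warm-up should show a much tighter distribution.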
I launched 1000 kernels and then compiled a @cuda.jit function.
import time
from numba import njit, prange

start = time.time()

for i in range(1000):
    matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
print(f'launched kernels: {time.time() - start}')

@njit((float64[:, :], float64[:, :], float64[:, :]))
def test(A, B, C):
    for i in prange(len(A)):
        for j in prange(len(B)):
            print(i, j)
print(f'compiled njit: {time.time() - start}')

@cuda.jit((float64[:, :], float64[:, :], float64[:, :]))
def matrix_multiplication2(A, B, C):
    i, k = cuda.grid(2)
    m, n = A.shape
    _, p = B.shape
    if i < m and k < p:
        sum_of_products = 0
        for j in range(n):
            sum_of_products += A[i, j] * B[j, k]
        C[i, k] = sum_of_products
print(f'compiled cuda.jit: {time.time() - start}')

cuda.synchronize()
print(f'synchronized: {time.time() - start}')
launched kernels: 0.06425857543945312
compiled njit: 0.3479502201080322
compiled cuda.jit: 30.61482858657837
synchronized: 30.61592411994934
I called synchronize to see whether the queue was empty after compiling the function. As a sanity check, I also compiled an ordinary @njit function to see whether it could be compiled while the stream is busy. A lot of time is spent compiling the @cuda.jit function (about 30 seconds, which is roughly how long the 1000 queued kernels take to finish), indicating that it is waiting for the queue to free up.
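One way to check this, as a sketch assuming the definitions above are still in scope (matrix_multiplication3 is just a fresh name so Numba has to compile it again), is to drain the queue first and only then compile an eagerly-typed kernel; if the delay really comes from the busy queue, this should finish almost immediately:

import time

cuda.synchronize()  # drain any queued kernels first, so the GPU is idle

start = time.time()

@cuda.jit((float64[:, :], float64[:, :], float64[:, :]))
def matrix_multiplication3(A, B, C):  # same body as above, fresh name so it is compiled again
    i, k = cuda.grid(2)
    m, n = A.shape
    _, p = B.shape
    if i < m and k < p:
        sum_of_products = 0
        for j in range(n):
            sum_of_products += A[i, j] * B[j, k]
        C[i, k] = sum_of_products

print(f'compiled cuda.jit on an idle GPU: {time.time() - start}')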
Why would the compilation of GPU code need to be enqueued? I thought it had to do with getting information about the hardware, but cuda.detect doesn't need to be enqueued.
import time
from numba import njit, prange

start = time.time()

for i in range(1000):
    matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
print(f'launched kernels: {time.time() - start}')

cuda.detect()
print(f'detected: {time.time() - start}')

cuda.synchronize()
print(f'synchronized: {time.time() - start}')
launched kernels: 0.06768441200256348
Found 1 CUDA devices
id 0 b'Tesla T4' [SUPPORTED]
Compute Capability: 7.5
PCI Device ID: 4
PCI Bus ID: 0
UUID: GPU-d73cd393-a65d-fe16-0c5a-2e1a9666cf9c
Watchdog: Disabled
FP32/FP64 Performance Ratio: 32
Summary:
1/1 devices are supported
detected: 0.06813859939575195
synchronized: 30.339908123016357
No, it isn’t “enqueued”. Compilation is done on the host by host-side libraries and doesn’t interact with the GPU at all.
However, getting the compiled code loaded onto the device requires several driver API functions which themselves require an idle GPU. If there are operations running on the GPU, those calls will block until the GPU is idle. That is why you see some effect from synchronisation: it makes the host wait until the GPU is idle.
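A sketch consistent with this explanation, assuming the arrays and launch configuration from the question are still in scope and a Numba version new enough to provide cuda.compile_ptx_for_current_device (matmul_kernel and sig are illustrative names): compiling to PTX is purely host-side work, so it should finish quickly even while the GPU is busy, because nothing is loaded onto the device; the wait should only appear at the final synchronize.

import time
from numba import cuda, float64

def matmul_kernel(A, B, C):  # plain Python function, compiled to PTX below
    i, k = cuda.grid(2)
    m, n = A.shape
    _, p = B.shape
    if i < m and k < p:
        s = 0.0
        for j in range(n):
            s += A[i, j] * B[j, k]
        C[i, k] = s

start = time.time()

for i in range(1000):
    matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
print(f'launched kernels: {time.time() - start}')

# Host-side compilation only: nothing is loaded onto the device yet.
sig = (float64[:, :], float64[:, :], float64[:, :])
ptx, resty = cuda.compile_ptx_for_current_device(matmul_kernel, sig)
print(f'compiled to PTX: {time.time() - start}')

cuda.synchronize()  # only here does the host wait for the queued kernels
print(f'synchronized: {time.time() - start}')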