I didn't think we could print anything from the GPU, since calling print inside a @cuda.jit function doesn't work, but then I tried calling A_gpu.shape to see what would happen.
>>> import numpy as np
>>> from numba import cuda
>>> A = np.random.randn(1000, 1000)
>>> A_gpu = cuda.to_device(A)
>>> A_gpu.shape
(1000, 1000)
>>> A_gpu[0][0]
0.4253498653987585
>>> A_gpu.T
<numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f5de810ffa0>
For something to be printed to the console, do the numbers need to be copied to the CPU first?
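If it helps clarify what I mean, I assume the explicit version of that copy would look something like this (copy_to_host is, as far as I understand, the DeviceNDArray method for an explicit device-to-host copy):

A_host = A_gpu.copy_to_host()   # explicit device-to-host transfer, returns a NumPy array
print(A_host[0, 0])             # ordinary CPU-side print
# Indexing A_gpu[0][0] from the host presumably triggers the same kind of transfer,
# just one element at a time.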
%timeit A.T
132 ns ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit A_gpu.T
159 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit A.shape
76 ns ± 2.37 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit A_gpu.shape
47.8 ns ± 8.81 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit A[0][0]
376 ns ± 146 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit A_gpu[0][0]
161 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Calling .shape is faster on the GPU array for some reason, but the other operations are slower. However, it might be the case that accessing elements A[i, j] inside a @cuda.jit kernel is optimized and isn't actually slower.
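To make explicit what I mean by element access inside a kernel, here is a minimal sketch (my own kernel, nothing measured yet):

from numba import cuda

@cuda.jit
def copy_kernel(A, out):
    # Each thread reads one element of A directly from device memory;
    # unlike A_gpu[0][0] on the host, no per-access device-to-host transfer happens.
    i, j = cuda.grid(2)
    if i < A.shape[0] and j < A.shape[1]:
        out[i, j] = A[i, j]

# launched e.g. as copy_kernel[(32, 32), (32, 32)](A_gpu, out_gpu) for a 1000 x 1000 array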
I am implementing a CUDA kernel for matrix multiplication, with the intention of using it for backpropagation in neural networks, which means dL_dX = np.dot(dL_dY, self.weights.T) will be performed very often.
If I need to transpose a matrix, I was wondering whether it's bad practice to transpose on the GPU, as in matrix_multiplication_gpu[blocks_per_grid, threads_per_block](A_gpu, B_gpu.T), and whether it would be better to transpose the matrix on the CPU first and then move/"cache" the result to the GPU with cuda.to_device(A.T). Interestingly, moving the already-transposed array to the GPU (%timeit cuda.to_device(A.T)) is much faster, at 2.41 ms ± 145 µs, than transposing the array on the GPU.
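Spelled out, the two options I'm comparing look roughly like this (matrix_multiplication_gpu, blocks_per_grid and threads_per_block are my own kernel and launch configuration, and B/B_gpu are set up the same way as A/A_gpu):

# Option 1: transpose on the GPU; B_gpu.T produces a new device array
# before the multiplication.
matrix_multiplication_gpu[blocks_per_grid, threads_per_block](A_gpu, B_gpu.T)

# Option 2: transpose on the CPU (a cheap stride change in NumPy) and copy
# the already-transposed data to the device once, reusing it afterwards.
B_T_gpu = cuda.to_device(B.T)
matrix_multiplication_gpu[blocks_per_grid, threads_per_block](A_gpu, B_T_gpu)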
Numba's GPU array transpose runs a GPU kernel. That is why it is slow compared to NumPy, which generally just changes the strides and doesn't touch the underlying data.

The canonical way to perform a dot product involving a transposed matrix or matrices (dating back to the origins of LINPACK and BLAS) is to change the algorithm to read the input in transposed order, rather than actually transposing the input data before performing the product.
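For example, a rough sketch of a naive (non-tiled) Numba kernel computing C = A @ B.T that reads B in transposed order instead of transposing it first (names are illustrative):

from numba import cuda

@cuda.jit
def matmul_b_transposed(A, B, C):
    # Computes C = A @ B.T without ever materialising B.T:
    # the inner loop reads B[j, k] where a plain matmul would read B_T[k, j].
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        acc = 0.0
        for k in range(A.shape[1]):
            acc += A[i, k] * B[j, k]
        C[i, j] = acc

# Launch like any other kernel, e.g.
# matmul_b_transposed[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu),
# assuming A is (m, n), B is (p, n) and C is (m, p).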