I didn't think we could print anything from the GPU, since calling print inside a @cuda.jit function doesn't work, but then I tried calling A_gpu.shape to see what would happen.
>>> import numpy as np
>>> from numba import cuda
>>> A = np.random.randn(1000, 1000)
>>> A_gpu = cuda.to_device(A)
>>> A_gpu.shape
(1000, 1000)
>>> A_gpu[0][0]
0.4253498653987585
>>> A_gpu.T
<numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f5de810ffa0>
For something to be printed to the console, do the numbers need to be copied to the CPU first?
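If it helps clarify what I mean, I assume the explicit version of that copy would look something like this (copy_to_host is, as far as I understand, the DeviceNDArray method for an explicit device-to-host copy):

A_host = A_gpu.copy_to_host()   # explicit device-to-host transfer, returns a NumPy array
print(A_host[0, 0])             # ordinary CPU-side print
# Indexing A_gpu[0][0] from the host presumably triggers the same kind of transfer,
# just one element at a time.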
%timeit A.T
132 ns ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit A_gpu.T
159 ms ± 29.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit A.shape
76 ns ± 2.37 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit A_gpu.shape
47.8 ns ± 8.81 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit A[0][0]
376 ns ± 146 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit A_gpu[0][0]
161 µs ± 25.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Calling .shape is faster on the GPU array for some reason, but the other operations are slower. However, it might be the case that accessing elements A[i, j] inside a @cuda.jit kernel is optimized and isn't actually slower.
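To make explicit what I mean by element access inside a kernel, here is a minimal sketch (my own kernel, nothing measured yet):

from numba import cuda

@cuda.jit
def copy_kernel(A, out):
    # Each thread reads one element of A directly from device memory;
    # unlike A_gpu[0][0] on the host, no per-access device-to-host transfer happens.
    i, j = cuda.grid(2)
    if i < A.shape[0] and j < A.shape[1]:
        out[i, j] = A[i, j]

# launched e.g. as copy_kernel[(32, 32), (32, 32)](A_gpu, out_gpu) for a 1000 x 1000 array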
I am implementing a CUDA kernel for matrix multiplication, with the intention of using it for backpropagation in neural networks, which means dL_dX = np.dot(dL_dY, self.weights.T) will be performed very often.
If I need to transpose a matrix, I was wondering whether it's bad practice to transpose on the GPU, as in matrix_multiplication_gpu[blocks_per_grid, threads_per_block](A_gpu, B_gpu.T), and whether it would be better to transpose the matrix on the CPU first and then move/"cache" the result to the GPU with cuda.to_device(A.T). Interestingly, moving the already-transposed array to the GPU (%timeit cuda.to_device(A.T)) is much faster, at 2.41 ms ± 145 µs, than transposing the array on the GPU.
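Spelled out, the two options I'm comparing look roughly like this (matrix_multiplication_gpu, blocks_per_grid and threads_per_block are my own kernel and launch configuration, and B/B_gpu are set up the same way as A/A_gpu):

# Option 1: transpose on the GPU; B_gpu.T produces a new device array
# before the multiplication.
matrix_multiplication_gpu[blocks_per_grid, threads_per_block](A_gpu, B_gpu.T)

# Option 2: transpose on the CPU (a cheap stride change in NumPy) and copy
# the already-transposed data to the device once, reusing it afterwards.
B_T_gpu = cuda.to_device(B.T)
matrix_multiplication_gpu[blocks_per_grid, threads_per_block](A_gpu, B_T_gpu)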
Numba's GPU array transpose runs a GPU kernel. That is why it is slow compared to NumPy, which generally just changes the strides and doesn't touch the underlying data.

The canonical way to perform a dot product involving a transposed matrix or matrices (dating back to the origins of LINPACK and BLAS) is to change the algorithm to read the input in transposed order, rather than actually transposing the input data before performing the product.
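For example, a rough sketch of a naive (non-tiled) Numba kernel computing C = A @ B.T that reads B in transposed order instead of transposing it first (names are illustrative):

from numba import cuda

@cuda.jit
def matmul_b_transposed(A, B, C):
    # Computes C = A @ B.T without ever materialising B.T:
    # the inner loop reads B[j, k] where a plain matmul would read B_T[k, j].
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        acc = 0.0
        for k in range(A.shape[1]):
            acc += A[i, k] * B[j, k]
        C[i, j] = acc

# Launch like any other kernel, e.g.
# matmul_b_transposed[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu),
# assuming A is (m, n), B is (p, n) and C is (m, p).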