Since we can't call print inside @cuda.jit, and trying to print cuda.to_device(A) results in <numba.cuda.cudadrv.devicearray.DeviceNDArray at 0x7f2c5c0605e0>, I didn't think we could print anything from the GPU. However, we can print a single element.
import numpy as np
from numba import cuda
A = np.random.randn(1000, 1000)
A_gpu = cuda.to_device(A)
A_gpu[0][0]
-1.0404635120476469
I was wondering whether the number had to be copied to the CPU before being printed, so I tried timing it.
%timeit A[0][0]
%timeit A_gpu[0][0]
231 ns ± 5.85 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
166 µs ± 25.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Accessing an element on the GPU is about a thousand times slower than on the CPU. However, we can also print the shape, and that is even a little faster on the GPU, so I doubt anything had to go through the CPU just to be printed.
%timeit A.shape
%timeit A_gpu.shape
78.1 ns ± 1.1 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
58.8 ns ± 19.3 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Why would accessing an element on the GPU be slower, and is it a problem if we are doing it inside a @cuda.jit function, or is it optimized? (If the GPU has 1000 cores and the array has size 1000*1000, the 1000 cores would access 1000 elements all at once, 1000 times over, which would add a non-negligible 166 µs * 1000.)
@cuda.jit
def add_one_gpu(A):
    x, y = cuda.grid(2)
    m, n = A.shape
    if x < m and y < n:
        A[x, y] += 1
In my view, this capability:
A_gpu[0][0]
in host (numba) python code is provided mostly as a convenience feature in Numba CUDA. Such a thing is not possible using ordinary device memory in CUDA C++. What numba is doing is copying the data back for you, from device to host, and then printing it. The implicit copy operation, not unlike the implicit kernel launch in your previous question, has significant overhead.
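To make that implicit copy concrete, here is a rough sketch of my own (not from the original post), using the A_gpu from the question: each host-side element access pays its own small device->host transfer, whereas a single bulk copy_to_host() followed by ordinary numpy indexing pays the transfer cost once.
# every host-side access of A_gpu triggers its own device->host transfer
total = 0.0
for i in range(10):
    total += A_gpu[i][i]
# one bulk copy back to the host, then plain numpy indexing (no further transfers)
A_host = A_gpu.copy_to_host()
total = sum(A_host[i][i] for i in range(10))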
Of course A_gpu
here is a complex pythonic object with multiple attributes, some of which (such as the shape) are directly accessible in host numba python code. They don't require a device->host copy operation or its overhead.
By convention, in numba CUDA, if you print just the name of this object:
print(A_gpu)
you get some summary information about the object. You don't get a dump of the data contents of the object, like you would if A_gpu
were an ordinary numpy array.
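As a small illustration (mine, not the answer's), copying the data back is what gives you the familiar numpy-style dump of the values:
print(A_gpu)                  # summary only, e.g. <numba.cuda.cudadrv.devicearray.DeviceNDArray ...>
print(A_gpu.copy_to_host())   # the actual values, as a regular numpy array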
You shouldn't draw conclusions about device code behavior (the stuff inside a decorated @cuda.jit
function) from your host code experiments. It will not require 166 µs per access. There is also no point in trying to infer device code performance using this methodology. Write some actual device code and test or benchmark it.
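For example, a rough benchmarking sketch along those lines (the launch configuration and timing approach are my own assumptions, not part of the answer), using the add_one_gpu kernel and A_gpu from the question: kernel launches are asynchronous, so synchronize before stopping the clock, and note that the first launch includes JIT compilation.
import math, time
threads = (16, 16)
blocks = (math.ceil(1000 / threads[0]), math.ceil(1000 / threads[1]))
add_one_gpu[blocks, threads](A_gpu)   # warm-up launch (includes JIT compilation)
cuda.synchronize()
start = time.perf_counter()
add_one_gpu[blocks, threads](A_gpu)
cuda.synchronize()                    # wait for the kernel to actually finish
print((time.perf_counter() - start) * 1e6, "µs to update all 1000*1000 elements")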
Properly written device code that is well organized for bulk coalesced access should be able to access device memory at a rate of hundreds of gigabytes per second, when measured in aggregate, or higher. Taking the inverse of that, an 8-byte quantity would correspond to an access time of 80 picoseconds, or lower. (However, in my view, this is not the way to think about accesses or discuss performance.)
> Since we can't call print inside @cuda.jit

AFAIK, some limited printing from device code is supported in numba cuda.
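For instance, a minimal sketch (mine) of printing from inside a kernel; device-side print in numba CUDA only accepts simple scalar and string arguments, so formatting options are limited, and the output only appears once the device synchronizes:
@cuda.jit
def show_first(A):
    if cuda.grid(1) == 0:     # let a single thread do the printing
        print(A[0, 0])

show_first[1, 32](A_gpu)
cuda.synchronize()            # flush buffered device-side output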