How to slice rows in Numba CUDA?

I am a beginner with Numba, and I am having difficulty rearranging the rows of an array on the GPU.

With Numba on the CPU, for example, this can be done as follows:

from numba import njit
import numpy as np

@njit
def numba_cpu(A, B, ind):
    for i, t in enumerate(ind):
        B[i, :] = A[t, :]

ind = np.array([3, 2, 0, 1, 4])
A = np.random.rand(5, 3)
B = np.zeros((5, 3))
numba_cpu(A, B, ind)

But the same approach does not work with cuda.jit:

from numba import cuda
import numpy as np

@cuda.jit
def numba_gpu(A, B, ind):
    for i, t in enumerate(ind):
        B[i, :] = A[t, :]

d_ind = cuda.to_device(np.array([3, 2, 0, 1, 4]))
d_A = cuda.to_device(np.random.rand(5, 3))  # rand takes the dimensions as separate arguments, not a tuple
d_B = cuda.to_device(np.zeros((5, 3)))
numba_gpu[16,16](d_A, d_B, d_ind)

The program fails with many exceptions, including the message "NRT required but not enabled".

Of course I could use a nested loop to copy entry by entry, but that looks bad because I know that each row is stored in contiguous memory. Even a C-style memcpy would be better, but Numba does not seem to support memcpy.

Solution

I think I have found a solution myself. What I need is to manipulate NumPy-style arrays on a CUDA device. For this purpose, CuPy is a much better fit than Numba: it supports many NumPy-like operations (including the one in my question) efficiently and conveniently.
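A minimal sketch of the row reordering from the question with CuPy, using ordinary integer-array indexing. The alias xp and the fallback to NumPy are my own additions so that the snippet also runs on a machine without a usable GPU; with CuPy available, the arrays live on the device.

```python
import numpy as np

try:
    import cupy as xp
    xp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
except Exception:
    xp = np  # assumption: fall back to NumPy so the sketch still runs

ind = xp.asarray([3, 2, 0, 1, 4])
A = xp.random.rand(5, 3)

# Integer-array indexing: row i of B is row ind[i] of A.
B = A[ind]

# To fill an existing array instead of allocating a new one:
B2 = xp.zeros((5, 3))
B2[:] = A[ind]
```

With CuPy, cupy.asnumpy(B) copies the result back to the host if a NumPy array is needed.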