
Slicing a 300MB CuPy array is ~5x slower than NumPy


My code involves slicing into 432x432x400 arrays a total of ~10 million times to generate batches of data for neural network training. As these are fairly large arrays (~75 million data points / 300 MB), I was hoping to speed this up using CuPy (and maybe even speed up training by generating the data on the same GPU used for training), but found it actually made the code about 5x slower.

Is this expected behaviour due to CuPy overheads or am I missing something?

Code to reproduce:

import cupy as cp
import numpy as np
import timeit
cp_arr = cp.zeros((432, 432, 400), dtype=cp.float32)
np_arr = np.zeros((432, 432, 400), dtype=np.float32)

# numbers below are representative of my code
cp_code = 'arr2 = cp_arr[100:120, 100:120, 100:120]'
np_code = 'arr2 = np_arr[100:120, 100:120, 100:120]'

timeit.timeit(cp_code, number=8192*4, globals=globals())  # prints 0.122
timeit.timeit(np_code, number=8192*4, globals=globals())  # prints 0.027

Setup:

  • GPU: NVIDIA Quadro P4000

  • CuPy Version: 7.3.0

  • OS: CentOS Linux 7

  • CUDA Version: 10.1

  • cuDNN Version: 7.6.5


Solution

  • Slicing in NumPy and CuPy does not actually copy the data anywhere; it simply returns a new array where the data is the same, but with its pointer offset to the first element of the new slice and the shape adjusted accordingly. Note below how both the original array and the slice have the same strides:

    In [1]: import cupy as cp
    
    In [2]: a = cp.zeros((432, 432, 400), dtype=cp.float32)
    
    In [3]: b = a[100:120, 100:120, 100:120]
    
    In [4]: a.strides
    Out[4]: (691200, 1600, 4)
    
    In [5]: b.strides
    Out[5]: (691200, 1600, 4)
    

    The same can be verified by replacing CuPy with NumPy.
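
    As a quick sanity check, here is a minimal sketch (using the array shape from the question) that confirms the slice shares memory with the original array instead of copying it. It relies on NumPy's np.shares_memory and CuPy's .data.ptr device-pointer attribute:

    import numpy as np
    import cupy as cp

    np_arr = np.zeros((432, 432, 400), dtype=np.float32)
    np_view = np_arr[100:120, 100:120, 100:120]
    print(np.shares_memory(np_arr, np_view))  # True: the slice is a view, not a copy

    cp_arr = cp.zeros((432, 432, 400), dtype=cp.float32)
    cp_view = cp_arr[100:120, 100:120, 100:120]
    # The slice's device pointer is just an offset into the original allocation,
    # computed from the strides shown above: 100*691200 + 100*1600 + 100*4 bytes.
    print(cp_view.data.ptr - cp_arr.data.ptr == 100*691200 + 100*1600 + 100*4)  # True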

    If you want to time the actual slicing operation, the most reliable way is to add a .copy() to each operation, thus forcing the memory to actually be accessed/copied:

    cp_code = 'arr2 = cp_arr[100:120, 100:120, 100:120].copy()'  # 0.771 seconds
    np_code = 'arr2 = np_arr[100:120, 100:120, 100:120].copy()'  # 0.154 seconds
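
    For reference, here is a self-contained version of that timing (a sketch reusing the setup from the question; exact numbers will vary by hardware). Note that CuPy kernel launches are asynchronous, so appending a device synchronize to the timed statement gives a stricter wall-clock measurement:

    import timeit
    import numpy as np
    import cupy as cp

    cp_arr = cp.zeros((432, 432, 400), dtype=cp.float32)
    np_arr = np.zeros((432, 432, 400), dtype=np.float32)
    n = 8192 * 4

    # Forcing a copy makes the timing reflect actual memory traffic.
    cp_copy = 'arr2 = cp_arr[100:120, 100:120, 100:120].copy()'
    np_copy = 'arr2 = np_arr[100:120, 100:120, 100:120].copy()'
    print(timeit.timeit(cp_copy, number=n, globals=globals()))
    print(timeit.timeit(np_copy, number=n, globals=globals()))

    # Stricter GPU variant: synchronize so each iteration waits for its copy to finish.
    cp_copy_sync = cp_copy + '; cp.cuda.Device().synchronize()'
    print(timeit.timeit(cp_copy_sync, number=n, globals=globals()))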
    

    Unfortunately, in the case above the memory access pattern is bad for GPUs: the small chunks can't saturate the memory channels, so it is still slower than NumPy. However, CuPy can be much faster when the copied chunks come close to saturating the memory channels, for example:

    cp_code = 'arr2 = cp_arr[:, 100:120, 100:120].copy()'  # 0.786 seconds
    np_code = 'arr2 = np_arr[:, 100:120, 100:120].copy()'  # 2.911 seconds
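
    To see the crossover on your own hardware, here is a small self-contained comparison (a sketch; the slice patterns and iteration count are taken from above, and the exact timings depend on the GPU and its memory bandwidth):

    import timeit
    import numpy as np
    import cupy as cp

    cp_arr = cp.zeros((432, 432, 400), dtype=cp.float32)
    np_arr = np.zeros((432, 432, 400), dtype=np.float32)

    # A small cube vs. a full-axis slab: the slab moves far more bytes per kernel
    # launch, which lets the GPU's memory channels do useful work.
    slices = {
        'small cube': '[100:120, 100:120, 100:120]',
        'full-axis slab': '[:, 100:120, 100:120]',
    }

    for name, idx in slices.items():
        t_cp = timeit.timeit(f'cp_arr{idx}.copy()', number=8192 * 4, globals=globals())
        t_np = timeit.timeit(f'np_arr{idx}.copy()', number=8192 * 4, globals=globals())
        print(f'{name}: CuPy {t_cp:.3f}s vs NumPy {t_np:.3f}s')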