I am hoping to move my custom camera video pipeline to use video memory with a combination of numba and cupy and avoid passing data back to the host memory if at all possible. As part of doing this I need to port my sharpness detection routine to use cuda. The easiest way to do this seemed to be to use cupy as essential all I do is compute the variance of a laplace transform of each image. The trouble I am hitting is the cupy variance computation appears to be ~ 8x slower than numpy. This includes the time it takes to copy the device ndarray to the host and perform the variance computation on the cpu using numpy. I am hoping to gain a better understanding of why the variance computation ReductionKernel employed by cupy on the GPU is so much slower. I'll start by including the test I ran below.
import cupy as cp
import numpy as np
from numba import cuda
import cv2
from timeit import default_timer as timer
n_runs = 10
n_warmup = 2
n_tot = n_runs + n_warmup
sizes = (100, 500, 1000, 2000, 5000, 10000)
# NumPy
for s in sizes:
t_cp = np.zeros(n_tot)
for n in range(n_tot):
np.random.seed(0)
x = np.random.randn(*(s,s)).astype(np.uint16)
t_start = timer()
_ = np.var(x)
t_cp[n] = timer() - t_start
t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
print(f'NumPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}')
# CuPy
for s in sizes:
t_cp = np.zeros(n_tot)
for n in range(n_tot):
np.random.seed(0)
x = np.random.randn(*(s,s)).astype(np.uint16)
x_nb = cuda.to_device(x)
cp.cuda.Stream.null.synchronize()
t_start = timer()
x_cp = cp.asarray(x_nb)
_ = cp.var(x_cp)
cp.cuda.Stream.null.synchronize()
t_cp[n] = timer() - t_start
t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
print(f'CuPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}')
The output of this script is as follows
NumPy: 100x100 0.00006 +- 0.00000
NumPy: 500x500 0.00053 +- 0.00008
NumPy: 1000x1000 0.00243 +- 0.00029
NumPy: 2000x2000 0.01073 +- 0.00136
NumPy: 5000x5000 0.07470 +- 0.00264
NumPy: 10000x10000 0.28578 +- 0.00313
...
CuPy: 100x100 0.00026 +- 0.00002
CuPy: 500x500 0.00463 +- 0.00068
CuPy: 1000x1000 0.02308 +- 0.00060
CuPy: 2000x2000 0.07876 +- 0.00172
CuPy: 5000x5000 0.50830 +- 0.02237
CuPy: 10000x10000 1.98131 +- 0.03540
I also tried using float16 and float32 instead of uint16 (as it looks like the reduction kernels used by cupy work with floats but it didn't change the disparity in any meaningful way for me).
https://github.com/cupy/cupy/blob/04bf15706474a5e79ba196e70784a147cad6e26e/cupy/_core/_routines_statistics.pyx#L542
Here are some versions associate with my working environment.
>>> numpy.__version__
'1.18.5'
>>> cupy.__version__
'9.6.0'
>>> numba.__version__
'0.53.1'
Python 3.6.9
Driver Version: 470.82.01
CUDA Version: 11.4
Any tips about what might be causing cupy to perform so poorly would be appreciated. Also if there are any things that can be done to improve the performance I'de love to know. I tried reading about ReductionKernels to understand how to optimize them but it's over my head. https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf
updated results based on feedback from @myrtlecat
with dtype np.float32
CuPy (1 axis at a time): 100x100 0.00017 +- 0.00002
CuPy (1 axis at a time): 500x500 0.00041 +- 0.00001
CuPy (1 axis at a time): 1000x1000 0.00097 +- 0.00003
CuPy (1 axis at a time): 2000x2000 0.00278 +- 0.00016
CuPy (1 axis at a time): 5000x5000 0.01381 +- 0.00041
CuPy (1 axis at a time): 10000x10000 0.04313 +- 0.00355
CuPy: 100x100 0.00013 +- 0.00000
CuPy: 500x500 0.00432 +- 0.00013
CuPy: 1000x1000 0.01713 +- 0.00070
CuPy: 2000x2000 0.06713 +- 0.00079
CuPy: 5000x5000 0.41975 +- 0.00259
CuPy: 10000x10000 1.69374 +- 0.01938
NumPy: 100x100 0.00004 +- 0.00000
NumPy: 500x500 0.00022 +- 0.00001
NumPy: 1000x1000 0.00121 +- 0.00018
NumPy: 2000x2000 0.00530 +- 0.00047
NumPy: 5000x5000 0.03432 +- 0.00179
NumPy: 10000x10000 0.14227 +- 0.00503
with dtype np.uint16
CuPy (1 axis at a time): 100x100 0.00018 +- 0.00000
CuPy (1 axis at a time): 500x500 0.00124 +- 0.00003
CuPy (1 axis at a time): 1000x1000 0.00430 +- 0.00020
CuPy (1 axis at a time): 2000x2000 0.01537 +- 0.00008
CuPy (1 axis at a time): 5000x5000 0.07413 +- 0.00330
CuPy (1 axis at a time): 10000x10000 0.29842 +- 0.01291
CuPy: 100x100 0.00016 +- 0.00000
CuPy: 500x500 0.00359 +- 0.00053
CuPy: 1000x1000 0.01952 +- 0.00058
CuPy: 2000x2000 0.07719 +- 0.00076
CuPy: 5000x5000 0.48284 +- 0.00169
CuPy: 10000x10000 1.96746 +- 0.04353
NumPy: 100x100 0.00006 +- 0.00002
NumPy: 500x500 0.00053 +- 0.00010
NumPy: 1000x1000 0.00224 +- 0.00016
NumPy: 2000x2000 0.00956 +- 0.00034
NumPy: 5000x5000 0.06818 +- 0.00210
NumPy: 10000x10000 0.27071 +- 0.00747
updated script based on feedback from @myrtlecat
import cupy as cp
import numpy as np
from timeit import default_timer as timer
n_runs = 10
n_warmup = 2
n_tot = n_runs + n_warmup
sizes = (100, 500, 1000, 2000, 5000, 10000)
dtype = np.uint16
# dtype = np.float32
def mean(x):
while x.size > 1:
x = x.mean(-1)
return x
def var(x):
return mean((x - mean(x)) ** 2)
# CuPy (1 axis at a time)
for s in sizes:
t_cp = np.zeros(n_tot)
for n in range(n_tot):
# np.random.seed(0)
x = np.random.randn(*(s, s)).astype(dtype)
x_cp = cp.asarray(x)
cp.cuda.Stream.null.synchronize()
t_start = timer()
_ = var(x_cp)
cp.cuda.Stream.null.synchronize()
t_cp[n] = timer() - t_start
t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
print(f"CuPy (1 axis at a time): {s}x{s} {t_mean:.5f} +- {t_std:.5f}")
# CuPy
for s in sizes:
t_cp = np.zeros(n_tot)
for n in range(n_tot):
# np.random.seed(0)
x = np.random.randn(*(s, s)).astype(dtype)
x_cp = cp.asarray(x)
cp.cuda.Stream.null.synchronize()
t_start = timer()
_ = cp.var(x_cp)
cp.cuda.Stream.null.synchronize()
t_cp[n] = timer() - t_start
t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
print(f"CuPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}")
# NumPy
for s in sizes:
t_cp = np.zeros(n_tot)
for n in range(n_tot):
# np.random.seed(0)
x = np.random.randn(*(s, s)).astype(dtype)
t_start = timer()
_ = np.var(x)
t_cp[n] = timer() - t_start
t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
print(f"NumPy: {s}x{s} {t_mean:.5f} +- {t_std:.5f}")
last updated script and results based on feedback from @myrtlecat where he said that it may be an issue with 2d arrays so I tried using reshape to flatten the arrays before calling "var"
# CuPy (flattened)
for s in sizes:
t_cp = np.zeros(n_tot)
for n in range(n_tot):
# np.random.seed(0)
x = np.random.randn(*(s, s)).astype(dtype)
x_cp = cp.asarray(x)
cp.cuda.Stream.null.synchronize()
t_start = timer()
_ = var(x_cp.reshape((s * s,)))
cp.cuda.Stream.null.synchronize()
t_cp[n] = timer() - t_start
t_mean, t_std = t_cp[n_warmup:].mean(), t_cp[n_warmup:].std()
print(f"CuPy (flattened): {s}x{s} {t_mean:.5f} +- {t_std:.5f}")
the results
for uint16
CuPy (flattened): 100x100 0.00018 +- 0.00006
CuPy (flattened): 500x500 0.00107 +- 0.00002
CuPy (flattened): 1000x1000 0.00414 +- 0.00020
CuPy (flattened): 2000x2000 0.01550 +- 0.00036
CuPy (flattened): 5000x5000 0.10017 +- 0.00525
CuPy (flattened): 10000x10000 0.39470 +- 0.01606
CuPy (1 axis at a time): 100x100 0.00026 +- 0.00008
CuPy (1 axis at a time): 500x500 0.00104 +- 0.00005
CuPy (1 axis at a time): 1000x1000 0.00368 +- 0.00028
CuPy (1 axis at a time): 2000x2000 0.01364 +- 0.00055
CuPy (1 axis at a time): 5000x5000 0.07639 +- 0.00311
CuPy (1 axis at a time): 10000x10000 0.29405 +- 0.00419
for float32
CuPy (flattened): 100x100 0.00015 +- 0.00007
CuPy (flattened): 500x500 0.00043 +- 0.00001
CuPy (flattened): 1000x1000 0.00159 +- 0.00003
CuPy (flattened): 2000x2000 0.00631 +- 0.00030
CuPy (flattened): 5000x5000 0.03827 +- 0.00135
CuPy (flattened): 10000x10000 0.13173 +- 0.00579
CuPy (1 axis at a time): 100x100 0.00020 +- 0.00005
CuPy (1 axis at a time): 500x500 0.00030 +- 0.00004
CuPy (1 axis at a time): 1000x1000 0.00077 +- 0.00002
CuPy (1 axis at a time): 2000x2000 0.00215 +- 0.00002
CuPy (1 axis at a time): 5000x5000 0.01387 +- 0.00049
CuPy (1 axis at a time): 10000x10000 0.04099 +- 0.00142
So it seems like for both uint16 and float32 the results are faster using the 1 axis at a time technique suggested, rather than flattening the 2d array with reshape. Although the reshape is significantly faster than passing the 2d array.
I have a partial hypothesis about the problem (not a full explanation) and a work-around. Perhaps someone can fill in the gaps. I've used a quicker-and-dirtier benchmark, for brevity's sake.
Cupy is much faster when reduction is performed on one axis at a time. In stead of:
x.sum()
prefer this:
x.sum(-1).sum(-1).sum(-1)...
Note that the results of these computations may differ due to rounding error.
Here are faster mean
and var
functions:
def mean(x):
while x.size > 1:
x = x.mean(-1)
return x
def var(x):
return mean((x - mean(x)) ** 2)
import numpy as np
import cupy as cp
from timeit import timeit
def mean(x):
while x.size > 1:
x = x.mean(-1)
return x
def var(x):
return mean((x - mean(x)) ** 2)
def benchmark(label, f):
t = timeit(f, number=10) / 10
print(f"{label.ljust(25)}{t:0.4f}s")
x = np.random.rand(5000, 5000).astype('f4')
x_cp = cp.array(x)
benchmark("Numpy", lambda: x.var())
benchmark("Cupy", lambda: float(x_cp.var()))
benchmark("Cupy (1 axis at a time)", lambda: float(var(x_cp)))
Yielding more than 100x speed-up:
Numpy 0.0469s
Cupy 0.3805s
Cupy (1 axis at a time) 0.0013s
The answers are close but not identical due to rounding:
> x.var(), float(x_cp.var()), float(var(x_cp))
(0.08334004, 0.08333975821733475, 0.08334004133939743)
CUDA-capable GPUs are generally slower than modern CPUs when operating on a single thread, but they achieve high throughput by performing many identical operations in parallel.
Reduction like sum
or var
can be parallelised by:
For a sufficiently large input array, if the chunk size is chosen well, this will be close to optimal performance. (Note: for small arrays, or squeeze every last drop of performance, requires more advanced techniques like the slides OP linked to).
I think that cupy should be applying this technique to all reductions (my reading of the cupy code is that it does do this). However, for whatever reason, it does a good job of parallelising reductions over a single axis, but not over an entire array. I doubt that this is intended behaviour, but I am not a dev on cupy and so perhaps it is. The generated CUDA code and parameters used to invoke the reduction kernels are quite buried, so I'm not sure exactly what is happening in each case.
Result of running the updated benchmark script for uint16
:
CuPy (1 axis at a time): 100x100 0.00016 +- 0.00001
CuPy (1 axis at a time): 500x500 0.00029 +- 0.00000
CuPy (1 axis at a time): 1000x1000 0.00070 +- 0.00000
CuPy (1 axis at a time): 2000x2000 0.00242 +- 0.00001
CuPy (1 axis at a time): 5000x5000 0.01410 +- 0.00001
CuPy (1 axis at a time): 10000x10000 0.05145 +- 0.00149
CuPy: 100x100 0.00016 +- 0.00000
CuPy: 500x500 0.00316 +- 0.00000
CuPy: 1000x1000 0.01250 +- 0.00001
CuPy: 2000x2000 0.07283 +- 0.00290
CuPy: 5000x5000 0.44025 +- 0.00012
CuPy: 10000x10000 1.76455 +- 0.00190
NumPy: 100x100 0.00004 +- 0.00000
NumPy: 500x500 0.00056 +- 0.00001
NumPy: 1000x1000 0.00201 +- 0.00001
NumPy: 2000x2000 0.01066 +- 0.00005
NumPy: 5000x5000 0.08828 +- 0.00007
NumPy: 10000x10000 0.35403 +- 0.00064
so about 7x speed-up over numpy, and and for float32
:
CuPy (1 axis at a time): 100x100 0.00016 +- 0.00001
CuPy (1 axis at a time): 500x500 0.00018 +- 0.00001
CuPy (1 axis at a time): 1000x1000 0.00021 +- 0.00000
CuPy (1 axis at a time): 2000x2000 0.00052 +- 0.00000
CuPy (1 axis at a time): 5000x5000 0.00232 +- 0.00001
CuPy (1 axis at a time): 10000x10000 0.00837 +- 0.00002
CuPy: 100x100 0.00015 +- 0.00000
CuPy: 500x500 0.00281 +- 0.00001
CuPy: 1000x1000 0.01657 +- 0.00026
CuPy: 2000x2000 0.06557 +- 0.00003
CuPy: 5000x5000 0.37905 +- 0.00007
CuPy: 10000x10000 1.51899 +- 0.01084
NumPy: 100x100 0.00003 +- 0.00000
NumPy: 500x500 0.00032 +- 0.00000
NumPy: 1000x1000 0.00115 +- 0.00001
NumPy: 2000x2000 0.00581 +- 0.00010
NumPy: 5000x5000 0.04730 +- 0.00009
NumPy: 10000x10000 0.19188 +- 0.00024
about 40x speed-up over numpy.
Results will be platform-dependent: for comparison, my CPU is Intel(R) Core(TM) i9-7940X CPU @ 3.10GHz, and my GPU is NVIDIA GeForce GTX 1080 Ti.