The current searchsorted works perfectly with 1-D arrays:
import cupy as cp
ref = cp.arange(5, dtype=cp.float32)
secs = cp.arange(10, dtype=cp.float32)
cp.searchsorted(ref,secs)
The result is:
array([0, 1, 2, 3, 4, 5, 5, 5, 5, 5])
I want a function multi_searchsorted that searches each row of secs in the corresponding row of ref, with all rows processed in parallel. For example, for ref:
array([[0., 1., 2., 3., 4.],
[5., 6., 7., 8., 9.]], dtype=float32)
and secs:
array([[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
[5., 6., 7., 8., 9., 0., 1., 2., 3., 4.]], dtype=float32)
It will produce
array([[0, 1, 2, 3, 4, 5, 5, 5, 5, 5],
[0, 1, 2, 3, 4, 0, 0, 0, 0, 0]])
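For reference, the intended semantics can be sketched as a naive per-row loop (shown with numpy, whose searchsorted API matches cupy's; this only illustrates the behavior, it is not a fast implementation):

```python
import numpy as np  # numpy's searchsorted has the same API as cupy's

def naive_multi_searchsorted(ref, secs):
    # Call 1-D searchsorted once per row: correct but serial,
    # which is exactly what a fused version should avoid.
    out = np.empty(secs.shape, dtype=np.int64)
    for i in range(ref.shape[0]):
        out[i] = np.searchsorted(ref[i], secs[i])
    return out

ref = np.array([[0., 1., 2., 3., 4.],
                [5., 6., 7., 8., 9.]], dtype=np.float32)
secs = np.array([[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
                 [5., 6., 7., 8., 9., 0., 1., 2., 3., 4.]], dtype=np.float32)
print(naive_multi_searchsorted(ref, secs))
# [[0 1 2 3 4 5 5 5 5 5]
#  [0 1 2 3 4 0 0 0 0 0]]
```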
Does anybody know of any open source code that implements this? If I have to write it myself, do I need to write a kernel from scratch, or can I reuse the existing searchsorted function in some way? Any other suggestions are welcome.
pytorch has an nD searchsorted, and there is a smart way to convert the nD problem to a 1D one, as @robert-crovella mentioned. I wrote two versions of the function to test their speed:
import cupy as cp
import torch
from cupyx.profiler import benchmark
The first one is basically a simple wrapper around pytorch's searchsorted:
def torch_searchsorted(ref, sec):
    # Convert the cupy arrays to torch tensors
    _ref = torch.as_tensor(ref)
    _sec = torch.as_tensor(sec)
    indices = torch.searchsorted(_ref, _sec, side='right')
    # Convert the result back to a cupy array
    indices = cp.asarray(indices)
    return indices
The second is the smart way:
def cupy_searchsorted(a, b):
    m, n = a.shape
    # Shift each row by a constant large enough that rows cannot overlap,
    # so a single 1-D searchsorted on the flattened arrays handles all rows
    max_num = cp.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num * cp.arange(a.shape[0])[:, None]
    p = cp.searchsorted((a + r).ravel(), (b + r).ravel(), side='right').reshape(m, -1)
    # Map the flat indices back to per-row indices
    return p - n * (cp.arange(m)[:, None])
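To see why the offset trick is valid, here is a small numpy walkthrough (the arrays are made up for illustration; numpy's searchsorted has the same API as cupy's): row i is shifted by i*max_num, so the flattened reference stays sorted and every shifted query lands inside its own row's segment of the flat array.

```python
import numpy as np  # same searchsorted API as cupy

a = np.array([[0., 1., 2.],
              [0., 1., 2.]], dtype=np.float32)  # sorted rows
b = np.array([[0.5, 2.5],
              [1.5, 0.5]], dtype=np.float32)
m, n = a.shape
# Offset chosen so the shifted rows cannot overlap
max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1  # 3.0
r = max_num * np.arange(m)[:, None]
# (a + r).ravel() is [0, 1, 2, 3, 4, 5]: globally sorted
flat = np.searchsorted((a + r).ravel(), (b + r).ravel(), side='right').reshape(m, -1)
# Row i's flat indices fall in [i*n, (i+1)*n]; subtract i*n to recover row-local indices
print(flat - n * np.arange(m)[:, None])
# [[1 3]
#  [2 1]]
```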
First test the correctness:
ref = cp.arange(20, dtype=cp.float32).reshape(4,5)
sec = cp.arange(-1,19, dtype=cp.float32).reshape(4,5)
torch_out = torch_searchsorted(ref,sec)
cupy_out = cupy_searchsorted(ref,sec)
The results are the same:
array([[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4],
[0, 1, 2, 3, 4]])
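As an extra sanity check (my addition, not part of the original tests), the same offset trick can be run on the CPU with a numpy port, and it reproduces the result above:

```python
import numpy as np

def np_multi_searchsorted(a, b):
    # CPU port of the cupy offset trick, with identical logic
    m, n = a.shape
    max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num * np.arange(m)[:, None]
    p = np.searchsorted((a + r).ravel(), (b + r).ravel(), side='right').reshape(m, -1)
    return p - n * np.arange(m)[:, None]

ref = np.arange(20, dtype=np.float32).reshape(4, 5)
sec = np.arange(-1, 19, dtype=np.float32).reshape(4, 5)
print(np_multi_searchsorted(ref, sec))
# [[0 1 2 3 4]
#  [0 1 2 3 4]
#  [0 1 2 3 4]
#  [0 1 2 3 4]]
```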
Then test the speed:
width = 100
nlines = 10000
ref = cp.arange(-1.1,-1.1+width*nlines, dtype=cp.float32).reshape(nlines,width)
sec = cp.arange(-1.5,-1.5+width*nlines, dtype=cp.float32).reshape(nlines,width)
print(benchmark(torch_searchsorted,(ref, sec), n_repeat=100))
print(benchmark(cupy_searchsorted,(ref, sec), n_repeat=100))
The result is quite surprising:
torch_searchsorted : CPU: 8305.727 us +/-737.419 (min: 8109.499 / max:15409.231) us GPU-0: 8937.603 us +/-737.363 (min: 8741.632 / max:16042.624) us
cupy_searchsorted : CPU: 354.953 us +/- 9.935 (min: 346.325 / max: 429.802) us GPU-0: 378.429 us +/- 9.438 (min: 370.688 / max: 450.560) us
The cupy version is much faster. I wondered whether the data conversion cost too much time, so I ran a further test:
_ref = torch.as_tensor(ref)
_sec = torch.as_tensor(sec)
print(benchmark(torch.searchsorted,(_ref, _sec), n_repeat=100))
I get
searchsorted : CPU: 5669.537 us +/-86.444 (min: 5646.031 / max: 6487.139) us GPU-0: 5677.324 us +/-86.665 (min: 5653.792 / max: 6496.576) us
It gets faster, but still not as fast as the cupy one. The test is on a V100. The pytorch version is
pytorch 1.13.1 py3.9_cuda11.7_cudnn8.5.0_0 pytorch
and the cupy version is 11.2.
Conclusion: although pytorch's searchsorted supports nD input, its performance is not as good as converting the problem to 1D and using cupy's 1-D searchsorted.