>>> import numpy_indexed as npi
>>> import numpy as np
>>> a = np.array([[0,0,1,1,2,2], [4,4,8,8,10,10]]).T
>>> a
array([[ 0,  4],
       [ 0,  4],
       [ 1,  8],
       [ 1,  8],
       [ 2, 10],
       [ 2, 10]])
>>> npi.group_by(a[:, 0]).sum(a[:,1])
(array([0, 1, 2]), array([ 8, 16, 20], dtype=int32))
I want to perform calculations on subsets of the second column, clustered by the first column, on large sets (~1M rows). Is there an efficient (and/or vectorised) way to use the output of group_by from numpy_indexed in order to add a new column with the output of these calculations? For the sum example above, I would like to produce the output below. If there is an efficient way of doing this without using numpy_indexed in the first place, that would also be very helpful.
array([[ 0,  4,  8],
       [ 0,  4,  8],
       [ 1,  8, 16],
       [ 1,  8, 16],
       [ 2, 10, 20],
       [ 2, 10, 20]])
One approach is to use np.unique to generate the unique tags and the interval-shifting indices, and then np.add.reduceat for the intervaled summing -
# Group-start indices and per-row group IDs from the (sorted) labels column
_, idx, tags = np.unique(a[:,0], return_index=1, return_inverse=1)
# Sum each group's values, then broadcast the group sums back to rows via tags
out = np.c_[a, np.add.reduceat(a[:,1], idx)[tags]]
Another way that avoids the use of np.unique and might be beneficial for performance would be like so -
# Group-start indices: 0 plus every position where the (sorted) label increases
idx = np.r_[0, np.flatnonzero(a[1:,0] > a[:-1,0]) + 1]
# Per-row group IDs: mark each group start with 1, then take the cumulative sum
tag_arr = np.zeros(a.shape[0], dtype=int)
tag_arr[idx[1:]] = 1
tags = tag_arr.cumsum()
out = np.c_[a, np.add.reduceat(a[:,1], idx)[tags]]
For a further performance boost, we can use np.bincount. Thus, np.add.reduceat(a[:,1], idx) could be replaced by np.bincount(tags, a[:,1]).
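A minimal sketch of that substitution (not from the original answer), again assuming the labels column is sorted; note that np.bincount returns float64 sums, so the new column is cast back to the input dtype here for a like-for-like output -

import numpy as np

a = np.array([[0, 0, 1, 1, 2, 2], [4, 4, 8, 8, 10, 10]]).T  # sample from the question

# Group-start indices and per-row group IDs (labels assumed sorted)
idx = np.r_[0, np.flatnonzero(a[1:, 0] > a[:-1, 0]) + 1]
tag_arr = np.zeros(a.shape[0], dtype=int)
tag_arr[idx[1:]] = 1
tags = tag_arr.cumsum()

# np.bincount with weights sums a[:,1] per tag; the result comes back as float64
grp_sums = np.bincount(tags, a[:, 1])

# Cast back to the input dtype for an integer output column
out = np.c_[a, grp_sums[tags].astype(a.dtype)]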
Sample run -
In [271]: a # Using a more generic sample
Out[271]:
array([[11,  4],
       [11,  4],
       [14,  8],
       [14,  8],
       [16, 10],
       [16, 10]])
In [272]: _,idx,tags = np.unique(a[:,0], return_index=1, return_inverse=1)
In [273]: np.c_[a, np.add.reduceat(a[:,1],idx)[tags]]
Out[273]:
array([[11,  4,  8],
       [11,  4,  8],
       [14,  8, 16],
       [14,  8, 16],
       [16, 10, 20],
       [16, 10, 20]])
Now, the listed approaches assume that the first column is already sorted. If that's not the case, we need to sort the array by the first column with argsort and then use the proposed method. Thus, for the unsorted case, we need the following as pre-processing -
a = a[a[:,0].argsort()]
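If the original row order must be preserved in the output, one option (a sketch, not part of the original answer) is to work on the sorted view and then scatter the per-row sums back with the sorting permutation -

import numpy as np

a = np.array([[2, 10], [0, 4], [1, 8], [2, 10], [0, 4], [1, 8]])  # unsorted example

order = a[:, 0].argsort(kind='stable')   # permutation that sorts the labels column
a_sorted = a[order]

# Same steps as above, on the sorted copy
idx = np.r_[0, np.flatnonzero(a_sorted[1:, 0] > a_sorted[:-1, 0]) + 1]
tags = np.zeros(a_sorted.shape[0], dtype=int)
tags[idx[1:]] = 1
tags = tags.cumsum()
sums_sorted = np.add.reduceat(a_sorted[:, 1], idx)[tags]

# Scatter the per-row sums back to the original row order
sums = np.empty_like(sums_sorted)
sums[order] = sums_sorted
out = np.c_[a, sums]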
Battle against np.unique
Let's time the custom flatnonzero + cumsum based method against the built-in np.unique for creating the shifting indices idx and the uniqueness-based IDs/tags tags. For a case like this one, where we know beforehand that the labels column is already sorted, we avoid any sorting, unlike np.unique, which sorts internally. This gives us an advantage on performance. So, let's verify it.
Approaches -
def nonzero_cumsum_based(A):
    # Group-start indices: position 0 plus every index where the label increases
    idx = np.concatenate(([0], np.flatnonzero(A[1:] > A[:-1]) + 1))
    # Per-element group IDs: mark group starts with 1, then cumulative sum
    tags = np.zeros(len(A), dtype=int)
    tags[idx[1:]] = 1
    np.cumsum(tags, out=tags)
    return idx, tags

def unique_based(A):
    # np.unique sorts internally; return_index/return_inverse give idx and tags
    _, idx, tags = np.unique(A, return_index=1, return_inverse=1)
    return idx, tags
Sample run with the custom func -
In [438]: a
Out[438]:
array([[11,  4],
       [11,  4],
       [14,  8],
       [14,  8],
       [16, 10],
       [16, 10]])
In [439]: idx, tags = nonzero_cumsum_based(a[:,0])
In [440]: idx
Out[440]: array([0, 2, 4])
In [441]: tags
Out[441]: array([0, 0, 1, 1, 2, 2])
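As a quick sanity check (an added sketch using the same a and the two helpers defined above), both routes should produce identical group-start indices and group IDs on a sorted labels column -

idx_u, tags_u = unique_based(a[:, 0])
idx_c, tags_c = nonzero_cumsum_based(a[:, 0])

# The custom route should match np.unique's output exactly
assert np.array_equal(idx_u, idx_c) and np.array_equal(tags_u, tags_c)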
Timings -
In [444]: a = np.c_[np.sort(randi(10,10000,(100000))), randi(0,10000,(100000))]
In [445]: %timeit unique_based(a[:,0])
100 loops, best of 3: 4.3 ms per loop
In [446]: %timeit nonzero_cumsum_based(a[:,0])
1000 loops, best of 3: 486 µs per loop
In [447]: a = np.c_[np.sort(randi(10,10000,(1000000))), randi(0,10000,(1000000))]
In [448]: %timeit unique_based(a[:,0])
10 loops, best of 3: 50.2 ms per loop
In [449]: %timeit nonzero_cumsum_based(a[:,0])
100 loops, best of 3: 3.98 ms per loop
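Putting the faster pieces together (flatnonzero + cumsum for the IDs, np.bincount for the sums) into one helper, here is a minimal end-to-end sketch; randi above is presumably shorthand for np.random.randint, which is spelled out below -

import numpy as np

def append_group_sums(a):
    # Append a column with per-group sums of a[:,1], grouped by the sorted a[:,0]
    labels, vals = a[:, 0], a[:, 1]
    idx = np.concatenate(([0], np.flatnonzero(labels[1:] > labels[:-1]) + 1))
    tags = np.zeros(len(labels), dtype=int)
    tags[idx[1:]] = 1
    np.cumsum(tags, out=tags)
    grp_sums = np.bincount(tags, vals)            # float64 sums per group
    return np.c_[a, grp_sums[tags].astype(a.dtype)]

# ~1M rows with a sorted labels column, mirroring the timing setup above
a = np.c_[np.sort(np.random.randint(10, 10000, 1000000)),
          np.random.randint(0, 10000, 1000000)]
out = append_group_sums(a)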