python, numpy, mask

fastest way to get max value of each masked np.array for many masks?


I have two numpy arrays of the same shape. One contains information that I am interested in, and the other contains a bunch of integers that can be used as mask values.

In essence, I want to loop through each unique integer to get each mask for the array, then filter the main array using this mask and find the max value of the filtered array.

For simplicity, let's say the arrays are:

import numpy as np

arr1 = np.random.rand(10000, 10000)
arr2 = np.random.randint(low=0, high=1000, size=(10000, 10000))

Right now I'm doing this:

maxes = {}
ids = np.unique(arr2)   # every distinct mask value
for id in ids:
    # boolean mask for this id, then reduce the selected values to their max
    max_val = arr1[np.equal(arr2, id)].max()
    maxes[id] = max_val

My arrays are a lot bigger and this is painfully slow. I am struggling to find a quicker way of doing this... maybe there's some kind of creative method I'm not aware of. I would really appreciate any help.

EDIT

Let's say the majority of arr2 is actually 0 and I don't care about the 0 id. Is it possible to speed it up by dropping this entire chunk from the search?

i.e.

arr2[:, 0:4000] = 0

and just return the maxes for ids > 0?

Much appreciated.


Solution

  • Generic bin-based reduction strategies

    Listed below are a few approaches to tackle such scenarios where we need to perform bin-based reduction operations. Essentially, we are given two arrays: we use one as the bins (the integer IDs) and the other as the values, and we reduce the values within each bin.
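
    For instance (a hypothetical toy case, just to fix the terminology, assuming numpy is imported as np), the desired bin-wise maxima here would be {0: 9.0, 1: 1.0, 2: 7.0} -

    bins   = np.array([2, 0, 2, 1, 0])      # integer IDs, one per value
    values = np.array([5., 3., 7., 1., 9.])
    # bin-wise max: {0: 9.0, 1: 1.0, 2: 7.0}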

    Approach #1: One strategy would be to sort arr1 based on arr2. Once we have them both sorted in that same order, we find the group start indices and then, with the appropriate ufunc.reduceat, do our slice-based reduction operation. That's all there is!

    Here's the implementation -

    def binmax(bins, values, reduceat_func):
        ''' Get binned statistic from two 1D arrays '''
        sidx = bins.argsort()                 # order that sorts the bin IDs
        bins_sorted = bins[sidx]
        # start index of each run of identical bin IDs in the sorted order
        grpidx = np.flatnonzero(np.r_[True, bins_sorted[:-1] != bins_sorted[1:]])
        # slice-based reduction: one reduced value per bin
        max_per_group = reduceat_func(values[sidx], grpidx)
        out = dict(zip(bins_sorted[grpidx], max_per_group))
        return out
    
    out = binmax(arr2.ravel(), arr1.ravel(), reduceat_func=np.maximum.reduceat)
    

    This is applicable to any ufunc that has a corresponding ufunc.reduceat method.
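
    As a quick sanity check (a minimal sketch, assuming numpy is imported as np and binmax is defined as above), the output can be compared against the naive per-id loop from the question on a small input -

    # small test data, same structure as the question but tiny
    arr1 = np.random.rand(100, 100)
    arr2 = np.random.randint(low=0, high=10, size=(100, 100))

    fast = binmax(arr2.ravel(), arr1.ravel(), reduceat_func=np.maximum.reduceat)

    # naive reference: one boolean mask per unique id
    slow = {i: arr1[arr2 == i].max() for i in np.unique(arr2)}

    assert all(np.isclose(fast[i], slow[i]) for i in slow)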

    Approach #2: We can also leverage scipy.stats.binned_statistic, which is basically a generic utility for common reduction operations based on binned array values -

    from scipy.stats import binned_statistic
    
    def binmax_v2(bins, values, statistic):
        ''' Get binned statistic from two 1D arrays '''
        num_labels = bins.max() + 1
        R = np.arange(num_labels + 1)          # bin edges 0, 1, ..., num_labels
        Mx = binned_statistic(bins, values, statistic=statistic, bins=R)[0]
        idx = np.flatnonzero(~np.isnan(Mx))    # labels that actually occur
        out = dict(zip(idx, Mx[idx]))          # keep the float maxima as-is
        return out
    
    out = binmax_v2(arr2.ravel(), arr1.ravel(), statistic='max')
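
    Regarding the EDIT: if the bulk of arr2 is 0 and that id isn't needed, one option (a sketch assuming the binmax from Approach #1) is to drop the zero entries with a boolean mask before reducing, so the sort only ever touches the ids you care about -

    nz = arr2.ravel() != 0                     # keep only the non-zero ids
    out_nonzero = binmax(arr2.ravel()[nz], arr1.ravel()[nz],
                         reduceat_func=np.maximum.reduceat)

    Since the runtime of Approach #1 is dominated by the argsort, shrinking the input this way should give a roughly proportional speedup.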