Split an array into bins of equal numbers


I have an unsorted array N. I'd like to keep the original order of its elements, but replace each element with its bin number, where the elements are split by value into m bins that each hold an equal number of elements (if the array length is divisible by m) or a nearly equal number (if it is not). I need a vectorized solution, since the array is fairly large and plain Python loops won't be efficient. Is there anything in scipy or numpy that can do this?

e.g.
N = [0.2, 1.5, 0.3, 1.7, 0.5]
m = 2
Desired output: [0, 1, 0, 1, 0]

I've looked at numpy.histogram, but by default it uses equally spaced bins, not the unequally spaced bins needed to get equal counts.


Solution

  • Here is a vectorized NumPy approach. The idea is to use np.searchsorted to assign evenly spaced bin indices to the sorted positions of the input array, then map those indices back to the original element order. Here's the implementation -

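    import numpy as np

    def equal_bin(N, m):
        # Upper edges of the m bins, measured in sorted positions
        sep = (N.size/float(m))*np.arange(1,m+1)
        idx = sep.searchsorted(np.arange(N.size))  # bin index per sorted position
        # Double argsort gives each element's rank, restoring the original order
        return idx[N.argsort().argsort()]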
    

    Sample runs with bin-counting for each bin to verify results -

    In [442]: N = np.arange(1,94)
    
    In [443]: np.bincount(equal_bin(N, 4))
    Out[443]: array([24, 23, 23, 23])
    
    In [444]: np.bincount(equal_bin(N, 5))
    Out[444]: array([19, 19, 18, 19, 18])
    
    In [445]: np.bincount(equal_bin(N, 10))
    Out[445]: array([10,  9,  9, 10,  9,  9, 10,  9,  9,  9])
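
The last line of the function relies on the fact that applying argsort twice turns values into ranks. A minimal sketch of that step on the question's example array follows; the variable names `order`, `ranks`, `sorted_bins` and `labels` are only illustrative and do not appear in the answer's code.

```python
import numpy as np

N = np.array([0.2, 1.5, 0.3, 1.7, 0.5])

order = N.argsort()      # indices that would sort N: [0, 2, 4, 1, 3]
ranks = order.argsort()  # rank of each element of N: [0, 3, 1, 4, 2]

# Bin labels for the sorted positions (5 elements, 2 bins),
# written out by hand here instead of computed
sorted_bins = np.array([0, 0, 0, 1, 1])

# Indexing with the ranks maps the labels back to the original order
labels = sorted_bins[ranks]
print(labels)            # [0 1 0 1 0]
```

Each element picks up the bin label of the position it would occupy after sorting, so the array itself never has to be reordered.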
    

    Here's another approach using linspace to create those equally spaced numbers that could be used as indices, like so -

    def equal_bin_v2(N, m):
        # num must be an integer in current NumPy; truncating to int gives the bin index
        idx = np.linspace(0, m, N.size, endpoint=False).astype(int)
        return idx[N.argsort().argsort()]
    

    Sample run -

    In [689]: N
    Out[689]: array([ 0.2,  1.5,  0.3,  1.7,  0.5])
    
    In [690]: equal_bin_v2(N,2)
    Out[690]: array([0, 1, 0, 1, 0])
    
    In [691]: equal_bin_v2(N,3)
    Out[691]: array([0, 1, 0, 2, 1])
    
    In [692]: equal_bin_v2(N,4)
    Out[692]: array([0, 2, 0, 3, 1])
    
    In [693]: equal_bin_v2(N,5)
    Out[693]: array([0, 3, 1, 4, 2])
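
As a possible variation (not part of the answer above), the bin index can also be computed from the ranks with integer arithmetic, which avoids floating-point bin edges altogether. The function name `equal_bin_int` is hypothetical, and the stable sort is an assumption added here so that tied values are ranked in their original order.

```python
import numpy as np

def equal_bin_int(N, m):
    """Sketch: label each element of N with one of m near-equal-count bins."""
    N = np.asarray(N)
    # Rank of each element in sorted order; kind="stable" keeps ties in input order
    ranks = N.argsort(kind="stable").argsort(kind="stable")
    # floor(rank * m / n) spreads the ranks 0..n-1 evenly over bins 0..m-1
    return ranks * m // N.size

N = np.array([0.2, 1.5, 0.3, 1.7, 0.5])
print(equal_bin_int(N, 2))                            # [0 1 0 1 0]
print(equal_bin_int(N, 3))                            # [0 1 0 2 1]
print(np.bincount(equal_bin_int(np.arange(1, 94), 4)))  # [24 23 23 23]
```

On these inputs it reproduces the outputs of the sample runs shown above; it is meant as a sketch to compare against, not a drop-in replacement.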