Tags: python, performance, numpy, pandas, binning

Binning and then combining bins with minimum number of observations?


Let's say I create some data and then create bins of different sizes:

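from __future__ import division
import numpy as np
import pandas as pd

x = np.random.rand(1, 20)
# x has shape (1, 20), so unpack the single row of bin indices
new, = np.digitize(x, np.arange(1, x.shape[1] + 1) / 100)
new_series = pd.Series(new)
print(new_series.value_counts())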

reveals:

20 17
16 1
4  1
2  1
dtype: int64

I want to transform the underlying data so that, with a minimum threshold of 2 observations per bin, new_series.value_counts() returns this:

20 17
16 3
dtype: int64

Solution

  • EDITED:

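    import numpy as np

    x = np.random.rand(1, 100)
    bins = np.arange(1, x.shape[1] + 1) / 100

    new = np.digitize(x, bins)
    n = new[0].copy()  # this will hold the result

    threshold = 2  # minimum number of observations per bin

    # Visit every bin index in ascending order, skipping the last one,
    # so observations are never pushed beyond the last bin.
    for i in range(bins.size):
        count = np.count_nonzero(n == i)
        if 0 < count < threshold:
            n[n == i] += 1  # merge the sparse bin into the next bin up

    n = n.reshape(1, -1)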
    

    Because the bins are visited in ascending order, observations can be moved up several times, until they land in a bin that meets the threshold.

    It might be simpler to use np.histogram instead of np.digitize, because it returns the counts directly, so we don't need to sum them ourselves.
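
A minimal sketch of the np.histogram idea, under two assumptions that are not spelled out in the answer: sparse bins are merged into the next bin up (as in the loop above), and empty bins are absorbed along the way. The helper name merge_sparse_bins and the sample data are made up for illustration.

```python
import numpy as np


def merge_sparse_bins(x, edges, threshold):
    """Merge adjacent bins until each holds at least `threshold` points.

    Hypothetical helper: returns (new_edges, new_counts).
    """
    counts, edges = np.histogram(x, bins=edges)

    new_edges = [edges[0]]
    new_counts = []
    running = 0  # observations collected since the last kept edge

    for count, right_edge in zip(counts, edges[1:]):
        running += count
        if running >= threshold:
            # Enough observations: close the merged bin at this edge.
            new_edges.append(right_edge)
            new_counts.append(running)
            running = 0

    # Trailing bins that never reached the threshold are folded into
    # the last kept bin (or form a single bin if none was kept).
    if new_edges[-1] != edges[-1]:
        if new_counts:
            new_counts[-1] += running
            new_edges[-1] = edges[-1]
        else:
            new_counts.append(running)
            new_edges.append(edges[-1])

    return np.array(new_edges), np.array(new_counts)


# Small fixed sample so the result is reproducible.
x = [0.05, 0.15, 0.16, 0.35, 0.75, 0.76, 0.77]
edges = np.linspace(0, 1, 11)  # ten equal-width bins on [0, 1]
threshold = 2

new_edges, new_counts = merge_sparse_bins(x, edges, threshold)
print(new_edges)
print(new_counts)
```

With this sample, the ten initial bins collapse into two bins holding 3 and 4 observations. Passing new_edges back to np.digitize or np.histogram gives the relabelled data. Unlike the loop above, this version merges bin edges rather than shifting labels, so empty bins disappear instead of being left in place.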