python pandas series binning

How can I bin a pandas Series so that each bin has a preset max/min ratio?


I have a pd.Series of floats and I would like to bin it into n bins, where the size of each bin is set so that its max/min is a preset value (e.g. 1.20).

The requirement means that the size of the bins is not constant. For example:

data = pd.Series(np.arange(1, 11.0))
print(data)

0     1.0
1     2.0
2     3.0
3     4.0
4     5.0
5     6.0
6     7.0
7     8.0
8     9.0
9    10.0
dtype: float64

I would like the bin sizes to be:

1.00 <= bin 1 < 1.20
1.20 <= bin 2 < 1.20 x 1.20 = 1.44
1.44 <= bin 3 < 1.44 x 1.20 = 1.73
...

etc

Thanks


Solution

  • Thanks, everyone, for all the suggestions. None of them does quite what I was after (probably because my original question wasn't clear enough), but they really helped me figure out what to do, so I have decided to post my own answer. (I hope this is what I am supposed to do, as I am relatively new to being an active member of Stack Overflow.)

    I liked @yatu's vectorised suggestion best because it will scale better with large data sets but I am after the means to not only automatically calculate the bins but also figure out the minimum number of bins needed to cover the data set.

    This is my proposed algorithm:

    1. The bin size is defined so that bin_max_i / bin_min_i is constant:
    bin_max_i / bin_min_i = bin_ratio
    
    2. Figure out the number of bins for the required bin size (bin_ratio):
    data_ratio = data_max / data_min
    n_bins = math.ceil( math.log(data_ratio) / math.log(bin_ratio) )
    
    3. Set the lower boundary of the smallest bin so that the smallest data point fits in it:
    bin_min_0 = data_min
    
    4. Create n non-overlapping bins meeting the conditions:
    bin_min_(i+1) = bin_max_i
    bin_max_(i+1) = bin_min_(i+1) * bin_ratio
    
    5. Stop creating further bins once the whole data set fits in the bins already created. In other words, stop once:
    bin_max_last > data_max
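
Steps 3 to 5 can also be written as a plain loop, which makes the stopping condition explicit. This is only a minimal sketch: the helper name geometric_bin_edges and its argument checks are my own assumptions, not part of the algorithm as posted, and it assumes strictly positive data.

```python
def geometric_bin_edges(data_min, data_max, bin_ratio):
    """Hypothetical helper: edges of bins for which bin_max / bin_min == bin_ratio."""
    if data_min <= 0 or bin_ratio <= 1:
        raise ValueError("needs data_min > 0 and bin_ratio > 1")
    edges = [data_min]                       # step 3: first bin starts at data_min
    while edges[-1] <= data_max:             # step 5: stop once bin_max_last > data_max
        edges.append(edges[-1] * bin_ratio)  # step 4: next edge = previous edge * bin_ratio
    return edges

edges = geometric_bin_edges(2, 11, 1.20)
print(len(edges) - 1, "bins; last edge:", edges[-1])
```

Because the stop condition is strict, a maximum that lands exactly on an edge gets one more bin than the closed-form count in step 2 would give.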
    

    Here is a code snippet:

    import math
    import numpy as np
    import pandas as pd
    
    bin_ratio = 1.20
    
    data = pd.Series(np.arange(2, 12))
    data_ratio = data.max() / data.min()
    bin_min_0 = data.min()              # lower limit of the 1st bin
    
    n_bins = math.ceil( math.log(data_ratio) / math.log(bin_ratio) )
    n_edges = n_bins + 1                # n_bins bins of the form [min, max) need n_bins + 1 edges
    
    bins = np.full(n_edges, bin_ratio)  # initialise the ratios for the bin limits
    bins[0] = bin_min_0                 # initialise the lower limit for the 1st bin
    bins = np.cumprod(bins)             # generate the bin edges
    
    print(bins)
    [ 2.          2.4         2.88        3.456       4.1472      4.97664
      5.971968    7.1663616   8.59963392 10.3195607  12.38347284]
    

    I am now set to build a histogram of the data:

    data.hist(bins=bins)
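
If the bin label of each value is needed rather than a plot, pd.cut accepts the same edges. The following is a rough sketch under my own assumptions: the names edges, binned and counts are illustrative, the edges are computed in closed form instead of with cumprod, and right=False is chosen so the intervals are closed on the left, matching the [min, max) convention above.

```python
import numpy as np
import pandas as pd

bin_ratio = 1.20
data = pd.Series(np.arange(2, 12), dtype=float)

# Same bin count as in the snippet above, computed with NumPy.
n_bins = int(np.ceil(np.log(data.max() / data.min()) / np.log(bin_ratio)))

# Closed-form edges: data_min * bin_ratio**k for k = 0 .. n_bins.
edges = data.min() * bin_ratio ** np.arange(n_bins + 1)

# right=False gives left-closed intervals [min, max).
binned = pd.cut(data, bins=edges, right=False)
counts = binned.value_counts(sort=False)
print(counts)
```

Note that a data maximum falling exactly on the last edge would be left out of a [min, max) interval and come back as NaN from pd.cut, so in that case one extra edge would be needed.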