Tags: python, pandas, dataframe, categorical-data, binning

Binning by value, except last bin


I am trying to bin data as follows:

pd.cut(df['col'], np.arange(0, 1.2, 0.2), include_lowest=True)

But I would like to ensure that any data greater than 1 is also included in that last bin. I can do this in a couple of lines, but I am wondering whether anyone knows a one-liner or a more Pythonic way of doing this.

PS: I am not looking to do a qcut. I need the bins to be defined by value ranges, not by the count of records.


Solution

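  • Solution 1: build the labels from the original bins (the first 5 rows of the DF are enough, because the categories depend only on the bin edges) and replace the last edge, 1, with np.inf in the bins parameter: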

    In [67]: df
    Out[67]:
              a         b         c
    0  1.698479  0.337989  0.002482
    1  0.903344  1.830499  0.095253
    2  0.152001  0.439870  0.270818
    3  0.621822  0.124322  0.471747
    4  0.534484  0.051634  0.854997
    5  0.980915  1.065050  0.211227
    6  0.809973  0.894893  0.093497
    7  0.677761  0.333985  0.349353
    8  1.491537  0.622429  1.456846
    9  0.294025  1.286364  0.384152
    
    In [68]: labels = pd.cut(df.a.head(), np.arange(0,1.2, 0.2), include_lowest=True).cat.categories
    
    In [69]: pd.cut(df.a, np.append(np.arange(0, 1, 0.2), np.inf), labels=labels, include_lowest=True)
    Out[69]:
    0      (0.8, 1]
    1      (0.8, 1]
    2      [0, 0.2]
    3    (0.6, 0.8]
    4    (0.4, 0.6]
    5      (0.8, 1]
    6      (0.8, 1]
    7    (0.6, 0.8]
    8      (0.8, 1]
    9    (0.2, 0.4]
    Name: a, dtype: category
    Categories (5, object): [[0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]
    

    Explanation:

    In [72]: np.append(np.arange(0, 1, 0.2), np.inf)
    Out[72]: array([ 0. ,  0.2,  0.4,  0.6,  0.8,  inf])
    
    In [73]: labels
    Out[73]: Index(['[0, 0.2]', '(0.2, 0.4]', '(0.4, 0.6]', '(0.6, 0.8]', '(0.8, 1]'], dtype='object')
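
For anyone who wants to try Solution 1 without the original DataFrame, here is a minimal self-contained sketch. It assumes a small Series holding the first five values of column a, and it writes the labels out by hand (copied from Out[73]) instead of deriving them with a second pd.cut call. The names s, edges and binned are only illustrative.

```python
import numpy as np
import pandas as pd

# First five values of column `a` from the DataFrame shown above.
s = pd.Series([1.698479, 0.903344, 0.152001, 0.621822, 0.534484], name="a")

# Finite edges 0, 0.2, 0.4, 0.6, 0.8, followed by an open-ended last edge.
edges = np.append(np.arange(0, 1, 0.2), np.inf)

# Labels copied from Out[73], so the last bin still reads "(0.8, 1]"
# even though it now catches everything above 0.8.
labels = ["[0, 0.2]", "(0.2, 0.4]", "(0.4, 0.6]", "(0.6, 0.8]", "(0.8, 1]"]

# Six edges define five bins, which must match the five labels.
binned = pd.cut(s, bins=edges, labels=labels, include_lowest=True)
print(binned)
```

Note that the label "(0.8, 1]" no longer describes the real interval, so it may be worth renaming it (for example to "> 0.8") if the result is shown to other people.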
    

  • Solution 2: clip all values greater than 1

    In [70]: pd.cut(df.a.clip(upper=1), np.arange(0,1.2, 0.2),include_lowest=True)
    Out[70]:
    0      (0.8, 1]
    1      (0.8, 1]
    2      [0, 0.2]
    3    (0.6, 0.8]
    4    (0.4, 0.6]
    5      (0.8, 1]
    6      (0.8, 1]
    7    (0.6, 0.8]
    8      (0.8, 1]
    9    (0.2, 0.4]
    Name: a, dtype: category
    Categories (5, object): [[0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]
    

    Explanation:

    In [75]: df.a
    Out[75]:
    0    1.698479
    1    0.903344
    2    0.152001
    3    0.621822
    4    0.534484
    5    0.980915
    6    0.809973
    7    0.677761
    8    1.491537
    9    0.294025
    Name: a, dtype: float64
    
    In [76]: df.a.clip(upper=1)
    Out[76]:
    0    1.000000
    1    0.903344
    2    0.152001
    3    0.621822
    4    0.534484
    5    0.980915
    6    0.809973
    7    0.677761
    8    1.000000
    9    0.294025
    Name: a, dtype: float64
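
Solution 2 can also be wrapped in a small helper so the upper limit is taken from the bin edges rather than hard-coded. This is only a sketch: cut_clipped is a hypothetical name, not a pandas function, and it assumes the edges are sorted in increasing order.

```python
import numpy as np
import pandas as pd

def cut_clipped(values, edges):
    """Bin `values` by `edges`, folding anything above the last edge into the last bin.

    Hypothetical helper for illustration; assumes `edges` is sorted ascending.
    """
    values = pd.Series(values)
    # Values above the last edge are lowered to it, so they land in the last bin.
    return pd.cut(values.clip(upper=edges[-1]), bins=edges, include_lowest=True)

edges = np.arange(0, 1.2, 0.2)  # 0, 0.2, ..., 1.0
s = pd.Series([1.698479, 0.903344, 0.152001, 1.491537, 0.294025])

binned = cut_clipped(s, edges)

# .cat.codes gives the 0-based position of the bin each value fell into.
codes = binned.cat.codes.tolist()
print(codes)
```

Unlike Solution 1, this keeps the interval labels that pd.cut generates, at the cost of losing the original values above 1 in the clipped copy. The original Series itself is not modified.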