I am trying to bin data as follows:
pd.cut(df['col'], np.arange(0, 1.2, 0.2), include_lowest=True)
But I would like to ensure that any data greater than 1 is also included in the last bin. I can do this in a couple of lines, but I am wondering whether anyone knows a one-liner or a more Pythonic way of doing it.
PS: I am not looking to use qcut. I need the bins to be defined by value ranges, not by the count of records.
Solution 1: prepare labels (using the first 5 rows of the DF) and replace 1 with np.inf in the bins parameter:
In [67]: df
Out[67]:
          a         b         c
0  1.698479  0.337989  0.002482
1  0.903344  1.830499  0.095253
2  0.152001  0.439870  0.270818
3  0.621822  0.124322  0.471747
4  0.534484  0.051634  0.854997
5  0.980915  1.065050  0.211227
6  0.809973  0.894893  0.093497
7  0.677761  0.333985  0.349353
8  1.491537  0.622429  1.456846
9  0.294025  1.286364  0.384152
In [68]: labels = pd.cut(df.a.head(), np.arange(0,1.2, 0.2), include_lowest=True).cat.categories
In [69]: pd.cut(df.a, np.append(np.arange(0, 1, 0.2), np.inf), labels=labels, include_lowest=True)
Out[69]:
0 (0.8, 1]
1 (0.8, 1]
2 [0, 0.2]
3 (0.6, 0.8]
4 (0.4, 0.6]
5 (0.8, 1]
6 (0.8, 1]
7 (0.6, 0.8]
8 (0.8, 1]
9 (0.2, 0.4]
Name: a, dtype: category
Categories (5, object): [[0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]
Explanation:
In [72]: np.append(np.arange(0, 1, 0.2), np.inf)
Out[72]: array([ 0. , 0.2, 0.4, 0.6, 0.8, inf])
In [73]: labels
Out[73]: Index(['[0, 0.2]', '(0.2, 0.4]', '(0.4, 0.6]', '(0.6, 0.8]', '(0.8, 1]'], dtype='object')
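For readers who want to try this outside the session above, here is a minimal self-contained sketch of the same idea. It is not the exact code from the answer: the edges and the label strings are written out by hand (an assumption made to keep the example independent of how a given pandas version represents categories), and the sample values are the first five entries of column a.

```python
import numpy as np
import pandas as pd

# First five values of column `a` from the example DataFrame.
a = pd.Series([1.698479, 0.903344, 0.152001, 0.621822, 0.534484])

# The last edge is np.inf instead of 1, so everything above 0.8 lands in the last bin.
edges = [0, 0.2, 0.4, 0.6, 0.8, np.inf]

# Hand-written labels (illustrative); they keep the "(0.8, 1]" name for the open-ended bin.
labels = ["[0, 0.2]", "(0.2, 0.4]", "(0.4, 0.6]", "(0.6, 0.8]", "(0.8, 1]"]

binned = pd.cut(a, bins=edges, labels=labels, include_lowest=True)
print(binned)
```

Note that the last label still reads "(0.8, 1]" even though the bin now also holds values above 1, so a label such as "> 0.8" may be less misleading, depending on how the result is used.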
Solution 2: clip all values greater than 1 down to 1 before binning
In [70]: pd.cut(df.a.clip(upper=1), np.arange(0, 1.2, 0.2), include_lowest=True)
Out[70]:
0 (0.8, 1]
1 (0.8, 1]
2 [0, 0.2]
3 (0.6, 0.8]
4 (0.4, 0.6]
5 (0.8, 1]
6 (0.8, 1]
7 (0.6, 0.8]
8 (0.8, 1]
9 (0.2, 0.4]
Name: a, dtype: category
Categories (5, object): [[0, 0.2] < (0.2, 0.4] < (0.4, 0.6] < (0.6, 0.8] < (0.8, 1]]
Explanation:
In [75]: df.a
Out[75]:
0 1.698479
1 0.903344
2 0.152001
3 0.621822
4 0.534484
5 0.980915
6 0.809973
7 0.677761
8 1.491537
9 0.294025
Name: a, dtype: float64
In [76]: df.a.clip(upper=1)
Out[76]:
0 1.000000
1 0.903344
2 0.152001
3 0.621822
4 0.534484
5 0.980915
6 0.809973
7 0.677761
8 1.000000
9 0.294025
Name: a, dtype: float64
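The clipping approach can also be sketched as a standalone script. This is a hedged variant rather than the answer's exact call: it builds the edges with np.linspace(0, 1, 6) instead of np.arange(0, 1.2, 0.2) (an assumption, chosen because linspace states the endpoint explicitly), and it checks bin positions through the integer category codes.

```python
import numpy as np
import pandas as pd

# Column `a` from the example DataFrame.
a = pd.Series([1.698479, 0.903344, 0.152001, 0.621822, 0.534484,
               0.980915, 0.809973, 0.677761, 1.491537, 0.294025])

# Six evenly spaced edges from 0 to 1 inclusive, giving five bins.
edges = np.linspace(0, 1, 6)

# Cap values at 1 so they fall on the right edge of the last bin.
binned = pd.cut(a.clip(upper=1), bins=edges, include_lowest=True)

# Integer position of each value's bin (0 = lowest bin, 4 = highest).
print(binned.cat.codes.tolist())
```

Because clip(upper=1) only caps the upper end, values below 0 would still fall outside the bins and come back as NaN; passing lower=0 as well would handle those in the same way, if that is the desired behavior.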