I have a set of bins which can be defined by:
A set of tuples of non-overlaping boundaries of each bin:
Intervals: [(0,1), (1,2), (3,4)]
A set of indicators identifying which boundary of each tuple is closed:
Closed_Boundaries: ['right','right','both']
A set of labels for each interval
Labels: ['first','second','third']
I am looking for an efficient, elegant and scalable way of applying this binning to a numeric series in a pandas dataframe so that the result includes the respective labels as identified by the binning logic:
Data_input: [3.5, 1, 0.5, 3]
Data_result: ['third', 'first', 'first', 'third']
I tried with pandas.IntervalIndex.from_tuples() followed by pandas.cut(). However, the labels argument of pandas.cut() is disabled when using IntervalIndex.from_tuples() while the name argument of the latter doesn't allow me to set labels to be used as replacement values.
PS: The pandas problem with labels not supported with IntervalIndex has been discussed here.
The simplest way is to use pd.cut
. However, there is an outstanding bug where it ignores labels
when bins
is IntervalIndex
.
def cut(array, bins, labels, closed='right'):
_bins = pd.IntervalIndex.from_tuples(bins, closed=closed)
x = pd.cut(array, _bins)
x.categories = labels # workaround for the bug
return x
array = [3.5, 1, 0.5, 3]
bins = [(0,1), (1,2), (3,4)]
labels = ['first', 'second', 'third']
df = pd.DataFrame({
'value': array,
'category': cut(array, bins, labels, closed='right')
})
Output:
value category
0 3.5 third
1 1.0 first
2 0.5 first
3 3.0 NaN
Things get a lot slower because the code is not vectorized but it's conceptually simple: for each item in the array, find the first bin that it falls into and add the label of that bin.
def cut(array, bins, labels):
intervals = [pd.Interval(*b) for b in bins]
categories = []
for value in array:
cat = None
for i, interval in enumerate(intervals):
if value in interval:
cat = labels[i]
break
categories.append(cat)
return categories
cut([3.5, 1, 0.5, 3], bins=[(0,1,'right'),(1,2,'right'),(3,4,'left')], labels=['first', 'second', 'third'])
I modified the bin tuples to include what side they are closed on. Options are left
, right
, both
and neither
.