I need to use pd.Interval to select records across multiple bin ranges.
df = pd.DataFrame({'my_col': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
df['my_col_bin'] = pd.cut(x=df['my_col'], bins=[0, 3, 6, 9, 12], right=False, include_lowest=True)
my_col my_col_bin
0 1 [0, 3)
1 2 [0, 3)
2 3 [3, 6)
3 4 [3, 6)
4 5 [3, 6)
5 6 [6, 9)
6 7 [6, 9)
7 8 [6, 9)
8 9 [9, 12)
9 10 [9, 12)
10 11 [9, 12)
For example, I would like to select all records that fall into range [3, 12). I would like to get the following output using single pd.Interval range, without specifying condition for each interval individually:
2 3 [3, 6)
3 4 [3, 6)
4 5 [3, 6)
5 6 [6, 9)
6 7 [6, 9)
7 8 [6, 9)
8 9 [9, 12)
9 10 [9, 12)
10 11 [9, 12)
I tried the following, which didn't work.
df[df['my_col_bin'] == pd.Interval(3, 12, closed='left')]
The following selection works on a single interval, but it appears that pd.Interval doesn't support multiple ranges.
df[df['my_col_bin'] == pd.Interval(3, 6, closed='left')]
Is it possible to select range across the bins without explicitly specifying each interval condition? Is there a succinct way to perform the selection without filtering for each interval individually?
Interval.overlaps
It does require an apply
, but it turns out to be rather fast even for a large DataFrame. (Takes about 50 ms for 1 Million rows on my machine)
m = df['my_col_bin'].apply(lambda x: x.overlaps(pd.Interval(3, 12, closed='left')))
df[m]
my_col my_col_bin
2 3 [3, 6)
3 4 [3, 6)
4 5 [3, 6)
5 6 [6, 9)
6 7 [6, 9)
7 8 [6, 9)
8 9 [9, 12)
9 10 [9, 12)
10 11 [9, 12)