Search code examples
pythonpandasetl

Selecting rows in Pandas using multiple intervals (pd.Interval range objects)


I need to use pd.Interval to select records across multiple bin ranges.

df = pd.DataFrame({'my_col': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]})
df['my_col_bin'] = pd.cut(x=df['my_col'], bins=[0, 3, 6, 9, 12], right=False, include_lowest=True)
    my_col  my_col_bin
0   1   [0, 3)
1   2   [0, 3)
2   3   [3, 6)
3   4   [3, 6)
4   5   [3, 6)
5   6   [6, 9)
6   7   [6, 9)
7   8   [6, 9)
8   9   [9, 12)
9   10  [9, 12)
10  11  [9, 12)

For example, I would like to select all records that fall into range [3, 12). I would like to get the following output using single pd.Interval range, without specifying condition for each interval individually:

2   3   [3, 6)
3   4   [3, 6)
4   5   [3, 6)
5   6   [6, 9)
6   7   [6, 9)
7   8   [6, 9)
8   9   [9, 12)
9   10  [9, 12)
10  11  [9, 12)

I tried the following, which didn't work.

df[df['my_col_bin'] == pd.Interval(3, 12, closed='left')]

The following selection works on a single interval, but it appears that pd.Interval doesn't support multiple ranges.

df[df['my_col_bin'] == pd.Interval(3, 6, closed='left')]

Is it possible to select range across the bins without explicitly specifying each interval condition? Is there a succinct way to perform the selection without filtering for each interval individually?


Solution

  • Interval.overlaps

    It does require an apply, but it turns out to be rather fast even for a large DataFrame. (Takes about 50 ms for 1 Million rows on my machine)

    m = df['my_col_bin'].apply(lambda x: x.overlaps(pd.Interval(3, 12, closed='left')))
    
    df[m]
        my_col my_col_bin
    2        3     [3, 6)
    3        4     [3, 6)
    4        5     [3, 6)
    5        6     [6, 9)
    6        7     [6, 9)
    7        8     [6, 9)
    8        9    [9, 12)
    9       10    [9, 12)
    10      11    [9, 12)