Search code examples
pythonpandasintervals

Finding matching interval(s) in pandas Intervalindex


There's this interesting API called Intervalindex new in 0.20 that lets you create an index of intervals.

Given some sample data:

data = [(893.1516130000001, 903.9187099999999),
 (882.384516, 893.1516130000001),
 (817.781935, 828.549032)]

You can create the index like this:

idx = pd.IntervalIndex.from_tuples(data)

print(idx)
IntervalIndex([(893.151613, 903.91871], (882.384516, 893.151613], (817.781935, 828.549032]]
              closed='right',
              dtype='interval[float64]')

An interesting property of Intervals is that you can perform interval checks with in:

print(y[-1])
Interval(817.78193499999998, 828.54903200000001, closed='right')

print(820 in y[-1])
True

print(1000 in y[-1])
False

I would like to know how to apply this operation to the entire index. For example, given some number 900, how could I retrieve a boolean mask of intervals for which this number fits in?

I can think of:

m = [900 in y for y in idx]
print(m)
[True, False, False]

Are there better ways to do this?


Solution

  • If you are interested in performance, an IntervalIndex is optimized for searching. using .get_loc or .get_indexer uses an internally built IntervalTree (like a binary tree), which is constructed on first use.

    In [29]: idx = pd.IntervalIndex.from_tuples(data*10000)
    
    In [30]: %timeit -n 1 -r 1 idx.map(lambda x: 900 in x)
    92.8 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
    
    In [40]: %timeit -n 1 -r 1 idx.map(lambda x: 900 in x)
    42.7 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
    
    # construct tree and search
    In [31]: %timeit -n 1 -r 1 idx.get_loc(900)
    4.55 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
    
    # subsequently
    In [32]: %timeit -n 1 -r 1 idx.get_loc(900)
    137 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
    
    # for a single indexer you can do even better (note that this is
    # dipping into the impl a bit
    In [27]: %timeit np.arange(len(idx))[(900 > idx.left) & (900 <= idx.right)]
    203 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    Note that .get_loc() returns an indexer (which is actually more useful than a boolean array, but they are convertible to each other).

    In [38]: idx.map(lambda x: 900 in x)
        ...: 
    Out[38]: 
    Index([ True, False, False,  True, False, False,  True, False, False,  True,
           ...
           False,  True, False, False,  True, False, False,  True, False, False], dtype='object', length=30000)
    
    In [39]: idx.get_loc(900)
        ...: 
    Out[39]: array([29997,  9987, 10008, ..., 19992, 19989,     0])
    

    Returning a boolean array is converted to an array of indexers

    In [5]: np.arange(len(idx))[idx.map(lambda x: 900 in x).values.astype(bool)]
    Out[5]: array([    0,     3,     6, ..., 29991, 29994, 29997])
    

    This is what .get_loc() and .get_indexer() return:

    In [6]: np.sort(idx.get_loc(900))
    Out[6]: array([    0,     3,     6, ..., 29991, 29994, 29997])