Search code examples
pythonpandasoverlap

How to efficiently find overlapping intervals?


I have the following toy example dataframe, df:

      f_low    f_high
   0.476201  0.481915
   0.479161  0.484977
   0.485997  0.491911
   0.503259  0.508679
   0.504687  0.510075
   0.504687  0.670075
   0.666093  0.670438
   0.765602  0.770028
   0.766884  0.771307
   0.775986  0.780398
   0.794590  0.798965

to find overlapping subsets of this, I am using the following code:

df = df.sort_values('f_low')
for row in df.itertuples():
    iix = pd.IntervalIndex.from_arrays(df.f_low, df.f_high, closed='neither')
    span_range = pd.Interval(row.f_low, row.f_high)
    fx = df[(iix.overlaps(span_range))].copy()

I would LIKE to get overlapping dataframes like this:

   # iteration 1: over row.f_low=0.476201  row.f_high=0.481915 

      f_low    f_high
   0.476201  0.481915
   0.479161  0.484977

   # iteration 2: over row.f_low=0.503259  row.f_high=0.508679 
      f_low    f_high
   0.503259  0.508679 
   0.504687  0.510075
   0.504687 0.670075

   # iteration 3: over row.f_low=0.504687  row.f_high=0.670075 
      f_low    f_high
   0.666093  0.670438

etc.

This works great, but since the dataframe is quite large and there are a lot of overlaps, this takes a long time to process. Also, the interval I am testing for overlaps does not grab itself when using the Interval and overlaps methods for pandas.

What this is meant to represent is a series of overlapping confidence intervals with each row that gets iterated over.

Is there a way to more efficiently extract overlapping intervals against a given interval besides iterating through all the tuples?

Here is a chunk of the actual dataframe unsorted:

f_low   f_high
0.504687  0.670075
0.476201  0.481915
0.765602  0.770028
0.479161  0.484977
0.766884  0.771307
0.485997  0.491911
0.666093  0.670438
0.503259  0.508679
0.775986  0.780398
0.504687  0.510075
0.794590  0.798965

Solution

  • If I understand correctly, you want to separate your current df into data frames where the initial interval is set by the first row, and the second interval is defined by the first row that does not intersect, etc. The below method will do that and should be pretty efficient if the number of groups isn't too large:

    df = df.sort_values("f_low").reset_index(drop=True)
    idx = 0
    dfs = []
    while True:
        low = df.f_low[idx]
        high = df.f_high[idx]
        sub_df = df[(df.f_low <= high) & (low <= df.f_low)]
        dfs.append(sub_df)
        idx = sub_df.index.max() + 1
        if idx > df.index.max():
            break
    

    output:

    [      f_low    f_high
     0  0.476201  0.481915
     1  0.479161  0.484977,
           f_low    f_high
     2  0.485997  0.491911,
           f_low    f_high
     3  0.503259  0.508679
     4  0.504687  0.510075
     5  0.504687  0.670075,
           f_low    f_high
     6  0.666093  0.670438,
           f_low    f_high
     7  0.765602  0.770028
     8  0.766884  0.771307,
           f_low    f_high
     9  0.775986  0.780398,
           f_low    f_high
     10  0.79459  0.798965]