I have the following toy example dataframe, df
:
f_low f_high
0.476201 0.481915
0.479161 0.484977
0.485997 0.491911
0.503259 0.508679
0.504687 0.510075
0.504687 0.670075
0.666093 0.670438
0.765602 0.770028
0.766884 0.771307
0.775986 0.780398
0.794590 0.798965
to find overlapping subsets of this, I am using the following code:
df = df.sort_values('f_low')
for row in df.itertuples():
iix = pd.IntervalIndex.from_arrays(df.f_low, df.f_high, closed='neither')
span_range = pd.Interval(row.f_low, row.f_high)
fx = df[(iix.overlaps(span_range))].copy()
I would LIKE to get overlapping dataframes like this:
# iteration 1: over row.f_low=0.476201 row.f_high=0.481915
f_low f_high
0.476201 0.481915
0.479161 0.484977
# iteration 2: over row.f_low=0.503259 row.f_high=0.508679
f_low f_high
0.503259 0.508679
0.504687 0.510075
0.504687 0.670075
# iteration 3: over row.f_low=0.504687 row.f_high=0.670075
f_low f_high
0.666093 0.670438
etc.
This works great, but since the dataframe is quite large and there are a lot of overlaps, this takes a long time to process. Also, the interval I am testing for overlaps does not grab itself when using the Interval
and overlaps
methods for pandas.
What this is meant to represent is a series of overlapping confidence intervals with each row that gets iterated over.
Is there a way to more efficiently extract overlapping intervals against a given interval besides iterating through all the tuples?
Here is a chunk of the actual dataframe unsorted:
f_low f_high
0.504687 0.670075
0.476201 0.481915
0.765602 0.770028
0.479161 0.484977
0.766884 0.771307
0.485997 0.491911
0.666093 0.670438
0.503259 0.508679
0.775986 0.780398
0.504687 0.510075
0.794590 0.798965
If I understand correctly, you want to separate your current df into data frames where the initial interval is set by the first row, and the second interval is defined by the first row that does not intersect, etc. The below method will do that and should be pretty efficient if the number of groups isn't too large:
df = df.sort_values("f_low").reset_index(drop=True)
idx = 0
dfs = []
while True:
low = df.f_low[idx]
high = df.f_high[idx]
sub_df = df[(df.f_low <= high) & (low <= df.f_low)]
dfs.append(sub_df)
idx = sub_df.index.max() + 1
if idx > df.index.max():
break
output:
[ f_low f_high
0 0.476201 0.481915
1 0.479161 0.484977,
f_low f_high
2 0.485997 0.491911,
f_low f_high
3 0.503259 0.508679
4 0.504687 0.510075
5 0.504687 0.670075,
f_low f_high
6 0.666093 0.670438,
f_low f_high
7 0.765602 0.770028
8 0.766884 0.771307,
f_low f_high
9 0.775986 0.780398,
f_low f_high
10 0.79459 0.798965]