I have a large dataframe out of which I have identified index ranges which I want to drop.
To identify the correct rows I have an array that looks somewhat like this:
drop_pairs = [
["2020-01-01 10:00:00","2020-01-03 13:00:00"],
["2020-01-13 11:00:00","2020-01-13 13:33:00"],
["2020-01-22 00:07:00","2020-01-22 22:22:00"],
...]
And I'm trying to drop rows in my dataframe based on this array using:
for timerange in drop_pairs:
df = df.drop(df.loc[f"{timerange[0]}":f"{timerange[1]}"].index)
But this method really is awfully slow with a large dataframe. How can I make this more efficient?
As hinted in the comments, this can also be done using boolean arrays. I've converted the array drop_pairs
into two arrays drop_starts
and drop_ends
with the corresponding start and end values of the timeranges to be dropped.
Next, I've created masks:
mask_starts = df.index.isin(drop_starts).astype(int)
mask_ends = df.index.isin(drop_ends).astype(int) * -1
So that mask_starts + mask_ends
would give an array the length of my df
with 1s where a timerange to be dropped starts, -1s where it ends, and 0s everywhere else.
I used np.cumsum
to subsequently mark all the timestamps to be dropped with 1 and everything else with 0.
drop_mask = np.cumsum(mask_starts + mask_ends)
Based on this mask, it is easy to drop the concerning ranges:
df = df[~drop_mask.astype(bool)]