python · pandas · dataframe · indexing · drop

Algorithmic efficiency of dropping many rows from a pandas DataFrame


I have a large DataFrame in which I have identified index ranges that I want to drop.

To identify the correct rows, I have an array that looks like this:

drop_pairs = [
["2020-01-01 10:00:00","2020-01-03 13:00:00"],
["2020-01-13 11:00:00","2020-01-13 13:33:00"],
["2020-01-22 00:07:00","2020-01-22 22:22:00"],
...]

And I'm trying to drop rows in my dataframe based on this array using:

for start, end in drop_pairs:
    df = df.drop(df.loc[start:end].index)

But this method is painfully slow on a large DataFrame. How can I make it more efficient?


Solution

  • As hinted in the comments, this can also be done using boolean arrays. I converted the array drop_pairs into two arrays, drop_starts and drop_ends, holding the start and end timestamps of the ranges to be dropped.

    Next, I've created masks:

    # Note: isin() matches exact timestamps, so the values in drop_starts
    # and drop_ends must actually exist in the index (unlike .loc slicing,
    # which matches anything within the range).
    mask_starts = df.index.isin(drop_starts).astype(int)
    mask_ends = df.index.isin(drop_ends).astype(int) * -1
    

    So that mask_starts + mask_ends would give an array the length of my df with 1s where a timerange to be dropped starts, -1s where it ends, and 0s everywhere else.

    I used np.cumsum to mark all the timestamps to be dropped with 1 and everything else with 0. One subtlety: the cumulative sum already falls back to 0 at each end timestamp, so the end row itself would be kept. Subtracting mask_ends puts the end rows back in, keeping the drop inclusive like the original .loc slice:

    drop_mask = np.cumsum(mask_starts + mask_ends) - mask_ends
    

    Based on this mask, it is easy to drop the affected ranges:

    df = df[~drop_mask.astype(bool)]
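Putting the steps above together, here is a minimal, self-contained sketch. The toy minute-frequency DataFrame and the two drop ranges are hypothetical stand-ins for the asker's data; it assumes the range endpoints exist exactly in the index, as isin() requires:

```python
import numpy as np
import pandas as pd

# Toy DataFrame with a minute-frequency DatetimeIndex (illustrative data).
idx = pd.date_range("2020-01-01 00:00:00", periods=10, freq="min")
df = pd.DataFrame({"value": range(10)}, index=idx)

# Hypothetical ranges to drop (inclusive of both endpoints).
drop_pairs = [
    ["2020-01-01 00:02:00", "2020-01-01 00:04:00"],
    ["2020-01-01 00:07:00", "2020-01-01 00:08:00"],
]
drop_starts, drop_ends = zip(*drop_pairs)

# +1 where a range starts, -1 where it ends.
mask_starts = df.index.isin(pd.to_datetime(drop_starts)).astype(int)
mask_ends = df.index.isin(pd.to_datetime(drop_ends)).astype(int) * -1

# The cumulative sum marks rows inside each range with 1; subtracting
# mask_ends adds the end rows back so the drop is inclusive.
drop_mask = np.cumsum(mask_starts + mask_ends) - mask_ends

df = df[~drop_mask.astype(bool)]
print(df["value"].tolist())  # rows 2-4 and 7-8 removed: [0, 1, 5, 6, 9]
```

This replaces the per-range drop loop with a single vectorized pass over the index, which is what makes it fast on a large DataFrame.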