Search code examples
pythonpandasdataframedata-cleaning

Filter pandas dataframe based on threshold value and including the row afterwards


I have a dataframe with several columns - for simplicity, column A is a column of integers that are strictly increasing.

A    B    ...
103
222
383
432
799
1089
...

I would like to filter the dataframe based on a threshold value for column A, e.g. 750. I can do something like df[df['A'] < 750] to achieve this. This results in:

A    B    ...
103
222
383
432
...

However, I would like to include the value directly after this threshold value as well - e.g. the resulting dataframe also should contain the 799 row.

A    B    ...
103
222
383
432
799
...

A clunky way to do this is just to pd.concat these two dataframes df[df['A'] < 750] and df[df['A'] >= 750].iloc[0] together - it is essentially the original filter plus the single row directly after 750, which is 799, hence the iloc[0]. Is there some sort of built in method achieve this result, without having to do this manual concat?


Solution

  • You could use Series.searchsorted assuming the dataframe is sorted by A values. And this should be very fast.

    By default the side is left for searchsorted and because you first want to filter < 750 first and then include the next higher element to 750 then side=left should be good.

    df.loc[:df['A'].searchsorted(750)]
    

    Output (with made up 'B' values):

          A     B
    0   103     1
    1   222     2
    2   383     3
    3   432     4
    4   799     5