python pandas delete-row

Delete rows between NaN and a change in the column value


I am stuck on a problem which looks simple but for which I cannot find a proper solution.

Consider a pandas DataFrame df made up of multiple columns A1, A2, etc., and let Ai be one of its columns, filled for example as follows:

Ai
25
30
30
NaN
12
15
15
NaN

I would like to delete all the rows in df for which the Ai values lie between a NaN and a "further change" in the column's value, so that my output (for column Ai) would be:

Ai
25
NaN
12
NaN

Any idea on how to do so would be very much appreciated. Thank you very much in advance.


Solution

  • update

    Similar to the original answer (below), but with a per-group filter so that the whole run of duplicated values at the start, or right after a NaN, is kept rather than only its first row.

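    # Rows that are NaN, or that directly follow a NaN (the first row counts too)
    m = df['Ai'].isna()
    start = m | m.shift(fill_value=True)
    # Label runs of consecutive equal values; every NaN gets its own run
    runs = df['Ai'].ne(df['Ai'].shift()).cumsum()
    # Keep each run that contains at least one flagged row
    df.loc[start.groupby(runs).filter(lambda d: d.sum() > 0).index]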
    

    Output (the sample used here differs from the one in the question):

         Ai
    0  25.0
    1  25.0
    2  25.0
    5   NaN
    6  30.0
    7  30.0
    9   NaN
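
To make the run-based filter easier to try out, here is a minimal self-contained sketch. The input column is hypothetical: it was chosen only so that the kept row labels line up with the output above, and it is not necessarily the data that output was produced from. The names `start`, `runs` and `out` are illustrative as well.

```python
import numpy as np
import pandas as pd

# Hypothetical sample (an assumption, not taken from the question)
df = pd.DataFrame(
    {"Ai": [25, 25, 25, 30, 30, np.nan, 30, 30, 31, np.nan]}
)

m = df["Ai"].isna()
# Flag NaN rows, rows directly after a NaN, and the very first row
start = m | m.shift(fill_value=True)

# Run id: increases whenever the value differs from the previous row.
# NaN never compares equal, so each NaN forms its own run.
runs = df["Ai"].ne(df["Ai"].shift()).cumsum()

# Keep every run that contains at least one flagged row
keep = start.groupby(runs).filter(lambda d: d.sum() > 0).index
out = df.loc[keep]

print(out)
```

With this input, rows 3, 4 and 8 are dropped because their runs contain no flagged row. Selecting with `df.loc` on the filtered index assumes the index labels are unique.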
    

    original answer

    This is equivalent to selecting each NaN row and the row directly below it, plus the first row (which `fill_value=True` takes care of). You could use a mask:

    m = df['Ai'].isna()               # True where Ai is NaN
    df[m | m.shift(fill_value=True)]  # NaN rows, the rows right after them, and the first row
    

    Output:

         Ai
    0  25.0
    3   NaN
    4  12.0
    7   NaN
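
For a quick check of the mask approach, the sketch below rebuilds the example column from the question and applies it end to end. Only the single column `Ai` is constructed here; a real df would carry its other columns along unchanged. The names `after_nan` and `out` are illustrative.

```python
import numpy as np
import pandas as pd

# Example column from the question
df = pd.DataFrame({"Ai": [25, 30, 30, np.nan, 12, 15, 15, np.nan]})

m = df["Ai"].isna()                    # True on the NaN rows
after_nan = m.shift(fill_value=True)   # True on the row after each NaN, and on row 0

# Boolean indexing keeps the rows where either mask is True
out = df[m | after_nan]

print(out)
```

Because the result keeps the original row labels (0, 3, 4, 7), `out.reset_index(drop=True)` can be used afterwards if a fresh 0..n-1 index is wanted.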