Search code examples
pythonpandasdataframepandas-groupby

How to groupby and drop rows that include "-inf" or "inf" and all the previous rows


I have the following df, I want to groupby "group" and drop 1) not only rows that include NaN, but all the preceding values or 2) rows that precede the last NaN.

group  date     value    
  1    jan2019    NaN
  1    jan2019    3
  1    jan2019    NaN
  1    feb2019    3        
  1    mar2019    4       
  1    mar2019    5        
  2    feb2019    0        
  2    feb2019    NaN       
  2    mar2019    7        
  2    mar2019    4        
  2    apr2019    5        

desired df version 1.

group  date     value    
  1    feb2019    3        
  1    mar2019    4       
  1    mar2019    5                  
  2    mar2019    7        
  2    mar2019    4        
  2    apr2019    5        

desired df version 2.

group  date     value    
  1    jan2019    NaN
  1    feb2019    3        
  1    mar2019    4       
  1    mar2019    5        
  2    feb2019    NaN            
  2    mar2019    7        
  2    mar2019    4        
  2    apr2019    5        

Solution

  • You can flag the inf values using abs+eq. Then to flag all the values before it as well, you can reverse the order of the Series and use cummax. Since you want to do this job across groups, you can use groupby.cummax. Finally, use the boolean mask to filter the desired output via loc:

    out = df.loc[~df['value'].abs().eq(float('inf'))[::-1].groupby(df['group']).cummax()]
    

    If the values to flag are NaNs (rather than inf), then we could use isna instead:

    out = df.loc[~df['value'].isna()[::-1].groupby(df['group']).cummax()]
    

    Output:

       group     date  value
    2      1  feb2019    3.0
    3      1  mar2019    4.0
    4      1  mar2019    5.0
    7      2  mar2019    7.0
    8      2  mar2019    4.0
    9      2  apr2019    5.0
    

    For the third output, you can use groupby.shift:

    out = df.loc[(~df['value'].isna()[::-1].groupby(df['group']).cummax()).groupby(df['group']).shift().fillna(True)]
    

    Output:

        group     date  value
    2       1  jan2019    NaN
    3       1  feb2019    3.0
    4       1  mar2019    4.0
    5       1  mar2019    5.0
    7       2  feb2019    NaN
    8       2  mar2019    7.0
    9       2  mar2019    4.0
    10      2  apr2019    5.0