Search code examples
pythonpandasoutliers

Pandas replace by NaN if the difference with the previous row is above a treshold


I have an half hourly dataframe df from which i want to remove outliers.

date  = ['2015-02-03 23:00:00','2015-02-03 23:30:00','2015-02-04 00:00:00','2015-02-04 00:30:00']
value_column = [33.24  , 500  , 34.39  , 34.49 ]

df = pd.DataFrame({'value column':value_column})
df.index = pd.to_datetime(df['index'],format='%Y-%m-%d %H:%M')
df.drop(['index'],axis=1,inplace=True)

print(df.head())
                   value column  
index                                     
2015-02-03 23:00:00  33.24   
2015-02-03 23:30:00  500   
2015-02-04 00:00:00  34.39   
2015-02-04 00:30:00  34.49   

I want to remove outliers based on the difference of the values from one hour to the next. I would like to replace outliers values by NaN if the absolute difference from one hour to the next is above a given treshold. How can I do that efficiently?

I know that I can get the difference of the dataframe with the line below, however I do not know how to replace values by nan at the identified indexes where the difference is above the given treshold. Any idea on how to do that efficiently? (Assuming for instance that the treshold is 100)

df = df.diff()

I have tried the following, it does not throw any error but does not work:

df["value column"]=df["value column"].mask(df["value column"].diff().abs() > 100, np.nan) 

Expected results:

                   value column  
index                                     
2015-02-03 23:00:00  33.24   
2015-02-03 23:30:00  NaN   
2015-02-04 00:00:00  34.39   
2015-02-04 00:30:00  34.49   

Solution

  • You need to find the do diff from top and bot together or the row number 3 will be droped as well

    df["value column"].mask((df["value column"].diff(-1).abs()>100) & (df["value column"].diff().abs() > 100), np.nan) 
    Out[270]: 
    0    33.24
    1      NaN
    2    34.39
    3    34.49
    Name: value column, dtype: float64