I have an half hourly dataframe df from which i want to remove outliers.
date = ['2015-02-03 23:00:00','2015-02-03 23:30:00','2015-02-04 00:00:00','2015-02-04 00:30:00']
value_column = [33.24 , 500 , 34.39 , 34.49 ]
df = pd.DataFrame({'value column':value_column})
df.index = pd.to_datetime(df['index'],format='%Y-%m-%d %H:%M')
df.drop(['index'],axis=1,inplace=True)
print(df.head())
value column
index
2015-02-03 23:00:00 33.24
2015-02-03 23:30:00 500
2015-02-04 00:00:00 34.39
2015-02-04 00:30:00 34.49
I want to remove outliers based on the difference of the values from one hour to the next. I would like to replace outliers values by NaN if the absolute difference from one hour to the next is above a given treshold. How can I do that efficiently?
I know that I can get the difference of the dataframe with the line below, however I do not know how to replace values by nan at the identified indexes where the difference is above the given treshold. Any idea on how to do that efficiently? (Assuming for instance that the treshold is 100)
df = df.diff()
I have tried the following, it does not throw any error but does not work:
df["value column"]=df["value column"].mask(df["value column"].diff().abs() > 100, np.nan)
Expected results:
value column
index
2015-02-03 23:00:00 33.24
2015-02-03 23:30:00 NaN
2015-02-04 00:00:00 34.39
2015-02-04 00:30:00 34.49
You need to find the do diff from top and bot together or the row number 3 will be droped as well
df["value column"].mask((df["value column"].diff(-1).abs()>100) & (df["value column"].diff().abs() > 100), np.nan)
Out[270]:
0 33.24
1 NaN
2 34.39
3 34.49
Name: value column, dtype: float64