Search code examples
pythonpandasdataframestatisticsoutliers

Remove outliers while preserving the timestamps in dataframe


I have data in a dataframe in the format shown below:

metric  timestamp              cas_pre        fl_rat       ...
0       2017-04-06 11:25:00    687.982849     1627.040283    ...
1       2017-04-06 11:30:00    693.427673     1506.217285    ...
2       2017-04-06 11:35:00    692.686310     1537.114807    ...
....
45      2017-04-06 11:35:00    51987.427673   1537.114807    ...
....
101003  2017-04-06 11:35:00    692.686310     1537.114807    ...

It's very clear that row 45 needs to be eliminated since it's an anomaly. There are multiple columns and quite a few rows (100,000+). Now I want to remove the outliers from this, and have been using the code:

drop_df = df.drop(columns=['timestamp'])
drop_df = drop_df[(np.abs(stats.zscore(drop_df)) < 3).all(axis=1)]

However, this would give me the data without the timestamps. This is due to the fact that I cannot use timestamps within the z-score calculation. However, I want to preserve the timestamps, the correlation to which is completely lost over the filtering with the z-score. This is shown below:

metric  timestamp              cas_pre        fl_rat       ...
0       2017-04-06 11:25:00    687.982849     1627.040283    ...
1       2017-04-06 11:30:00    693.427673     1506.217285    ...
2       2017-04-06 11:35:00    692.686310     1537.114807    ...
....
101003  2017-04-06 11:35:00    692.686310     1537.114807    ...

How can I achieve that?


Solution

  • It's probably better to explicitly set which columns to use for the z-score calculation:

    cols = ['cas_pre', 'fl_rat', ...]
    df = df[(np.abs(stats.zscore(df[cols])) < 3).all(axis=1)]
    

    Alternatively, you can drop the timestamp column only in the input to the z-score calculation:

    drop_df = df.drop(columns=['timestamp'])
    df = df[(np.abs(stats.zscore(drop_df)) < 3).all(axis=1)]