I have data in a dataframe in the format shown below:
metric timestamp cas_pre fl_rat ...
0 2017-04-06 11:25:00 687.982849 1627.040283 ...
1 2017-04-06 11:30:00 693.427673 1506.217285 ...
2 2017-04-06 11:35:00 692.686310 1537.114807 ...
....
45 2017-04-06 11:35:00 51987.427673 1537.114807 ...
....
101003 2017-04-06 11:35:00 692.686310 1537.114807 ...
It's very clear that row 45 needs to be eliminated since it's an anomaly. There are multiple columns and quite a few rows (100,000+). Now I want to remove the outliers from this, and have been using the code:
drop_df = df.drop(columns=['timestamp'])
drop_df = drop_df[(np.abs(stats.zscore(drop_df)) < 3).all(axis=1)]
However, this would give me the data without the timestamps. This is due to the fact that I cannot use timestamps within the z-score calculation. However, I want to preserve the timestamps, the correlation to which is completely lost over the filtering with the z-score. This is shown below:
metric timestamp cas_pre fl_rat ...
0 2017-04-06 11:25:00 687.982849 1627.040283 ...
1 2017-04-06 11:30:00 693.427673 1506.217285 ...
2 2017-04-06 11:35:00 692.686310 1537.114807 ...
....
101003 2017-04-06 11:35:00 692.686310 1537.114807 ...
How can I achieve that?
It's probably better to explicitly set which columns to use for the z-score calculation:
cols = ['cas_pre', 'fl_rat', ...]
df = df[(np.abs(stats.zscore(df[cols])) < 3).all(axis=1)]
Alternatively, you can drop the timestamp column only in the input to the z-score calculation:
drop_df = df.drop(columns=['timestamp'])
df = df[(np.abs(stats.zscore(drop_df)) < 3).all(axis=1)]