Tags: python · pandas · time-series · outliers · anomaly-detection

Modify outliers caused by sensor failures in time-series data


I am working with time-series data collected from a sensor at 5-minute intervals. Unfortunately, there are cases where the measured value (PV yield in watts) is suddenly 0 or implausibly high, while the values before and after are correct:

[plot: PV yield time series with isolated drops to 0 and spikes to very high values]

My goal is to identify these 'outliers' and, in a second step, replace each one with the mean of the previous and next value. I've experimented with two approaches so far, but both flag many 'outliers' that are not measurement errors. Hence, I am looking for better approaches.

Try 1: Classic outlier detection with IQR (source)

def updateOutliersIQR(group):
  # 'yield' is a Python keyword, so attribute access (group.yield) is a
  # SyntaxError; use bracket indexing instead
  Q1 = group['yield'].quantile(0.25)
  Q3 = group['yield'].quantile(0.75)
  IQR = Q3 - Q1
  outliers = (group['yield'] < (Q1 - 1.5 * IQR)) | (group['yield'] > (Q3 + 1.5 * IQR))
  print(outliers[outliers])

# calling the function on a per-day level
df.groupby(df.index.date).apply(updateOutliersIQR)

Try 2: Kernel density estimation (source)

def updateOutliersKDE(group):
  a = 0.9
  # bracket indexing again, since 'yield' is a keyword
  r = group['yield'].rolling(3, min_periods=1, win_type='parzen').sum()
  n = r.max()
  outliers = (r > n * a)
  print(outliers[outliers])

# calling the function on a per-day level
df.groupby(df.index.date).apply(updateOutliersKDE)

Try 3: Median filter (source; as suggested by Jonnor)

import numpy as np

def median_filter(num_std=3):
  def _median_filter(x):
    _median = np.median(x)
    _std = np.std(x)
    s = x[-3]  # centre element of the 5-sample window
    if _median - num_std * _std <= s <= _median + num_std * _std:
      return s
    else:
      return _median
  return _median_filter

# calling the function ('yield' is a keyword, so bracket indexing is needed)
df['yield'].rolling(5, center=True).apply(median_filter(2), raw=True)

Edit: with try 3, a window of 5 and a std of 3, it finally catches the massive outlier, but it also loses accuracy on the other (non-faulty) sensor measurements: [plot: filtered series vs. raw measurements]

Are there any better approaches to detect the described 'outliers' or perform smoothing in timeseries data with the occasional sensor measurement issue?


Solution

  • Your abnormal values are abnormal in the sense that

    • the values deviate a lot from the values around them
    • the value changes very quickly from one time step to the next

    Thus what is needed is a filter that looks at a short time-context to filter these out.

    One of the simplest and most effective is the median filter.

    # pandas.rolling_median() has been removed; use the .rolling() accessor
    filtered = df.rolling(window=5, center=True).median()
    

    The longer the window, the stronger the filter.
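    As a quick sanity check, a toy series (values made up here, not the asker's data) shows the idea: the spike is replaced by a neighbourhood median, while normal samples pass through largely unchanged.

```python
import pandas as pd

# toy series with one sensor spike at index 3
s = pd.Series([100.0, 110.0, 120.0, 5000.0, 140.0, 150.0, 160.0])

# rolling median; min_periods=1 keeps the edges instead of producing NaN
filtered = s.rolling(window=5, center=True, min_periods=1).median()
# the 5000.0 spike becomes the median of its neighbourhood (140.0)
```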

    An alternative would be a low-pass filter. Though setting an appropriate cutoff frequency can be harder, and it will impose a smoothness onto the signal.
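    A minimal sketch of the low-pass idea, using exponential smoothing (a first-order low-pass filter); the synthetic data and the alpha value below are assumptions for illustration. scipy.signal offers sharper designs (e.g. a Butterworth filter with filtfilt for zero phase) if a proper cutoff frequency is needed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2023-06-01", periods=288, freq="5min")  # one day at 5 min

# synthetic PV-like curve: a daytime bump, noise, and one sensor glitch
t = np.linspace(0, np.pi, 288)
raw = np.clip(np.sin(t), 0, None) * 1000 + rng.normal(0, 20, 288)
raw[100] = 8000.0  # sensor glitch
s = pd.Series(raw, index=idx)

# first-order low-pass: smaller alpha = lower cutoff = smoother output
smoothed = s.ewm(alpha=0.2, adjust=False).mean()
```

    Note the trade-off mentioned above: the spike is attenuated but smeared into the following samples, and the whole signal is smoothed, not just the glitch.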

    One can of course create more custom filters as well. For example, compute the first-order difference, and reject changes higher than a certain threshold. You can plot a histogram of the differences to determine a threshold. Mark these as missing (NaN), and then impute the missing using median/mean.
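    A sketch of that difference-based filter on made-up data (the threshold would come from your histogram of differences): flag samples whose value jumps by more than the threshold both away from the previous sample and back toward the next one, mark them missing, and interpolate.

```python
import pandas as pd

# toy 5-minute series with two sensor glitches: a huge spike and a sudden 0
idx = pd.date_range("2023-06-01 10:00", periods=12, freq="5min")
s = pd.Series([100, 110, 120, 130, 5000, 150, 160, 0, 180, 190, 200, 210],
              index=idx, dtype=float)

threshold = 100  # pick this by inspecting a histogram of s.diff().abs()

# a glitch jumps away from its previous neighbour AND away from its next one;
# requiring a large difference in both directions avoids also flagging the
# (correct) sample that merely follows a glitch
glitch = (s.diff().abs() > threshold) & (s.diff(-1).abs() > threshold)

# mark glitches as missing, then impute from the neighbouring good samples
cleaned = s.mask(glitch).interpolate(method="time")
```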

    If your goal is Anomaly Detection, you can also use an Autoencoder. I would expect PV output to have a very strong daily pattern. So training it on daily sequences should work quite well (provided you have enough data). This is much more complicated than a simple filter, but has the advantage of being able to detect many other kinds of anomalies as well, not just the pattern identified here.