Search code examples
pythonpandasoutliersrolling-computationanomaly-detection

How to realign Python fill_between with data points


I have been working on a project that involves time-series total electron content data. My goal is to apply statistical analysis and find anomalies in TEC due to earthquake.

I am following this research paper for sliding IQR method. But for some reason I am not getting results as shown in the paper with the given formulas. So I decided to use the rolling mean + 1.6 STD method.

The problem is when I use ax.fill_between(x1 = index, y1 = ub, y2=lb) method my confidence interval band is being plotted a step further than the data point. Please see the given figure for a better understanding.

Here's what I am currently doing:

df = DAEJ.copy()
window = 12
hourly = df.resample(rule="h").median()
hourly["ma"] = hourly["TEC"].rolling(window=window).mean()
hourly["hour"] = hourly.index.hour
hourly["std_err"] = hourly["TEC"].rolling(window=window).std()
hourly["ub"] = hourly["ma"] + (1.67* hourly["std_err"])
hourly["lb"] = hourly["ma"] - (1.67* hourly["std_err"])
hourly["sig2"] = hourly["TEC"].rolling(window=window).var()
hourly["kur"] = hourly["TEC"].rolling(window=window).kurt()
hourly["pctChange"] = hourly.TEC.pct_change(12, fill_method="bfill")
hourly = hourly.dropna()
dTEC = hourly[(hourly["TEC"] > hourly["ub"])]



fig, ax = plt.subplots(figsize=(12,4))
hourly["TEC"].plot(ax=ax, title="TEC Anomaly", label="Station: DAEJ")
ax.fill_between(x=hourly.index, y1=hourly["ub"], y2=hourly["lb"], color="red", label="Conf Interval", alpha=.4)
ax.legend()

And here's the result I got:

TEC Anomaly detection using rolling std method

As seen in the given figure the data and the colored band aren't aligned properly. I know after calculating a rolling mean with a window of 12 hours will result in a 12 hour shifted value but even after dropping the first values I am still not getting an aligned figure.


Solution

  • You need to center your rolling windows with center=True:

    window = 12
    hourly = df.resample(rule="h").median()
    hourly["ma"] = hourly["TEC"].rolling(window=window, center=True).mean()
    hourly["hour"] = hourly.index.hour
    hourly["std_err"] = hourly["TEC"].rolling(window=window, center=True).std()
    hourly["ub"] = hourly["ma"] + (1.67* hourly["std_err"])
    hourly["lb"] = hourly["ma"] - (1.67* hourly["std_err"])
    hourly["sig2"] = hourly["TEC"].rolling(window=window, center=True).var()
    hourly["kur"] = hourly["TEC"].rolling(window=window, center=True).kurt()
    hourly["pctChange"] = hourly.TEC.pct_change(12, fill_method="bfill")
    hourly = hourly.dropna()
    dTEC = hourly[(hourly["TEC"] > hourly["ub"])]
    

    Alternatively, what you tried to do should have been done with shift:

    # your original code
    # ...
    
    # then shift
    ax.fill_between(x=hourly.index, y1=hourly["ub"].shift(-window//2),
                    y2=hourly["lb"].shift(-window//2),
                    color="red", label="Conf Interval", alpha=.4)
    

    Output:

    enter image description here