Tags: python, pandas, rolling-computation, moving-average, anomaly-detection

How to detect outliers in data using a sliding IQR in Python/pandas?


I've been working on a project where I am trying to detect anomalies and relate them to a certain phenomenon. I know that pandas has built-in functions, i.e. df.rolling(window=frequency).statistic_of_my_choice(), but for some reason I am not getting the desired results. I have calculated the rolling mean, the rolling median, and upper and lower bounds as rolling mean ± 1.6 × rolling std.

But when I plot it, the upper and lower bounds are always above the data. I don't know what's happening here; it doesn't make sense. Please take a look at the figures for a better understanding.

Here's what I am getting:

[Image: what I am getting]

and here's what I want to achieve:

[Image: what I want to achieve]

Here's the paper that I am trying to implement: https://www.researchgate.net/publication/374567172_Analysis_of_Ionospheric_Anomalies_before_the_Tonga_Volcanic_Eruption_on_15_January_2022/figures

Here's my code snippet:

def gen_features(df):
    
    df["ma"] = df.TEC.rolling(window="h").mean()
    df["mstd"] = df.TEC.rolling(window="h").std()
    df["upper"] = df["ma"] + (1.6* df.mstd)
    df["lower"] = df["ma"] - (1.6* df.mstd)
    
    return df 

Solution

  • From the publication:

    "Since the solar activity cycle is 27 days, this paper uses 27 days as the sliding window to detect the ionospheric TEC perturbation condition before the volcanic eruption. The upper bound of TEC anomaly is represented as UB = Q2 + 1.5 IQR and the lower bound as LB = Q2 − 1.5 IQR"

    Implementing this in pandas:

    # no random seed is set, so every run produces different data
    dataLength = 1000  # number of samples
    data = np.random.randint(1, 100, dataLength)  # random integers in [1, 100)
    outlierPercentage = 1  # percentage of samples to overwrite as outliers
    outlierCount = int(dataLength / 100 * outlierPercentage)  # number of outliers
    outlierIdx = np.random.choice(dataLength, outlierCount, replace=False)  # distinct random positions for the outliers
    data[outlierIdx] = np.random.randint(-300, 300, outlierCount)  # overwrite them with random integers in [-300, 300)
    df = pd.DataFrame({'Data': data})  # build the DataFrame
    winSize = 5  # rolling window size, in samples
    # rolling statistics
    # (note: the paper centers the bounds on Q2, the median; this example uses the rolling mean)
    Mean = df["Data"].rolling(window=winSize).mean()
    Q1 = df["Data"].rolling(window=winSize).quantile(0.25)
    Q3 = df["Data"].rolling(window=winSize).quantile(0.75)
    IQR = Q3 - Q1
    # upper and lower limits
    df["UL"] = Mean + 1.5 * IQR
    df["LL"] = Mean - 1.5 * IQR
    # index labels of the points outside the limits
    outliersAboveUL = df[(df['Data'] > df['UL'])].index
    outliersBelowLL = df[(df['Data'] < df['LL'])].index
    

    Plotting gives you this:

    [Image: plot]
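
The plotting code itself is not shown in the answer. A minimal sketch of how such a figure could be drawn with matplotlib follows; the seeded generator, the two hand-placed outliers, and the colors are assumptions made only to keep the example small and reproducible, not the answer's actual plotting code.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data (assumption): 200 random values with two hand-placed outliers.
rng = np.random.default_rng(0)
df = pd.DataFrame({"Data": rng.integers(1, 100, 200).astype(float)})
df.loc[[50, 120], "Data"] = [300.0, -250.0]

# Same statistics as in the answer: rolling mean +/- 1.5 * rolling IQR.
roll = df["Data"].rolling(window=5)
mean = roll.mean()
iqr = roll.quantile(0.75) - roll.quantile(0.25)
df["UL"] = mean + 1.5 * iqr
df["LL"] = mean - 1.5 * iqr

# Index labels of the points outside the limits.
above = df.index[df["Data"] > df["UL"]]
below = df.index[df["Data"] < df["LL"]]

# Draw the series, both limits, and mark the detected points.
fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(df.index, df["Data"], label="Data")
ax.plot(df.index, df["UL"], label="UL")
ax.plot(df.index, df["LL"], label="LL")
ax.scatter(above, df.loc[above, "Data"], color="red", zorder=3, label="above UL")
ax.scatter(below, df.loc[below, "Data"], color="purple", zorder=3, label="below LL")
ax.legend()
fig.tight_layout()
# plt.show()  # uncomment when running as a script
```

With a window as short as 5 samples, ordinary points can also fall outside the limits, so the scatter markers will usually include more than the two injected values.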

    Imported packages:

    import pandas as pd
    # Jupyter magic; drop the next line when running outside a notebook
    %matplotlib notebook
    import matplotlib.pyplot as plt
    import numpy as np
    

    As you can see, this is a very basic example. I mainly added the correct calculation of the IQR. If you want a more detailed answer, I would need a sample of your data...
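
The quoted formula centers the bounds on Q2 (the median), whereas the example above centers them on the rolling mean. A minimal sketch of the median-centered variant is given here; the helper name `rolling_iqr_bounds`, its parameter `k`, and the toy series are hypothetical and only illustrate the formula.

```python
import pandas as pd

def rolling_iqr_bounds(series, window, k=1.5):
    """Return rolling Q2 with LB = Q2 - k*IQR and UB = Q2 + k*IQR."""
    roll = series.rolling(window=window)
    q1 = roll.quantile(0.25)
    q2 = roll.median()
    q3 = roll.quantile(0.75)
    iqr = q3 - q1
    return pd.DataFrame({"Q2": q2, "LB": q2 - k * iqr, "UB": q2 + k * iqr})

# Toy series (assumption): one obvious spike at position 5.
s = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 80.0, 11.0, 12.0])
bounds = rolling_iqr_bounds(s, window=5)

# Points outside [LB, UB]; rows with NaN bounds compare as False.
flagged = s[(s > bounds["UB"]) | (s < bounds["LB"])]
print(flagged)
```

Because the median barely moves when a single spike enters the window, the bounds stay close to the bulk of the data, while a mean-centered band is pulled toward the spike.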

    V2.0: with data from OP

    This is what I currently have, using the same approach:

    df = pd.read_csv("airaStation.csv", index_col=0, parse_dates=True)
    winSize = "29D"  # time-based rolling window; note that the paper quoted above uses 27 days
    # the statistics calculations...
    Mean = df["TEC"].rolling(window=winSize).mean()
    Q1 = df["TEC"].rolling(window=winSize).quantile(0.25)
    Q3 = df["TEC"].rolling(window=winSize).quantile(0.75)
    IQR = Q3 - Q1
    # assigning the upper limit and lower limit
    df["UL"] = Mean + 1.5 * IQR
    df["LL"] = Mean - 1.5 * IQR
    # detect the outliers
    outliersAboveUL = df[(df['TEC'] > df['UL'])].index
    outliersBelowLL = df[(df['TEC'] < df['LL'])].index
    

    The plot:

    [Image: results]
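
One detail worth checking with a time-based window: when the window is given as an offset such as "27D", pandas defaults `min_periods` to 1, so the first bounds are computed from very few samples. The sketch below assumes daily synthetic data (the OP's CSV is not available here) and sets `min_periods` so that bounds only appear once a full 27-day window is available; the series values and the injected spike are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily "TEC" series (assumption): smooth oscillation plus one spike.
idx = pd.date_range("2021-11-01", periods=90, freq="D")
tec = pd.Series(20.0 + 2.0 * np.sin(np.arange(90) / 3.0), index=idx, name="TEC")
tec.iloc[60] += 30.0  # injected anomaly

# 27-day sliding window; require 27 samples before producing a value.
roll = tec.rolling(window="27D", min_periods=27)
q1 = roll.quantile(0.25)
q2 = roll.median()
q3 = roll.quantile(0.75)
iqr = q3 - q1

# Bounds as in the quoted paper: Q2 +/- 1.5 * IQR.
ub = q2 + 1.5 * iqr
lb = q2 - 1.5 * iqr

# Timestamps where the series leaves the band.
anomalies = tec[(tec > ub) | (tec < lb)]
print(anomalies)
```

For data sampled more often than once per day, `min_periods` would need to be scaled to the number of samples expected in 27 days.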