Search code examples
pythonpandasdataframedata-cleaningoutliers

Outlier detection of time-series data


I have a pandas dataframe where I want to detect outliers on a single column. Please bear in mind that I am not experienced when it comes to data handling/cleaning.

The dataframe looks like this:

Time MW
2019-01-01 00:00:00 1234.0
2019-01-01 01:00:00 1234.5
2019-01-01 02:00:00 1235.2
2019-01-01 03:00:00 1235.1

You can see on the figure below that I do have some outliers, most noticably the one right above the 08:00:00 mark.

enter image description here

What I want is to detect these outliers and, instead of removing them, I want to modify them to be the mean of the 10 values that come before it (So the value of row 11 would be changed to the mean value of rows 1-10).

I have looked into calculating the z-score and finding outliers based on that, but it seems to focus on the standard deviation of the total dataset, instead of only a local range. You can see the code that I wrote for this below.

def dfCleaning(df):
    df.reset_index(drop=True, inplace=True)
    z = np.abs(stats.zscore(df['MW']))

    for x in np.where((z > 3) | (z < -3)):
        for index in x:
            df.loc[index, 'MW'] = df.iloc[max(0, index-11) : max(1,index-1)]['MW'].mean()

    return df

So if there is a way to detect outliers based on a range of rows in a dataset (perhaps 50 rows above and below the row in focus), I would highly appreciate if someone could explain it to me.


Solution

  • You can use:

    def zscore(s, window, thresh=3, return_all=False):
        roll = s.rolling(window=window, min_periods=1, center=True)
        avg = roll.mean()
        std = roll.std(ddof=0)
        z = s.sub(avg).div(std)   
        m = z.between(-thresh, thresh)
        
        if return_all:
            return z, avg, std, m
        return s.where(m, avg)
    
    df['MW2'] = zscore(df['MW'], window=50)
    

    full example:

    N = 1000
    np.random.seed(1)
    df = pd.DataFrame({'MW': np.sin(np.linspace(0, 10, num=N))+np.random.normal(scale=0.6, size=N)})
    
    z, avg, std, m = zscore(df['MW'], window=50, return_all=True)
    
    ax = plt.subplot()
    
    df['MW'].plot(label='data')
    avg.plot(label='mean')
    df.loc[~m, 'MW'].plot(label='outliers', marker='o', ls='')
    avg[~m].plot(label='replacement', marker='o', ls='')
    plt.legend()
    

    Output:

    enter image description here

    With thresh=2:

    enter image description here