Tags: python, pandas, dataframe, time-series, outliers

How to replace the outliers with the 95th and 5th percentile in Python?


I am doing outlier treatment on my time series data: I want to replace values above the 95th percentile with the 95th percentile value, and values below the 5th percentile with the 5th percentile value. I have written some code, but it does not produce the desired result.

I am trying to create an outliertreatment function that uses a helper function called cut. The code is given below:

def outliertreatment(df,high_limit,low_limit):
    df_temp=df['y'].apply(cut,high_limit,low_limit, extra_kw=1)
    return df_temp
def cut(column,high_limit,low_limit):
    conds = [column > np.percentile(column, high_limit),
             column < np.percentile(column, low_limit)]
    choices = [np.percentile(column, high_limit),
            np.percentile(column, low_limit)]
    return np.select(conds,choices,column)  

I expect to pass the DataFrame to the outliertreatment function, with 95 as high_limit and 5 as low_limit. How can I achieve the desired result?


Solution

  • I'm not sure whether this is a suitable way to deal with outliers, but to achieve what you want, the clip function is useful. It sets values that fall outside the given bounds to the bound values. You can read more in the documentation.

    import numpy as np
    import pandas as pd
    data = pd.Series(np.random.randn(100))  # sample data
    data.clip(lower=data.quantile(0.05), upper=data.quantile(0.95))  # cap at the 5th/95th percentiles
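To connect this back to the question, here is a minimal sketch of how the outliertreatment function could be written around clip. It assumes the DataFrame has a numeric column named 'y' (as in the question) and that the limits are given as percents, so they are divided by 100 before being passed to quantile. The name outlier_treatment, the column parameter, and the sample data are illustrative, not part of the original code. One likely reason the original version does not work is that Series.apply calls the function once per element, so np.percentile only ever sees a single value. Computing the bounds on the whole column first avoids that.

```python
import numpy as np
import pandas as pd


def outlier_treatment(df, high_limit=95, low_limit=5, column="y"):
    """Clip df[column] to its low_limit-th and high_limit-th percentiles."""
    series = df[column]
    # quantile() expects fractions in [0, 1], so convert the percent limits.
    lower = series.quantile(low_limit / 100)
    upper = series.quantile(high_limit / 100)
    # Values below `lower` become `lower`; values above `upper` become `upper`.
    return series.clip(lower=lower, upper=upper)


# Hypothetical sample data with a column named 'y', as in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({"y": rng.normal(size=100)})
treated = outlier_treatment(df, high_limit=95, low_limit=5)
```

The function returns a new Series and leaves df unchanged. To keep the result alongside the original data, assign it to a new column, for example df['y_clipped'] = treated.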