I am trying to apply an outlier treatment to my time series data: values above the 95th percentile should be replaced with the 95th percentile value, and values below the 5th percentile with the 5th percentile value. I have written some code, but it does not produce the desired result.
I am trying to create an OutlierTreatment function that uses a helper function called Cut. The code is given below:
def outliertreatment(df,high_limit,low_limit):
    df_temp=df['y'].apply(cut,high_limit,low_limit, extra_kw=1)
    return df_temp

def cut(column,high_limit,low_limit):
    conds = [column > np.percentile(column, high_limit),
             column < np.percentile(column, low_limit)]
    choices = [np.percentile(column, high_limit),
               np.percentile(column, low_limit)]
    return np.select(conds,choices,column)
I want to pass the dataframe, 95 as high_limit, and 5 as low_limit to the OutlierTreatment function. How can I achieve the desired result?
I'm not sure whether this approach is a suitable way to deal with outliers, but to achieve what you want, the clip method is useful. It replaces values outside the given boundaries with the boundary values. You can read more in the pandas documentation.
import numpy as np
import pandas as pd
data = pd.Series(np.random.randn(100))
data.clip(lower=data.quantile(0.05), upper=data.quantile(0.95))
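As for why the original attempt fails: Series.apply calls the function once per element, so cut receives a single scalar rather than the whole column, and extra positional arguments have to be passed through args=(...). A minimal sketch of the function from the question rewritten around clip is given below. The name outlier_treatment, the column parameter, and the example data are assumptions made for illustration; the limits are taken on a 0-100 scale, as in the question.

```python
import numpy as np
import pandas as pd


def outlier_treatment(df, high_limit, low_limit, column="y"):
    """Return df[column] with values clipped to the given percentiles.

    high_limit and low_limit are percentiles on a 0-100 scale (e.g. 95 and 5).
    """
    series = df[column]
    # Series.quantile expects a fraction in [0, 1], so convert from percent.
    upper = series.quantile(high_limit / 100)
    lower = series.quantile(low_limit / 100)
    # Values above `upper` become `upper`; values below `lower` become `lower`.
    return series.clip(lower=lower, upper=upper)


# Hypothetical example data: the values 1..100 plus two extreme points.
df = pd.DataFrame(
    {"y": np.concatenate([np.arange(1.0, 101.0), [1000.0, -1000.0]])}
)
treated = outlier_treatment(df, 95, 5)
```

Here treated is a new Series of the same length as df['y']; the original dataframe is left unchanged, so the result can be assigned back with df['y'] = treated if that is what is wanted.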