I am trying to write a function that fixes the outliers in a dataset: if a value is above the upper bound, it is replaced with the upper bound, and if it is below the lower bound, it is replaced with the lower bound. The function I created is listed below.
def fix_outliers(df):
    anomalies = []
    df_std = np.std(df)
    df_mean = np.mean(df)
    anomaly_cut_off = df_std * 3
    lower_limit = df_mean - anomaly_cut_off
    upper_limit = df_mean + anomaly_cut_off
    df = np.where(df > upper_limit, upper_limit, df)
    df = np.where(df < lower_limit, lower_limit, df)
The changes made inside the function are not reflected in my dataset. I am new to Python, and especially to functions. Any help would be appreciated. Thanks in advance.
Regards, Vin
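Your function only rebinds the local name df to new arrays and never returns anything, so the caller's data stays unchanged. Here is a suggestion: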
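import numpy as np

def fix_outliers(df):
    # Work on a copy so the caller's data is left untouched.
    df_fx = df.copy()
    # nanmean/nanstd ignore missing values when computing the limits.
    df_fx_mean = np.nanmean(df_fx)
    df_fx_std = np.nanstd(df_fx)
    upper_limit = df_fx_mean + 3 * df_fx_std
    lower_limit = df_fx_mean - 3 * df_fx_std
    # Cap values outside the limits at the nearest limit.
    df_fx[df_fx > upper_limit] = upper_limit
    df_fx[df_fx < lower_limit] = lower_limit
    return df_fx

df_fixed = fix_outliers(df)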
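As a side note, capping values at two limits is what NumPy calls clipping, so the two masked assignments can also be written with np.clip. This is only a minimal sketch, assuming the data is a one-dimensional NumPy array; the function name clip_outliers, the n_std parameter and the sample values are made up for illustration.

```python
import numpy as np

def clip_outliers(values, n_std=3.0):
    """Return a new array clipped to mean +/- n_std standard deviations (NaNs ignored)."""
    mean = np.nanmean(values)
    std = np.nanstd(values)
    # np.clip returns a new array here; the input is not modified.
    return np.clip(values, mean - n_std * std, mean + n_std * std)

# Hypothetical sample: twenty ordinary readings and one extreme value.
data = np.array([10.0] * 20 + [1000.0])
fixed = clip_outliers(data)

print(data.max())   # the original array is untouched
print(fixed.max())  # the extreme value is pulled down to the upper limit
```

Note that with very small samples a single extreme value may never exceed three standard deviations, because it inflates the standard deviation itself; that is why the sample above uses 21 points.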
I think it's better to create a copy of the dataframe and modify that, rather than losing the raw data. If you still want to modify df itself:
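def fix_outliers(df):
    # Modifies df in place; nothing is returned.
    df_mean = np.nanmean(df)
    df_std = np.nanstd(df)
    upper_limit = df_mean + 3 * df_std
    lower_limit = df_mean - 3 * df_std
    # Masked assignment changes the object the caller passed in.
    df[df > upper_limit] = upper_limit
    df[df < lower_limit] = lower_limit

fix_outliers(df)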
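To see why the in-place version changes the caller's data while assigning to the parameter (as in the question's df = np.where(...)) does not, here is a small sketch. It uses plain Python lists to stay dependency-free; the same distinction applies to NumPy arrays and DataFrames. The names rebind, mutate and data are hypothetical.

```python
def rebind(values, limit):
    # Assigning to the parameter creates a new local list;
    # the list the caller passed in is left as it was.
    values = [min(v, limit) for v in values]

def mutate(values, limit):
    # Index assignment changes elements of the very list the caller passed in.
    for i, v in enumerate(values):
        if v > limit:
            values[i] = limit

data = [1, 2, 50]

rebind(data, 10)
after_rebind = list(data)   # snapshot after the rebinding version

mutate(data, 10)
after_mutate = list(data)   # snapshot after the mutating version

print(after_rebind)
print(after_mutate)
```

Only the second call changes data, which is the same reason df[mask] = value works inside a function while df = ... needs a return statement.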