python numpy outliers

Fixing outliers with a user-defined function


I am trying to write a function that fixes the outliers in a dataset: if a value is above the upper bound, it is replaced with the upper bound, and if it is below the lower bound, it is replaced with the lower bound. The function I created is listed below.

def fix_outliers(df):
    anomalies = []
    df_std = np.std(df)
    df_mean = np.mean(df)
    anomaly_cut_off = df_std * 3
    lower_limit  = df_mean - anomaly_cut_off 
    upper_limit = df_mean + anomaly_cut_off
    df=np.where(df > upper_limit, upper_limit, df)
    df=np.where(df < lower_limit, lower_limit, df)

The changes made inside the function are not reflected in my dataset. I am new to Python, and especially to functions. Any help would be appreciated. Thanks in advance.

Regards, Vin


Solution

  • Here is a suggestion:

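    import numpy as np

    def fix_outliers(df):
        # Work on a copy so the raw data is left untouched
        df_fx=df.copy()
        # nanmean/nanstd ignore NaN values
        df_fx_mean=np.nanmean(df_fx)
        df_fx_std=np.nanstd(df_fx)
    
        # Bounds at three standard deviations from the mean
        upper_limit=df_fx_mean+3*df_fx_std
        lower_limit=df_fx_mean-3*df_fx_std
    
        # Cap values that fall outside the bounds
        df_fx[df_fx>upper_limit]=upper_limit
        df_fx[df_fx<lower_limit]=lower_limit
    
        return df_fx
    
    df_fixed=fix_outliers(df)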
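
For context on why the function in the question seemed to do nothing: `np.where` builds a new array, and assigning that result to `df` inside the function only rebinds the local name. Since the function also returns nothing, the caller's object is never touched. A minimal sketch of the difference, using a made-up array (`data`, `rebind` and `mutate` are illustrative names, not from the question):

```python
import numpy as np

def rebind(arr):
    # np.where returns a new array; this only rebinds the local name
    arr = np.where(arr > 5, 5, arr)

def mutate(arr):
    # Boolean-mask assignment writes into the caller's array
    arr[arr > 5] = 5

data = np.array([1, 9, 3])

rebind(data)
after_rebind = data.copy()   # unchanged: [1, 9, 3]

mutate(data)
after_mutate = data.copy()   # capped in place: [1, 5, 3]
```

So the function either has to return the new object and the caller has to keep it, or it has to modify the object it was given.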
    

    I think it's better to create a copy of the dataframe and modify that, rather than losing the raw data. If you still want to modify df in place:

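    def fix_outliers(df):
        # nanmean/nanstd ignore NaN values
        df_mean=np.nanmean(df)
        df_std=np.nanstd(df)
    
        # Bounds at three standard deviations from the mean
        upper_limit=df_mean+3*df_std
        lower_limit=df_mean-3*df_std
    
        # Mask assignment changes df itself, so nothing is returned
        df[df>upper_limit]=upper_limit
        df[df<lower_limit]=lower_limit
    
    
    fix_outliers(df)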
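
The same capping can also be written with NumPy's `np.clip`, which limits values to a given interval. This is only a sketch, assuming the data is a numeric NumPy array; the name `clip_outliers` and the `n_std` parameter are illustrative choices, not part of the answer above:

```python
import numpy as np

def clip_outliers(values, n_std=3):
    """Return a new array capped at mean +/- n_std standard deviations."""
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values)   # NaN values are ignored
    std = np.nanstd(values)
    # np.clip returns a new array, so the input is not modified
    return np.clip(values, mean - n_std * std, mean + n_std * std)

# Made-up sample: twenty ordinary values and one extreme value
sample = np.array([10.0] * 20 + [1000.0])
capped = clip_outliers(sample)
```

As with the copy-based version, the result has to be assigned to a name by the caller; `sample` itself is left as it was.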