
Removing outliers and calculating a trimmed mean in Python for multiple columns with different number of actual values


I have a dataset with, let's say, 10010 rows and 100 columns. Column values may include NaN, and the number of NaNs can differ from column to column.

I want to:

  • pick n columns from this dataset (let's say 20, in no particular order, e.g. Column1, Column2, etc.).
  • trim outliers (the highest 2.5% and the lowest 2.5% of each selected column), excluding NaN values. So if 10 of the 10010 values in Column1 are NaN, I need to trim the 250 highest and the 250 lowest of the remaining 10000 actual values.
  • But if Column2 initially has 110 NaNs, I want to trim 2.5% from each side based on its actual number of values (9900 in this case, not 10000 as in Column1).
  • calculate the trimmed mean for each of the selected columns.
  • end up with a new dataset in which all trimmed outliers have been converted to NaN.

Solution

  • The simplified example below shows an approach which might be useful; it uses the pandas Series.quantile method. The code could be adapted to your requirement, in particular by changing the quantile values (for instance 0.025 and 0.975 to cut 2.5% from each tail).

    import pandas as pd
    
    df = pd.DataFrame({'col1': [1, 2, 3, 4, None, 6, 7, 8, 54],
                       'col2': [3, 5, 13, 14, 2, 16, 17, 18, 19]})
    
    cols = ['col1', 'col2']
    for col in cols:
        # quantile() skips NaN, so the thresholds use only the actual values
        lo = df[col].quantile(0.1)
        hi = df[col].quantile(0.9)
        # keep values strictly between lo and hi; everything else becomes NaN
        df[col] = df[col].where((df[col] > lo) & (df[col] < hi))
        print(f'mean for {col} is: ', df[col].mean().round(2))
    
    print(df)
    

    gives:

    mean for col1 is:  5.0
    mean for col2 is:  12.29
    
       col1  col2
    0   NaN   3.0
    1   2.0   5.0
    2   3.0  13.0
    3   4.0  14.0
    4   NaN   NaN
    5   6.0  16.0
    6   7.0  17.0
    7   8.0  18.0
    8   NaN   NaN
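
The same quantile idea can also be applied to all selected columns at once, without an explicit loop. The following is only a sketch under assumptions: the toy data, the extra column `col3`, and the names `trimmed` and `trimmed_means` are illustrative and not part of the original answer. It relies on `DataFrame.quantile` skipping NaN per column, so each column's thresholds depend on its own actual values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; `col3` stands in for a column that is not selected.
df = pd.DataFrame({
    "col1": [1, 2, 3, 4, np.nan, 6, 7, 8, 54],
    "col2": [3, 5, 13, 14, 2, 16, 17, 18, 19],
    "col3": [9, 9, 9, 9, 9, 9, 9, 9, 9],
})

cols = ["col1", "col2"]          # the selected columns
lower_q, upper_q = 0.1, 0.9      # e.g. 0.025 and 0.975 for 2.5% per tail

sub = df[cols]
lo = sub.quantile(lower_q)       # one threshold per column, NaN skipped
hi = sub.quantile(upper_q)

# gt/lt align the per-column thresholds with the column names.
inside = sub.gt(lo) & sub.lt(hi)

trimmed = df.copy()              # leave the original data untouched
trimmed[cols] = sub.where(inside)    # values outside the thresholds become NaN
trimmed_means = trimmed[cols].mean() # mean() skips NaN by default

print(trimmed_means.round(2))
print(trimmed)
```

Working on a copy keeps the original values available for comparison, and unselected columns pass through unchanged.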
    

    The code above uses a value threshold to change outliers to NaN; this would be the usual approach. If the requirement is instead to change a fixed number of values at each extreme, this can be done by sorting by value, saving the sorted index, setting the required proportion at each end to NaN, and then restoring the original order using the saved index. The code below assumes that the default numerical index is in use; if not, the user's index would need to be saved first and re-instated at the end.

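    # `df` is a DataFrame such as the one in the first example, with the default integer index
    cut_val = 0.2     # proportion of non-NaN values to remove from each extreme
    num_rows = len(df)
    
    cols = ['col1', 'col2']
    for col in cols:
        num_not_nan = num_rows - df[col].isna().sum()
        cut = int(num_not_nan * cut_val)               # number of values to drop at each end
        dfx = df[col].astype(float).sort_values()      # ascending, NaN last; float so NaN can be stored
        idx = dfx.index.to_list()                      # save sorted index
        dfx.index = range(num_rows)                    # positional labels so .loc slices by sorted position
        dfx.loc[0:cut-1] = None                        # lowest `cut` values (.loc slices are inclusive)
        dfx.loc[num_not_nan-cut:num_not_nan-1] = None  # highest `cut` non-NaN values
        dfx.index = idx                                # restore the saved index labels
        df[col] = dfx.sort_index()                     # back to the original row order
        print(f'mean for {col} is: ', df[col].mean().round(2))
    
    print(df)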
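
As an alternative to sorting and re-indexing, the count-based trim can be sketched with ranks. This is not the answer's own method: the helper name `trim_by_count` and its parameters are hypothetical, and rounding the per-tail count down with `int()` is an assumption (for 9900 values at 2.5% this would drop 247 from each side). `Series.rank(method="first")` gives the non-NaN values ranks 1..n and leaves NaN unranked, so each column is trimmed according to its own number of actual values.

```python
import numpy as np
import pandas as pd

def trim_by_count(df, cols, frac):
    """Hypothetical helper: set the lowest and highest `frac` of the
    non-NaN values in each selected column to NaN.

    Returns (trimmed copy of df, Series of trimmed means)."""
    out = df.copy()
    for col in cols:
        s = out[col].astype(float)      # float dtype so NaN can be stored
        n = s.notna().sum()             # actual number of values in this column
        k = int(n * frac)               # values to drop per tail (rounded down)
        r = s.rank(method="first")      # ranks 1..n; ties broken by position; NaN stays NaN
        keep = (r > k) & (r <= n - k)   # False for the k lowest, k highest, and NaN
        out[col] = s.where(keep)
    return out, out[cols].mean()

# Same toy data as in the examples above.
df = pd.DataFrame({"col1": [1, 2, 3, 4, np.nan, 6, 7, 8, 54],
                   "col2": [3, 5, 13, 14, 2, 16, 17, 18, 19]})

trimmed, means = trim_by_count(df, ["col1", "col2"], frac=0.2)
print(means.round(2))
print(trimmed)
```

Because the ranks follow the original row labels, no index has to be saved or restored, and the input DataFrame is left unchanged.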