Search code examples
pythonpython-3.xpandaslambdavectorization

How to replace the outlier with 3 standard deviation value for all column based on group by column?


Hi I have data frame in which I want to replace/cap the outlier with the 3*standard deviation value for all column with group by for each column. For example:

df = pd.DataFrame({"A":["A", "A", "A", "A", "B","B","B","B","B","B","B","B","B","B","B","B"], 
                   "B":[7, 2, 54, 3, 5,23,5,7,7,7,7,7,7,7,6,7], 
                   "C":[20, 16, 11, 3, 8,5,5,20,6,6,6,6,6,5,6,6], 
                   "D":[14, 3, 32, 2, 6,5,6,20,4,5,4,5,4,5,5,5],
                }) 

feature=['B','C','D']

mean = df.groupby('A')[feature].mean()
std = df.groupby('A')[feature].std()

now I want to replace outlier for each column in feature with appropriate standard deviation for that group.

Something like below but for each group and each column

for col in feature:
 for each in df['A'].unique():
  m=mean.loc[each,col]
  s=std.loc[each,col]
  df.loc[each,df[col]< m-3*s,]=m-3*s

Expected output:

enter image description here

I have many column and loop is time consuming. Is there any better way or can it be done with one loop?


Solution

  • You can do this in a vectorial way using a groupby.transform:

    cols = ['B', 'C', 'D']
    
    g = df.groupby('A')[cols].transform
    mean = g('mean')
    std = g('std')
    
    tmp = mean-3*std
    df[df[cols]< tmp] = tmp
    

    Output:

       A   B   C   D
    0  A   7  20  14
    1  B   2  16   3
    2  A  54  11  35
    3  A   3   3   2
    4  B   5   8   6
    5  B  23   5   5
    

    Intermediate tmp:

              B          C          D
    0 -63.74898 -14.181368 -33.109879
    1 -24.07345  -7.392055   0.084091
    2 -63.74898 -14.181368 -33.109879
    3 -63.74898 -14.181368 -33.109879
    4 -24.07345  -7.392055   0.084091
    5 -24.07345  -7.392055   0.084091