Search code examples
pandasoutliersgroup

Pandas finding and replacing outliers based on a group of two columns


I'm having a bit of trouble finding outliers in a df based on groups and dates.

For exampe I have a df like and I would like to find and replace the outlier values (10 for the group A on date 2022-06-27 and 20 for the group B on 2022-06-27) with the median of the respective group (3 for the first outliers and 4 for the second).

However I'm having some trouble filtering the data and isolating the outliers and replacing them.

 index = [0,1,2,3,4,5,6,7,8,9,10,11]
s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
               '2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
df = pd.DataFrame(s,columns = ['group'])
df['date'] = t
df['vale'] = r

print (df)


     group        date  val
0      A  2022-06-28    1
1      A  2022-06-28    2
2      A  2022-06-28    1
3      A  2022-06-27    2
4      A  2022-06-27    3
5      A  2022-06-27   10
6      B  2022-06-28    2
7      B  2022-06-28    3
8      B  2022-06-28    2
9      B  2022-06-27    3
10     B  2022-06-27    4
11     B  2022-06-27   20

Thanks for the help!


Solution

  • First you can identify outliers. This code identifies any values that are greater than one standard deviation away from the mean.

    outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index

    Then you can determine the median of each group:

    medians = df.groupby('group')['value'].median()

    Finally, locate the outliers and replace with the medians:

    df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].to_list()

    All together it looks like:

    import pandas as pd
    index = [0,1,2,3,4,5,6,7,8,9,10,11]
    s = pd.Series(['A','A','A','A','A','A','B','B','B','B','B','B'],index= index)
    t = pd.Series(['2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27',
                   '2022-06-28','2022-06-28','2022-06-28','2022-06-27','2022-06-27','2022-06-27'],index= index)
    r = pd.Series([1,2,1,2,3,10,2,3,2,3,4,20],index= index)
    df = pd.DataFrame(s,columns = ['group'])
    df['date'] = t
    df['value'] = r
    outliers = df.loc[(df.value - df.value.mean()).abs() > df.value.std() * 1].index
    medians = df.groupby('group')['value'].median()
    df.loc[outliers, 'value'] = medians.loc[df.loc[outliers, 'group']].values
    

    Output:

       group        date  value
    0      A  2022-06-28      1
    1      A  2022-06-28      2
    2      A  2022-06-28      1
    3      A  2022-06-27      2
    4      A  2022-06-27      3
    5      A  2022-06-27      2
    6      B  2022-06-28      2
    7      B  2022-06-28      3
    8      B  2022-06-28      2
    9      B  2022-06-27      3
    10     B  2022-06-27      4
    11     B  2022-06-27      3