Search code examples
pythonpandasvectorizationstandard-deviation

Python Pandas calculate standard deviation excluding current group, with vectorization solution


So i want to calculate standard deviation excluding current group using groupby. Here an example of the data:

import pandas as pd
df = pd.DataFrame ({
                      'group' : ['A','A','A','A','A','A','B','B','B','B','B','B'],
                      'team'  : ['1','1','2','2','3','3','1','1','2','2','3','3',]
                      'value' : [1,2,5,7,2,3,7,8,8,9,6,4]
                  })

For example, for group A team 1, i want to calculate the std dev of team 2 and 3, for group A team 2, i want to calculate the std dev of group 1 and 3, and so on.

I managed to do it using groupby and apply but when using it on real data with literally milion of rows, it takes too long. So i am looking for a solution with vectorization.

def std(row, data):
   data = data.loc[data['group']==row['group]]
   return data.groupby(['team']).filter(lambda x:(x['tool]!=row['team']).all())['value'].std()
df['std_exclude'] = df.apply(lambda x: std(data=df),axis=1)

Solution

  • You can use transform after combining group and team as a list:

    df['std'] = (df.assign(new=df[['group', 'team']].values.tolist())['new'].transform(
                 lambda x: df[df['group'].eq(x[0]) & df['team'].ne(x[1])]['value'].std())) 
    

    Output:

    group team value    std
    0   A   1   1   2.217356
    1   A   1   2   2.217356
    2   A   2   5   0.816497
    3   A   2   7   0.816497
    4   A   3   2   2.753785
    5   A   3   3   2.753785
    6   B   1   7   2.217356
    7   B   1   8   2.217356
    8   B   2   8   1.707825
    9   B   2   9   1.707825
    10  B   3   6   0.816497
    11  B   3   4   0.816497
    

    There are some equal std values across different groups but you can verify that their std values are indeed equal.