Tags: python, pandas, standardized

Standardize variable by group - why is the mean always zero?


I have the following data:

import pandas as pd

df = pd.DataFrame({'sound': ['A', 'B', 'B', 'A', 'B', 'A'],
                   'score': [10, 5, 6, 7, 11, 1]})
print(df)

  sound  score
0     A     10
1     B      5
2     B      6
3     A      7
4     B     11
5     A      1

If I standardize the score variable (i.e. convert it to Z scores), I get the following values. The mean of the new z column is 0 up to floating-point error and its SD is 1, both of which are expected for a standardized variable:

df['z'] = (df['score'] - df['score'].mean())/df['score'].std()
print(df)
print('Mean: {}'.format(df['z'].mean()))
print('SD: {}'.format(df['z'].std()))

  sound  score         z
0     A     10  0.922139
1     B      5 -0.461069
2     B      6 -0.184428
3     A      7  0.092214
4     B     11  1.198781
5     A      1 -1.567636
Mean: -7.401486830834377e-17
SD: 1.0

However, what I'm actually interested in is calculating Z scores based on group membership (sound). For example, if a score is from sound A, then convert that value to a Z score using the mean and SD of sound A values only. Likewise, sound B Z scores will only use mean and SD from sound B. This will obviously produce different values compared to regular Z score calculation:

df['zg'] = df.groupby('sound')['score'].transform(lambda x: (x - x.mean()) / x.std())
print(df)
print('Mean: {}'.format(df['zg'].mean()))
print('SD: {}'.format(df['zg'].std()))

  sound  score         z        zg
0     A     10  0.922139  0.872872
1     B      5 -0.461069 -0.725866
2     B      6 -0.184428 -0.414781
3     A      7  0.092214  0.218218
4     B     11  1.198781  1.140647
5     A      1 -1.567636 -1.091089
Mean: 3.700743415417188e-17
SD: 0.894427190999916

My question is: why is the mean of the group-based standardized values (zg) also basically equal to 0? Is this expected behaviour or is there an error in my calculation somewhere?

The z scores make sense because standardizing within a variable essentially forces the mean to 0. But the zg values are calculated using different means and SDs for each sound group, so I'm not sure why the mean of that new variable has also been set to 0.

The only situation where I can see this happening is if the positive values and the negative values sum to the same magnitude, so that they cancel out to 0 when averaged. This happens in a regular Z score calculation, but I'm surprised that it also happens when operating across multiple groups like this...


Solution

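  • I think it makes perfect sense. If E[abc | def] is the expectation (mean) of abc given def, then standardizing within each group forces each group's mean to 0, so in df['zg']:

    m1 = E[zg | sound = 'A'] = (0.872872 + 0.218218 - 1.091089)/3 ≈ 0

    m2 = E[zg | sound = 'B'] = (-0.725866 - 0.414781 + 1.140647)/3 ≈ 0

    and, because both groups have 3 rows,

    E[zg] = (m1 + m2)/2 = (0.872872 + 0.218218 - 1.091089 - 0.725866 - 0.414781 + 1.140647)/6 ≈ 0

    With unequal group sizes n1 and n2, the overall mean is the size-weighted average (n1*m1 + n2*m2)/(n1 + n2), which is still 0 because every group mean is 0.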
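
A quick way to check this numerically, as a minimal sketch on the data from the question. The names group_means, group_sizes, weighted and expected_sd are only illustrative. It also suggests why the SD of zg comes out as 0.894 rather than 1: assuming the default sample SD (ddof=1), each group's squared zg values sum to n_g - 1, so the pooled variance is (n - k)/(n - 1) for n rows in k groups, here 4/5.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'sound': ['A', 'B', 'B', 'A', 'B', 'A'],
                   'score': [10, 5, 6, 7, 11, 1]})

# Z score within each sound group (pandas .std() defaults to ddof=1)
df['zg'] = df.groupby('sound')['score'].transform(lambda x: (x - x.mean()) / x.std())

group_means = df.groupby('sound')['zg'].mean()   # each is 0 up to float error
group_sizes = df.groupby('sound')['zg'].size()   # 3 rows in each group

# Overall mean is the size-weighted average of the group means
weighted = (group_means * group_sizes).sum() / group_sizes.sum()

# Each group contributes (n_g - 1) to the sum of squares, so the
# pooled sample variance is (n - k) / (n - 1)
n = len(df)
k = df['sound'].nunique()
expected_sd = np.sqrt((n - k) / (n - 1))

print(group_means)
print('Weighted mean of group means:', weighted)
print('Overall mean of zg:', df['zg'].mean())
print('SD of zg:', df['zg'].std(), 'vs sqrt((n - k)/(n - 1)):', expected_sd)
```

So a mean of 0 is guaranteed for any grouping, while the SD of the combined column only approaches 1 when the number of rows is large relative to the number of groups.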