python pandas dataframe normalize

Normalize across DF then within Group


I'm trying to normalize continuous variables first across the entire DF, then normalize again within group.

Here is a sample DF

       ave_win_last5  id_race
0               6.00     6734
1               3.25     6734
2               6.75     6734
3               5.50     6734
4               5.50     6734

I'm able to normalize within the df by using

x_var['ave_win_last5'] = (x_var['ave_win_last5']-x_var['ave_win_last5'].mean())/x_var['ave_win_last5'].std()

However, when I then attempt to normalize within each group, the output is all NaN

x_var['ave_win_last5'] = (x_var['ave_win_last5'] -x_var.groupby('id_race')['ave_win_last5'].mean())/x_var.groupby('id_race')['ave_win_last5'].std()

      ave_win_last5  id_race
0                NaN     6734
1                NaN     6734
2                NaN     6734
3                NaN     6734
4                NaN     6734

I'm not sure why this is returning NaN.


Solution

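  • The NaN values come from index alignment: groupby('id_race')['ave_win_last5'].mean() returns a Series indexed by id_race (here 6734), while the column itself is indexed by row label (0 to 4). pandas aligns on labels when subtracting, no labels match, and every result is NaN. One option is to use groupby.transform, which returns a result aligned to the original index, and move the normalization logic into transform (df here is the question's x_var):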

    df['ave_win_last5'] = df.groupby('id_race').ave_win_last5.transform(lambda s: (s - s.mean()) / s.std())
    
    df
    #   ave_win_last5  id_race
    #0       0.459335     6734
    #1      -1.645952     6734
    #2       1.033505     6734
    #3       0.076556     6734
    #4       0.076556     6734
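
As a rough illustration of the above, the sketch below rebuilds the sample data from the question, reproduces the all-NaN result, and then normalizes within each group using the built-in "mean" and "std" aggregations passed to transform instead of a lambda. The variable names group_mean, misaligned, grouped and z are placeholders chosen for this example, not names from the original post.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ave_win_last5": [6.00, 3.25, 6.75, 5.50, 5.50],
    "id_race": [6734] * 5,
})

# groupby(...).mean() returns one value per group, indexed by id_race (6734),
# not by the original row labels 0..4.
group_mean = df.groupby("id_race")["ave_win_last5"].mean()

# Series arithmetic aligns on index labels; 0..4 and 6734 never match,
# so every entry of the result is NaN.
misaligned = df["ave_win_last5"] - group_mean

# transform broadcasts each group's statistic back onto the original index,
# so the subtraction and division line up row by row.
grouped = df.groupby("id_race")["ave_win_last5"]
z = (df["ave_win_last5"] - grouped.transform("mean")) / grouped.transform("std")
```

Passing the aggregation names as strings should give the same values as the lambda version, and it avoids calling a Python function once per group, which may matter when there are many races. Note that std uses the sample standard deviation (ddof=1) by default, so a group with a single row would still come out as NaN.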