Search code examples
pythonpandasdataframemeanstandard-deviation

Finding mean/SD of a group of population and mean/SD of remaining population within a data frame


I have a pandas data frame that looks like this:

id  age  weight  group
1    12    45    [10-20]
1    18    110   [10-20]
1    25    25    [20-30]
1    29    85    [20-30]
1    32    49    [30-40]
1    31    70    [30-40]
1    37    39    [30-40] 

I am looking for a data frame that would look like this: (sd=standard deviation)

  group   group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
 [10-20]                       
 [20-30] 
 [30-40] 

Here the second/third columns are mean and SD for that group. columns third and fourth are mean and SD for the rest of the groups combined.


Solution

  • Here's a way to do it:

    res = df.group.to_frame().groupby('group').count()
    for group in res.index:
        mask = df.group==group
        srGroup, srOther = df.loc[mask, 'weight'], df.loc[~mask, 'weight']
        res.loc[group, ['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight']] = [
            srGroup.mean(), srGroup.std(), srOther.mean(), srOther.std()]
    res = res.reset_index()
    

    Output:

         group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
    0  [10-20]          77.500000        45.961941             53.60       24.016661
    1  [20-30]          55.000000        42.426407             62.60       28.953411
    2  [30-40]          52.666667        15.821926             66.25       38.378596
    

    An alternative way to get the same result is:

    res = ( pd.DataFrame(
        df.group.drop_duplicates().to_frame()
            .apply(lambda x: [
                df.loc[df.group==x.group,'weight'].mean(), 
                df.loc[df.group==x.group,'weight'].std(), 
                df.loc[df.group!=x.group,'weight'].mean(), 
                df.loc[df.group!=x.group,'weight'].std()], axis=1, result_type='expand')
            .to_numpy(),
        index=list(df.group.drop_duplicates()),
        columns=['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight'])
        .reset_index().rename(columns={'index':'group'}) )
    

    Output:

         group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight
    0  [10-20]          77.500000        45.961941             53.60       24.016661
    1  [20-30]          55.000000        42.426407             62.60       28.953411
    2  [30-40]          52.666667        15.821926             66.25       38.378596
    

    UPDATE: OP asked in a comment: "what if I have more than one weight column? what if I have around 10 different weight columns and I want sd for all weight columns?"

    To illustrate below, I have created two weight columns (weight and weight2) and have simply provided all 4 aggregates (mean, sd, mean of other, sd of other) for each weight column.

    wgtCols = ['weight','weight2']
    res = ( pd.concat([ pd.DataFrame(
        df.group.drop_duplicates().to_frame()
            .apply(lambda x: [
                df.loc[df.group==x.group,wgtCol].mean(), 
                df.loc[df.group==x.group,wgtCol].std(), 
                df.loc[df.group!=x.group,wgtCol].mean(), 
                df.loc[df.group!=x.group,wgtCol].std()], axis=1, result_type='expand')
            .to_numpy(),
        index=list(df.group.drop_duplicates()),
        columns=[f'group_mean_{wgtCol}',f'group_sd_{wgtCol}',f'rest_mean_{wgtCol}',f'rest_sd_{wgtCol}'])
        for wgtCol in wgtCols], axis=1)
        .reset_index().rename(columns={'index':'group'}) )
    

    Input:

       id  age  weight  weight2    group
    0   1   12      45       55  [10-20]
    1   1   18     110      120  [10-20]
    2   1   25      25       35  [20-30]
    3   1   29      85       95  [20-30]
    4   1   32      49       59  [30-40]
    5   1   31      70       80  [30-40]
    6   1   37      39       49  [30-40]
    

    Output:

         group  group_mean_weight  group_sd_weight  rest_mean_weight  rest_sd_weight  group_mean_weight2  group_sd_weight2  rest_mean_weight2  rest_sd_weight2
    0  [10-20]          77.500000        45.961941             53.60       24.016661           87.500000         45.961941              63.60        24.016661
    1  [20-30]          55.000000        42.426407             62.60       28.953411           65.000000         42.426407              72.60        28.953411
    2  [30-40]          52.666667        15.821926             66.25       38.378596           62.666667         15.821926              76.25        38.378596