python pandas dataframe group-by statistics

Groupby Pandas DataFrame and calculate mean and stdev of one column


I have a Pandas DataFrame as below:

   a      b      c      d
0  Apple  3      5      7
1  Banana 4      4      8
2  Cherry 7      1      3
3  Apple  3      4      7

I would like to group the rows by column 'a', replace the values in column 'c' with the mean of each group, and add a new column 'e' holding the standard deviation of the column 'c' values that went into that mean. The values in columns 'b' and 'd' are constant within each group. The desired output is:

   a      b      c      d      e
0  Apple  3      4.5    7      0.707107
1  Banana 4      4      8      0
2  Cherry 7      1      3      0

What is the best way to achieve this?


Solution

  • You could use a groupby-agg operation:

    In [38]: result = df.groupby(['a'], as_index=False).agg(
                          {'c':['mean','std'],'b':'first', 'd':'first'})
    

    and then rename and reorder the columns:

    In [39]: result.columns = ['a','c','e','b','d']
    
    In [40]: result.reindex(columns=sorted(result.columns))
    Out[40]: 
            a  b    c  d         e
    0   Apple  3  4.5  7  0.707107
    1  Banana  4  4.0  8       NaN
    2  Cherry  7  1.0  3       NaN
    

    Pandas computes the sample std (ddof=1) by default, which is NaN for a group containing a single row. To compute the population std (ddof=0), which is 0 for such groups:

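    def pop_std(x):
        # ddof=0 divides by N (population) instead of N - 1 (sample)
        return x.std(ddof=0)
    
    result = df.groupby(['a'], as_index=False).agg(
        {'c': ['mean', pop_std], 'b': 'first', 'd': 'first'})
    
    # flatten the two-level column index, then sort the columns alphabetically
    result.columns = ['a', 'c', 'e', 'b', 'd']
    result = result.reindex(columns=sorted(result.columns))
    print(result)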
    

    yields

            a  b    c  d    e
    0   Apple  3  4.5  7  0.5
    1  Banana  4  4.0  8  0.0
    2  Cherry  7  1.0  3  0.0
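
Neither result above matches the desired output exactly: the question shows the sample std for Apple (0.707107) but 0 for the single-row groups. One way to get that combination is to keep the default std and fill the resulting NaN values with 0. The following is a minimal sketch, assuming pandas 0.25 or later (for named aggregation) and assuming that treating an undefined sample std as 0 is acceptable for your use case. It also names the output columns directly, so no renaming or reordering step is needed.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
df = pd.DataFrame({
    "a": ["Apple", "Banana", "Cherry", "Apple"],
    "b": [3, 4, 7, 3],
    "c": [5, 4, 1, 4],
    "d": [7, 8, 3, 7],
})

# Named aggregation: output_column=(input_column, aggregation).
result = (
    df.groupby("a")
      .agg(
          b=("b", "first"),   # constant within each group, so take the first value
          c=("c", "mean"),
          d=("d", "first"),
          e=("c", "std"),     # sample std (ddof=1); NaN for single-row groups
      )
      .reset_index()
)

# Assumption: an undefined sample std should be reported as 0.
result["e"] = result["e"].fillna(0)

print(result)
```

If you want the population std instead, replace "std" with a callable such as lambda x: x.std(ddof=0); the fillna step is then unnecessary.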