Search code examples
pythonpandasdataframepivotdata-cleaning

Python pandas groupby aggregate on multiple columns, then pivot


In Python, I have a pandas DataFrame similar to the following:

Item | shop1 | shop2 | shop3 | Category
------------------------------------
Shoes| 45    | 50    | 53    | Clothes
TV   | 200   | 300   | 250   | Technology
Book | 20    | 17    | 21    | Books
phone| 300   | 350   | 400   | Technology

Where shop1, shop2 and shop3 are the costs of every item in different shops. Now, I need to return a DataFrame, after some data cleaning, like this one:

Category (index)| size| sum| mean | std
----------------------------------------

where size is the number of items in each Category and sum, mean and std are related to the same functions applied to the 3 shops. How can I do these operations with the split-apply-combine pattern (groupby, aggregate, apply,...) ?

Can someone help me out? I'm going crazy with this one...thank you!


Solution

  • Edited for Pandas 0.22+ considering the deprecation of the use of dictionaries in a group by aggregation.

    We set up a very similar dictionary where we use the keys of the dictionary to specify our functions and the dictionary itself to rename the columns.

    rnm_cols = dict(size='Size', sum='Sum', mean='Mean', std='Std')
    df.set_index(['Category', 'Item']).stack().groupby('Category') \
      .agg(rnm_cols.keys()).rename(columns=rnm_cols)
    
                Size   Sum        Mean        Std
    Category                                     
    Books          3    58   19.333333   2.081666
    Clothes        3   148   49.333333   4.041452
    Technology     6  1800  300.000000  70.710678
    

    option 1
    use agg ← link to docs

    agg_funcs = dict(Size='size', Sum='sum', Mean='mean', Std='std')
    df.set_index(['Category', 'Item']).stack().groupby(level=0).agg(agg_funcs)
    
                      Std   Sum        Mean  Size
    Category                                     
    Books        2.081666    58   19.333333     3
    Clothes      4.041452   148   49.333333     3
    Technology  70.710678  1800  300.000000     6
    

    option 2
    more for less
    use describe ← link to docs

    df.set_index(['Category', 'Item']).stack().groupby(level=0).describe().unstack()
    
                count        mean        std    min    25%    50%    75%    max
    Category                                                                   
    Books         3.0   19.333333   2.081666   17.0   18.5   20.0   20.5   21.0
    Clothes       3.0   49.333333   4.041452   45.0   47.5   50.0   51.5   53.0
    Technology    6.0  300.000000  70.710678  200.0  262.5  300.0  337.5  400.0