Tags: python, pandas, pandas-groupby, categories

Groupby and Value Counting categories


I have a dataframe in which each row represents a product sale. Rows are linked to an order number (an order can contain multiple products), with a price and color for each. I need to group these by 'Order #' and get a column that counts each product type for that order's row.

df = pd.DataFrame({'Product': ['X','X','Y','X','Y','W','W','Z','W','X'],
                   'Order #': ['01','01','02','03','03','03','04','05','05','05'],
                   'Price': [100,100,650,50,700,3000,2500,10,2500,150],
                   'Color': ['RED','BLUE','RED','RED','BLUE','GREEN','RED','BLUE','BLUE','GREEN']})

A 'regular' group-by expression using count is not what I am looking for.

# Aggregate 
ag_func = {'Product Quant.': pd.NamedAgg(column='Product', aggfunc='count'),
           'Total Price': pd.NamedAgg(column='Price', aggfunc='sum'),
           'Color Quant.': pd.NamedAgg(column='Color', aggfunc='count')}

# Test
test = df.groupby(pd.Grouper(key='Order #')).agg(**ag_func).reset_index()

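For reference, running that aggregation on the sample data (rebuilt below so the snippet stands alone) only yields per-order totals, with no breakdown per product or color:

```python
import pandas as pd

df = pd.DataFrame({'Product': ['X','X','Y','X','Y','W','W','Z','W','X'],
                   'Order #': ['01','01','02','03','03','03','04','05','05','05'],
                   'Price': [100,100,650,50,700,3000,2500,10,2500,150],
                   'Color': ['RED','BLUE','RED','RED','BLUE','GREEN','RED','BLUE','BLUE','GREEN']})

# Plain count/sum aggregation: counts rows per order, not per category
ag_func = {'Product Quant.': pd.NamedAgg(column='Product', aggfunc='count'),
           'Total Price': pd.NamedAgg(column='Price', aggfunc='sum'),
           'Color Quant.': pd.NamedAgg(column='Color', aggfunc='count')}

test = df.groupby('Order #').agg(**ag_func).reset_index()
print(test)
```

which prints:

      Order #  Product Quant.  Total Price  Color Quant.
    0      01               2          200             2
    1      02               1          650             1
    2      03               3         3750             3
    3      04               1         2500             1
    4      05               3         2660             3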

I can solve this issue by using get_dummies for each category (product / color) and then using the sum aggregate function. This is fine for smaller datasets, but in my real-world case there are many dozens of categories, and new datasets come in with entirely different categories...

This is the 'solution' I came up with:

# Dummy 
df_dummy = pd.get_dummies(df, prefix='Type', prefix_sep=': ', columns=['Product', 'Color'])

ag_func2 = {'Product Quant.': pd.NamedAgg(column='Order #', aggfunc='count'),
            'W total': pd.NamedAgg(column='Type: W', aggfunc='sum'),
            'X total': pd.NamedAgg(column='Type: X', aggfunc='sum'),
            'Y total': pd.NamedAgg(column='Type: Y', aggfunc='sum'),
            'Z total': pd.NamedAgg(column='Type: Z', aggfunc='sum'),
            'Total Price': pd.NamedAgg(column='Price', aggfunc='sum'),
            'Color BLUE': pd.NamedAgg(column='Type: BLUE', aggfunc='sum'),
            'Color GREEN': pd.NamedAgg(column='Type: GREEN', aggfunc='sum'),
            'Color RED': pd.NamedAgg(column='Type: RED', aggfunc='sum')}

solution = df_dummy.groupby(pd.Grouper(key='Order #')).agg(**ag_func2).reset_index()


Note the two X products in the first row and the two BLUEs in the fifth row. This behaviour is what I need, but it is too convoluted for repeated use across multiple datasets. I've tried to use pivot_table but with no success.

Should I just define a function that goes through the categorical columns, dummies them, and then groups by a set column using sum aggregation on the dummy variables?

Thanks


Solution

  • IIUC, your problem is having to type out a pd.NamedAgg for every dummy column created. Maybe you can do your operations separately: first create the groupby object, then concat the different operations on the different columns.

    gr = df.groupby('Order #')
    res = pd.concat([
        # equivalent to count the orders
        gr.size().to_frame(name='Product Quant.'),
        # equivalent to dummy then sum the dummy product columns
        gr['Product'].value_counts().unstack(fill_value=0).add_suffix(' Total'),
        # sum the price to get the total
        gr['Price'].sum().to_frame(name='Total Price'),
        # equivalent to sum the dummy color columns
        gr['Color'].value_counts().unstack(fill_value=0).add_prefix('Color ')
    ], axis=1)
    
    print(res)
             Product Quant.  W Total  X Total  Y Total  Z Total  Total Price  \
    Order #                                                                    
    01                    2        0        2        0        0          200   
    02                    1        0        0        1        0          650   
    03                    3        1        1        1        0         3750   
    04                    1        1        0        0        0         2500   
    05                    3        1        1        0        1         2660   
    
             Color BLUE  Color GREEN  Color RED  
    Order #                                      
    01                1            0          1  
    02                0            0          1  
    03                1            1          1  
    04                0            0          1  
    05                2            1          0  
    

    So basically, in this case, groupby.value_counts().unstack() is equivalent to get_dummies followed by groupby.sum().
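    A quick check of that equivalence on the 'Product' column (rebuilding the sample data so the snippet stands alone):

    ```python
    import pandas as pd

    df = pd.DataFrame({'Product': ['X','X','Y','X','Y','W','W','Z','W','X'],
                       'Order #': ['01','01','02','03','03','03','04','05','05','05']})

    # Route 1: count each product per order directly
    vc = df.groupby('Order #')['Product'].value_counts().unstack(fill_value=0)

    # Route 2: one-hot encode, then sum the indicator columns per order
    dm = (pd.get_dummies(df, columns=['Product'], prefix='', prefix_sep='')
            .groupby('Order #').sum())

    # Same counts either way (only axis metadata differs)
    print((vc.values == dm.values).all())  # True
    ```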

    And for further use, instead of

    df.groupby('Order #')['Product'].value_counts().unstack(fill_value=0)
    

    you can do a pivot_table to get the same result:

    df.pivot_table(index='Order #', columns='Product', aggfunc='size', fill_value=0)
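    And since pivot_table takes the column name as an argument, it also answers the "many dozens of categories" concern: you can loop over whatever categorical columns a dataset happens to have. A minimal sketch of such a helper (the function name and its signature are my own, not anything from pandas):

    ```python
    import pandas as pd

    df = pd.DataFrame({'Product': ['X','X','Y','X','Y','W','W','Z','W','X'],
                       'Order #': ['01','01','02','03','03','03','04','05','05','05'],
                       'Price': [100,100,650,50,700,3000,2500,10,2500,150],
                       'Color': ['RED','BLUE','RED','RED','BLUE','GREEN','RED','BLUE','BLUE','GREEN']})

    def summarize_orders(frame, key, cat_cols, price_col):
        """Hypothetical helper: per-key row count, price total, and one
        column of counts per level of every categorical column."""
        pieces = [frame.groupby(key).size().to_frame('Row Count'),
                  frame.groupby(key)[price_col].sum().to_frame('Total Price')]
        for col in cat_cols:
            # size per (key, category level), one column per level
            counts = frame.pivot_table(index=key, columns=col,
                                       aggfunc='size', fill_value=0)
            pieces.append(counts.add_prefix(f'{col}: '))
        return pd.concat(pieces, axis=1).reset_index()

    res = summarize_orders(df, 'Order #', ['Product', 'Color'], 'Price')
    print(res)
    ```

    New datasets with different categorical columns only change the cat_cols list; the level columns are generated from the data itself.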