
Pandas create % and # distribution list in descending order for each group


I have a pandas dataframe like the one below:

import pandas as pd

data = {
    'cust_id': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc'],
    'product_id': [12, 12, 12, 12, 12, 12, 12, 12, 12, 12],
    'purchase_country': ['India', 'India', 'India', 'Australia', 'Australia', 'Australia', 'Australia', 'Australia', 'Australia', 'Australia']
}
df = pd.DataFrame(data)

My objective is to do the following for each group of cust_id and product_id:

a) Create two output columns: 'pct_region_split' and 'num_region_split'.

b) For 'pct_region_split', store the percentage split by country. For example, for the group shown in the sample data, Australia is 70% (7 out of 10 rows) and India is 30% (3 out of 10 rows).

c) For 'num_region_split', store the number of rows for each country value. For example, for the same group, Australia has 7 of the 10 rows and India has 3.

d) Store the values in list format, in descending order. That is, Australia should appear first because its value (70%) is higher than India's.

I tried the following, but it is going nowhere:

df['total_purchases'] = df.groupby(['cust_id', 'product_id'])['purchase_country'].transform('size')
df['unique_country'] = df.groupby(['cust_id', 'product_id'])['purchase_country'].transform('nunique')

Please note that my real data has more than 1000 customers and 200 product combinations.

I expect my output in a new dataframe, with one row for each cust_id and product_id combination (the expected-output image from the original post is not available here).


Solution

  • Use a custom function and groupby.apply:

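    def f(g):
        # Rows per country in this group; value_counts sorts in descending order
        s = g['purchase_country'].value_counts()
        return pd.Series({
            # "country:count" pairs, largest first
            'num_region_split': ', '.join(s.index + ':' + s.astype(str)),
            # "country:fraction" pairs (count divided by the group total)
            'pct_region_split': ', '.join(s.index + ':' + s.div(s.sum()).astype(str)),
        })

    # One output row per (cust_id, product_id) group
    df.groupby(['cust_id', 'product_id'], as_index=False).apply(f)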
    

    Output:

      cust_id  product_id      num_region_split          pct_region_split
    0     abc          12  Australia:7, India:3  Australia:0.7, India:0.3
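The answer above returns comma-separated strings with fractions (0.7, 0.3). If actual Python lists and percentage values are wanted, as the question describes, one possible variant is sketched below. It is not part of the original answer: the helper column names (`n`, `pct`, `num_label`, `pct_label`), the one-decimal rounding, and the `%` suffix are assumptions made for illustration. It counts rows per group and country with `groupby(...).size()`, so it does not rely on `groupby.apply`.

```python
import pandas as pd

# Sample data from the question
data = {
    "cust_id": ["abc"] * 10,
    "product_id": [12] * 10,
    "purchase_country": ["India"] * 3 + ["Australia"] * 7,
}
df = pd.DataFrame(data)

keys = ["cust_id", "product_id"]

# Number of rows per (cust_id, product_id, purchase_country)
counts = (
    df.groupby(keys + ["purchase_country"])
      .size()
      .reset_index(name="n")
)

# Percentage of each country within its (cust_id, product_id) group
group_total = counts.groupby(keys)["n"].transform("sum")
counts["pct"] = (counts["n"] * 100 / group_total).round(1)

# Largest country first within each group
counts = counts.sort_values(keys + ["n"], ascending=[True, True, False])

# Build "country:value" labels (assumed format, adjust as needed)
counts["num_label"] = counts["purchase_country"] + ":" + counts["n"].astype(str)
counts["pct_label"] = counts["purchase_country"] + ":" + counts["pct"].astype(str) + "%"

# Collect the labels into one list per group, keeping the sorted order
result = (
    counts.groupby(keys, sort=False)
          .agg(num_region_split=("num_label", list),
               pct_region_split=("pct_label", list))
          .reset_index()
)
print(result)
```

Because the counting and sorting are done with vectorised operations rather than a Python function per group, this form may also scale more comfortably to the larger number of customer and product combinations mentioned in the question, though that has not been measured here.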