Search code examples
pythonpandasdataframesumaggregate-functions

build df from sum of column value


I need to group the data by customer_id and get the sum of purchase for each months. My data looks like this:

cust_id        months
1               1
1               1
1               2
1               4
2               1
2               1

So I need to see the sum of purchase for each months and each customer. The desired output is:

cust_id     mo1     mo2     mo3     mo4
1           2       1       0       1
1           2       0       0       0

Solution

  • Use crosstab with DataFrame.reindex for add missing categories:

    r = range(df['months'].min(), df['months'].max() + 1)
    df = (pd.crosstab(df['cust_id'],df['months'])
            .reindex(r, axis=1, fill_value=0)
            .add_prefix('mo'))
    print (df)
    months   mo1  mo2  mo3  mo4
    cust_id                    
    1          2    1    0    1
    2          2    0    0    0
    

    If need all months is possible use ordered categoricals:

    df['months'] = pd.Categorical(df['months'], ordered=True, categories=range(1, 13))
    
    df = df.groupby(['cust_id','months']).size().unstack(fill_value=0).add_prefix('mo')
    print (df)
    months   mo1  mo2  mo3  mo4  mo5  mo6  mo7  mo8  mo9  mo10  mo11  mo12
    cust_id                                                               
    1          2    1    0    1    0    0    0    0    0     0     0     0
    2          2    0    0    0    0    0    0    0    0     0     0     0
    

    Or reindex by range for all months:

    r = range(1, 13)
    df = (pd.crosstab(df['cust_id'],df['months'])
            .reindex(r, axis=1, fill_value=0)
            .add_prefix('mo'))
    print (df)
    months   mo1  mo2  mo3  mo4  mo5  mo6  mo7  mo8  mo9  mo10  mo11  mo12
    cust_id                                                               
    1          2    1    0    1    0    0    0    0    0     0     0     0
    2          2    0    0    0    0    0    0    0    0     0     0     0