Search code examples
pythonpandasdataframecrosstab

How to sum based on the condition of two columns and presented in crosstab format using Pandas?


Given the dataframe below:

Name    Activity    Hour    Month
A       TT          5       1
A       TT          2       1
A       UU          1       1
A       UU          1       2
A       UU          1       3
B       TT          40      3
C       UU          10      1
D       TT          2       2
D       TT          2       2
D       TT          2       2
D       TT          5       1

The next step is to get the summation if the rows have identical value of the column Name and Activity.

For example, for the case Name: A and Activity: TT will give the summation of 7

Then, I would like to present it in a cross tab format which is group according to month and activity, as shown below

                Month
       1           2          3
    TT  UU      TT  UU      TT  UU
A   7   1       0   1       0   1
B   0   0       0   0       40  0
C   0   10      0   0       0   0
D   5   0       6   0       0   0

May I know whether this can achieve directly using pandas crosstab?

p.s., similar but different topic has been discussed separately here. Please let me know if this thread is considered to be a duplicate of the OP


Solution

  • Considering below df:

    In [93]: df
    Out[93]: 
       Name Activity  Hour  Month
    0     A       TT     5      1
    1     A       TT     2      1
    2     A       UU     1      1
    3     A       UU     1      2
    4     A       UU     1      3
    5     B       TT    40      3
    6     C       UU    10      1
    7     D       TT     2      2
    8     D       TT     2      2
    9     D       TT     2      2
    10    D       TT     5      1
    

    Solution:1 If you want using pd.crosstab, you can do this:

    In [92]: pd.crosstab(df.Name, columns=[df.Month, df.Activity], values=df.Hour, aggfunc='sum').fillna(0)
    Out[92]: 
    Month       1          2          3     
    Activity   TT    UU   TT   UU    TT   UU
    Name                                    
    A         7.0   1.0  0.0  1.0   0.0  1.0
    B         0.0   0.0  0.0  0.0  40.0  0.0
    C         0.0  10.0  0.0  0.0   0.0  0.0
    D         5.0   0.0  6.0  0.0   0.0  0.0
    

    Solution:2 You can use df.pivot_table:

    In [89]: df.pivot_table(index='Name', columns=['Month', 'Activity'], values='Hour', aggfunc='sum', fill_value=0)
    Out[89]: 
    Month     1      2      3   
    Activity TT  UU TT UU  TT UU
    Name                        
    A         7   1  0  1   0  1
    B         0   0  0  0  40  0
    C         0  10  0  0   0  0
    D         5   0  6  0   0  0