Search code examples
pythonpandasdataframegroup-by

adding column to dataframe using groupby on Date column


Existing Dataframe :

Sr.No        Date         Tag    score
01         10-02-2022    pass     10
02         10-02-2022    fail     5
03         10-02-2022    pass     10
04         11-02-2022    grace    3
05         11-02-2022    pass     15
06         11-02-2022    pass     15

Expected Dataframe :

Sr.No   Date       Tag  score no_of_records pass_count fail_count grace_count pass_score_total 
01    10-02-2022  pass   10         3           2          1          0             20
02    10-02-2022  fail   5          
03    10-02-2022  pass   10
04    11-02-2022  grace  3          3           2          0          1             30
05    11-02-2022  pass   15
06    11-02-2022  pass   15

no_of_records implies the no. of records for particular day. how to proceed with groupby on Date to get the desired output..?


Solution

  • First get sums for pass in s Series by aggregate sum, then get counts by Series.value_counts, join with crosstab:

    s = df['score'].where(df['Tag'].eq('pass'), 0).groupby(df['Date']).sum()
                    .rename('pass_score_total'))
    s1 = df['Date'].value_counts().rename('no_of_records')
    
    df1 = pd.concat([s1, pd.crosstab(df['Date'], df['Tag']).add_suffix('_count'), s], axis=1)
    df = df.join(df1, on='Date')
    print (df)
       Sr.No        Date    Tag  score  no_of_records  fail_count  grace_count  \
    0      1  10-02-2022   pass     10              3           1            0   
    1      2  10-02-2022   fail      5              3           1            0   
    2      3  10-02-2022   pass     10              3           1            0   
    3      4  11-02-2022  grace      3              3           0            1   
    4      5  11-02-2022   pass     15              3           0            1   
    5      6  11-02-2022   pass     15              3           0            1   
    
       pass_count  pass_score_total  
    0           2                20  
    1           2                20  
    2           2                20  
    3           2                30  
    4           2                30  
    5           2                30  
    

    If need empty strings:

    s = df['score'].where(df['Tag'].eq('pass'), 0).groupby(df['Date']).sum().rename('pass_score_total')
    s1 = df['Date'].value_counts().rename('no_of_records')
    
    df1 = pd.concat([s1, pd.crosstab(df['Date'], df['Tag']).add_suffix('_count'), s], axis=1)
    df = df.join(df1, on='Date')
    df[df1.columns] = df[df1.columns].mask(df['Date'].duplicated(), '', axis=1)
    print (df)
       Sr.No        Date    Tag  score no_of_records fail_count grace_count  \
    0      1  10-02-2022   pass     10             3          1           0   
    1      2  10-02-2022   fail      5                                        
    2      3  10-02-2022   pass     10                                        
    3      4  11-02-2022  grace      3             3          0           1   
    4      5  11-02-2022   pass     15                                        
    5      6  11-02-2022   pass     15                                        
    
      pass_count pass_score_total  
    0          2               20  
    1                              
    2                              
    3          2               30  
    4                              
    5