Search code examples
pythongroup-bypivot

loop through a group by, aggregate, and make new columns based on the groups


I am trying to get by-student columns that count up activities by their progress level.

Data looks like enter image description here

STUDENT_ID STUDENT_ACTIVITY_SESSION_ID NODE_NAME   ACTIVITY_NAME   prog_level
FredID  gobbledeegook1  Node1   MyActivity1 pass
FredID  gobbledeegook2  Node1   MyActivity1 pass
FredID  gobbledeegook3  Node2   MyActivity2 pass
JaniceID    gobbledeegook4  Node3   MyActivity3 stay
JaniceID    gobbledeegook5  Node3   MyActivity3 stay
JaniceID    gobbledeegook5  Node3   MyActivity3 fail

Here is what I want: enter image description here

STUDENT_ID attempts_pass   attempts_fail   attempts_stay
FredID  3       
JaniceID        1   2
  1. I attempt to loop through so variable names are automatic. I want each row to be a STUDENT_ID, and the count to be a column
def std_attempts_by_prog_level(df):
    dict_fields = {}
    df_by_prog_level = df.groupby('prog_level')['STUDENT_ACTIVITY_SESSION_ID']
    for name, group in df_by_prog_level:
        x = group.count() 
        dict_fields["attempts_" + name] = x

    return pd.Series(dict_fields)     
  
df.groupby('STUDENT_ID').apply(std_attempts_by_prog_level).reset_index()

result:

STUDENT_ID level_1 0
0   Fred    attempts_cancel 104
1   Fred    attempts_fail   96
2   Fred    attempts_in_progress    30

...so this would need to be pivoted and messed with, so I tried just taking it from a pivot approach

  1. Pivot approach and naming the fields manually: the resulting multi-index won't let me readily merge back with the other by-student metrics
df_temp=df.groupby(['STUDENT_ID', 'prog_level'],as_index=False)['STUDENT_ACTIVITY_SESSION_ID'].count().pivot(index='STUDENT_ID', columns='prog_level').rename({'cancel':'attempts_cancel', 'fail':'attempts_fail', 'in_progress':'attempts_in_progress', 'pass':'attempts_pass'}, axis=1)

print(df_temp.columns)

result:

MultiIndex([('STUDENT_ACTIVITY_SESSION_ID',      'attempts_cancel'),
            ('STUDENT_ACTIVITY_SESSION_ID',        'attempts_fail'),
            ('STUDENT_ACTIVITY_SESSION_ID', 'attempts_in_progress'),
            ('STUDENT_ACTIVITY_SESSION_ID',        'attempts_pass')],
           names=[None, 'prog_level'])

Solution

  • You could use .pivot_table:

    result = df.pivot_table(
        index="STUDENT_ID", columns="prog_level", values="ACTIVITY_NAME",
        aggfunc="count", fill_value=0
    ).rename(lambda c: f"prog_level_{c}", axis=1).rename_axis(None, axis=1)
    

    Result:

                prog_level_fail  prog_level_pass  prog_level_stay
    STUDENT_ID                                                   
    FredID                    0                3                0
    JaniceID                  1                0                2
    

    If you want the index as column then add .reset_index() at the end of the pipeline.