I am trying to get by-student columns that count up activities by their progress level.
STUDENT_ID STUDENT_ACTIVITY_SESSION_ID NODE_NAME ACTIVITY_NAME prog_level FredID gobbledeegook1 Node1 MyActivity1 pass FredID gobbledeegook2 Node1 MyActivity1 pass FredID gobbledeegook3 Node2 MyActivity2 pass JaniceID gobbledeegook4 Node3 MyActivity3 stay JaniceID gobbledeegook5 Node3 MyActivity3 stay JaniceID gobbledeegook5 Node3 MyActivity3 fail
STUDENT_ID attempts_pass attempts_fail attempts_stay FredID 3 JaniceID 1 2
def std_attempts_by_prog_level(df):
dict_fields = {}
df_by_prog_level = df.groupby('prog_level')['STUDENT_ACTIVITY_SESSION_ID']
for name, group in df_by_prog_level:
x = group.count()
dict_fields["attempts_" + name] = x
return pd.Series(dict_fields)
df.groupby('STUDENT_ID').apply(std_attempts_by_prog_level).reset_index()
result:
STUDENT_ID level_1 0 0 Fred attempts_cancel 104 1 Fred attempts_fail 96 2 Fred attempts_in_progress 30
...so this would need to be pivoted and messed with, so I tried just taking it from a pivot approach
df_temp=df.groupby(['STUDENT_ID', 'prog_level'],as_index=False)['STUDENT_ACTIVITY_SESSION_ID'].count().pivot(index='STUDENT_ID', columns='prog_level').rename({'cancel':'attempts_cancel', 'fail':'attempts_fail', 'in_progress':'attempts_in_progress', 'pass':'attempts_pass'}, axis=1)
print(df_temp.columns)
result:
MultiIndex([('STUDENT_ACTIVITY_SESSION_ID', 'attempts_cancel'),
('STUDENT_ACTIVITY_SESSION_ID', 'attempts_fail'),
('STUDENT_ACTIVITY_SESSION_ID', 'attempts_in_progress'),
('STUDENT_ACTIVITY_SESSION_ID', 'attempts_pass')],
names=[None, 'prog_level'])
You could use .pivot_table
:
result = df.pivot_table(
index="STUDENT_ID", columns="prog_level", values="ACTIVITY_NAME",
aggfunc="count", fill_value=0
).rename(lambda c: f"prog_level_{c}", axis=1).rename_axis(None, axis=1)
Result:
prog_level_fail prog_level_pass prog_level_stay
STUDENT_ID
FredID 0 3 0
JaniceID 1 0 2
If you want the index as column then add .reset_index()
at the end of the pipeline.