I am trying to summarize data that involves nested group-by's and summarizing things by a user ID. There has to be an easier way to do this! This would be easy in SQL...
The data is at a granularity of user x session x activity. Activities are nested within nodes. I need to roll it up to the USER (aka student) level, and count unique activities and unique nodes that satisfy specific criteria.
STUDENT_ID  STUDENT_ACTIVITY_SESSION_ID  ACTIVITY_START_TIMESTAMP  MINUTES  NODE_NAME  ACTIVITY_NAME  IS_PRETEST_ACTIVITY  min_node_grade
FredID      gobbledeegook1               11/10/2022                2        Node1      MyActivity1    1                    Kindegarten
FredID      gobbledeegook2               3/8/2023                  3        Node1      MyActivity1    1                    Kindegarten
FredID      gobbledeegook3               12/1/2022                 4        Node2      MyActivity2    0                    1st Grade
JaniceID    gobbledeegook4               3/1/2023                  5        Node3      MyActivity3    1                    2nd Grade
JaniceID    gobbledeegook5               3/8/2023                  6        Node3      MyActivity4    0                    2nd Grade
STUDENT_ID  min_pretest  min_not_pretest  total_min  unique_pretest_activities  unique_non_pretest_activities  UNIQUE_NODES  K_nodes_started  1stgr_nodes_started  2ndgr_and_beyond_nodes_started
FredID      5            4                9          1                          1                              2             1                1                    0
JaniceID    5            6                11         1                          1                              1             0                0                    2
I first created a dataframe, df_by_student, with all the student IDs and information about them.
Then I want to add a column that, per student, counts unique activities (split by whether IS_PRETEST_ACTIVITY is true or false). The code below kind of works, but then I have to create a temporary df, merge it back into df_by_student, and assign it to a column... And I have to do that for each column. There has to be an easier way??
df_temp = (
    df.loc[df['IS_PRETEST_ACTIVITY'] == 0, ['STUDENT_ID', 'ACTIVITY_ID']]
    .value_counts()
    .reset_index()
    .groupby('STUDENT_ID', as_index=False)['ACTIVITY_ID']
    .count()
)
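The merge-back step described above isn't shown, so here is a sketch of what it looks like (df_by_student and the final column name are assumptions, and the tiny frames below just stand in for the real ones):

```python
import pandas as pd

# Toy stand-ins for the frames described in the post (names are assumptions)
df_by_student = pd.DataFrame({"STUDENT_ID": ["FredID", "JaniceID"]})
df_temp = pd.DataFrame({"STUDENT_ID": ["FredID"], "ACTIVITY_ID": [1]})

# Merge the per-student count back in; a left merge keeps students that have
# no non-pretest activities, then their NaN is filled with 0
df_by_student = df_by_student.merge(
    df_temp.rename(columns={"ACTIVITY_ID": "unique_non_pretest_activities"}),
    on="STUDENT_ID",
    how="left",
)
df_by_student["unique_non_pretest_activities"] = (
    df_by_student["unique_non_pretest_activities"].fillna(0).astype(int)
)
```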
For the grades, I first create a temporary df at the student x node x grade level, df_std_node_grade
df_std_node_grade = df[['STUDENT_ID', 'NODE_ID', 'min_node_grade']].value_counts().reset_index().copy(deep=True)
Then I can summarize things by grade level
df_temp=df_std_node_grade.query("min_node_grade == 'Kindergarten'").groupby('STUDENT_ID', as_index=False)['NODE_ID'].count()
...but again I would need to create temporary dfs keyed by STUDENT_ID and then merge them back...
Thanks!
I hope I've understood your question right:
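(For a self-contained run, the sample frame can be rebuilt straight from the question's table, keeping spellings such as "Kindegarten" exactly as posted:)

```python
import pandas as pd

df = pd.DataFrame({
    "STUDENT_ID": ["FredID", "FredID", "FredID", "JaniceID", "JaniceID"],
    "STUDENT_ACTIVITY_SESSION_ID": ["gobbledeegook1", "gobbledeegook2",
                                    "gobbledeegook3", "gobbledeegook4",
                                    "gobbledeegook5"],
    "ACTIVITY_START_TIMESTAMP": ["11/10/2022", "3/8/2023", "12/1/2022",
                                 "3/1/2023", "3/8/2023"],
    "MINUTES": [2, 3, 4, 5, 6],
    "NODE_NAME": ["Node1", "Node1", "Node2", "Node3", "Node3"],
    "ACTIVITY_NAME": ["MyActivity1", "MyActivity1", "MyActivity2",
                      "MyActivity3", "MyActivity4"],
    "IS_PRETEST_ACTIVITY": [1, 1, 0, 1, 0],
    "min_node_grade": ["Kindegarten", "Kindegarten", "1st Grade",
                       "2nd Grade", "2nd Grade"],
})
```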
import pandas as pd

def my_group_func(x):
    # x holds all rows for one student
    mask_is_pretest = x["IS_PRETEST_ACTIVITY"].eq(1)

    min_pretest = x.loc[mask_is_pretest, "MINUTES"].sum()
    min_not_pretest = x.loc[~mask_is_pretest, "MINUTES"].sum()
    total_min = x["MINUTES"].sum()

    unique_pretest = x.loc[mask_is_pretest, "ACTIVITY_NAME"].nunique()
    unique_not_pretest = x.loc[~mask_is_pretest, "ACTIVITY_NAME"].nunique()
    unique_nodes = x["NODE_NAME"].nunique()

    # Grade buckets, matching the strings exactly as they appear in the data
    k_nodes_started = x["min_node_grade"].eq("Kindegarten").sum()
    first_nodes_started = x["min_node_grade"].eq("1st Grade").sum()
    second_and_beyond_nodes_started = (
        ~x["min_node_grade"].isin(["Kindegarten", "1st Grade"])
    ).sum()

    return pd.Series({
        "min_pretest": min_pretest,
        "min_not_pretest": min_not_pretest,
        "total_min": total_min,
        "unique_pretest": unique_pretest,
        "unique_not_pretest": unique_not_pretest,
        "unique_nodes": unique_nodes,
        "k_nodes_started": k_nodes_started,
        "first_nodes_started": first_nodes_started,
        "second_and_beyond_nodes_started": second_and_beyond_nodes_started,
    })
print(df.groupby("STUDENT_ID").apply(my_group_func).reset_index())
Prints:
STUDENT_ID min_pretest min_not_pretest total_min unique_pretest unique_not_pretest unique_nodes k_nodes_started first_nodes_started second_and_beyond_nodes_started
0 FredID 5 4 9 1 1 2 2 1 0
1 JaniceID 5 6 11 1 1 1 0 0 2
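If you'd rather avoid a Python-level apply, the same rollup can also be written as a single named aggregation over masked helper columns — a sketch under the same column assumptions, keeping only the columns the aggregation needs:

```python
import pandas as pd

# Trimmed copy of the sample data (only the columns the rollup uses)
df = pd.DataFrame({
    "STUDENT_ID": ["FredID", "FredID", "FredID", "JaniceID", "JaniceID"],
    "MINUTES": [2, 3, 4, 5, 6],
    "NODE_NAME": ["Node1", "Node1", "Node2", "Node3", "Node3"],
    "ACTIVITY_NAME": ["MyActivity1", "MyActivity1", "MyActivity2",
                      "MyActivity3", "MyActivity4"],
    "IS_PRETEST_ACTIVITY": [1, 1, 0, 1, 0],
    "min_node_grade": ["Kindegarten", "Kindegarten", "1st Grade",
                       "2nd Grade", "2nd Grade"],
})

pre = df["IS_PRETEST_ACTIVITY"].eq(1)
out = (
    df.assign(
        # Zero out minutes that don't belong to the bucket
        min_pre=df["MINUTES"].where(pre, 0),
        min_not=df["MINUTES"].where(~pre, 0),
        # .where with no fill leaves NaN, which nunique ignores
        act_pre=df["ACTIVITY_NAME"].where(pre),
        act_not=df["ACTIVITY_NAME"].where(~pre),
        is_k=df["min_node_grade"].eq("Kindegarten"),
        is_1st=df["min_node_grade"].eq("1st Grade"),
        is_2nd_plus=~df["min_node_grade"].isin(["Kindegarten", "1st Grade"]),
    )
    .groupby("STUDENT_ID", as_index=False)
    .agg(
        min_pretest=("min_pre", "sum"),
        min_not_pretest=("min_not", "sum"),
        total_min=("MINUTES", "sum"),
        unique_pretest=("act_pre", "nunique"),
        unique_not_pretest=("act_not", "nunique"),
        unique_nodes=("NODE_NAME", "nunique"),
        k_nodes_started=("is_k", "sum"),
        first_nodes_started=("is_1st", "sum"),
        second_and_beyond_nodes_started=("is_2nd_plus", "sum"),
    )
)
print(out)
```

This stays vectorized inside each aggregation, which tends to scale better than a per-group Python function on large frames.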