
Summarizing data: how to assign a nested group-by to a new column


I am trying to summarize data that involves nested group-bys, rolling everything up by user ID. There has to be an easier way to do this! This would be easy in SQL...

The data is at a granularity of user x session x activity. Activities are nested within nodes. I need to roll it up to the USER (aka student) level, and count unique activities and unique nodes that satisfy specific criteria.

The data starts as (example fake data):

STUDENT_ID  STUDENT_ACTIVITY_SESSION_ID  ACTIVITY_START_TIMESTAMP  MINUTES  NODE_NAME  ACTIVITY_NAME  IS_PRETEST_ACTIVITY  min_node_grade
FredID      gobbledeegook1               11/10/2022                2        Node1      MyActivity1    1                    Kindergarten
FredID      gobbledeegook2               3/8/2023                  3        Node1      MyActivity1    1                    Kindergarten
FredID      gobbledeegook3               12/1/2022                 4        Node2      MyActivity2    0                    1st Grade
JaniceID    gobbledeegook4               3/1/2023                  5        Node3      MyActivity3    1                    2nd Grade
JaniceID    gobbledeegook5               3/8/2023                  6        Node3      MyActivity4    0                    2nd Grade
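
As a reproducible DataFrame (timestamps kept as strings for simplicity):

import pandas as pd

# Reproducible version of the sample data above
df = pd.DataFrame({
    "STUDENT_ID": ["FredID", "FredID", "FredID", "JaniceID", "JaniceID"],
    "STUDENT_ACTIVITY_SESSION_ID": ["gobbledeegook1", "gobbledeegook2",
                                    "gobbledeegook3", "gobbledeegook4",
                                    "gobbledeegook5"],
    "ACTIVITY_START_TIMESTAMP": ["11/10/2022", "3/8/2023", "12/1/2022",
                                 "3/1/2023", "3/8/2023"],
    "MINUTES": [2, 3, 4, 5, 6],
    "NODE_NAME": ["Node1", "Node1", "Node2", "Node3", "Node3"],
    "ACTIVITY_NAME": ["MyActivity1", "MyActivity1", "MyActivity2",
                      "MyActivity3", "MyActivity4"],
    "IS_PRETEST_ACTIVITY": [1, 1, 0, 1, 0],
    "min_node_grade": ["Kindergarten", "Kindergarten", "1st Grade",
                       "2nd Grade", "2nd Grade"],
})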

I want to get to:

STUDENT_ID  min_pretest  min_not_pretest  total_min  unique_pretest_activities  unique_non_pretest_activities  UNIQUE_NODES  K_nodes_started  1stgr_nodes_started  2ndgr_and_beyond_nodes_started
FredID      5            4                9          1                          1                              2             1                1                    0
JaniceID    5            6                11         1                          1                              1             0                0                    2

I first created a DataFrame with all the student IDs and information about them.

Then I aim to add a column that is the result, per student, of grouping by activity and counting unique activities (split by whether the pretest flag is true or false). The below kind of works, but then I have to create a temporary DataFrame, merge it back to df_by_student, and assign it to a column... and I have to do this for each column. There has to be an easier way??

df_temp = (
    df.loc[df['IS_PRETEST_ACTIVITY'] == 0, ['STUDENT_ID', 'ACTIVITY_NAME']]
    .value_counts()
    .reset_index()
    .groupby('STUDENT_ID', as_index=False)['ACTIVITY_NAME']
    .count()
)

For the grades, I first create a temporary student x node x grade DataFrame, df_std_node_grade:

df_std_node_grade = df[['STUDENT_ID', 'NODE_NAME', 'min_node_grade']].value_counts().reset_index()

Then I can summarize things by grade level:

df_temp = df_std_node_grade.query("min_node_grade == 'Kindergarten'").groupby('STUDENT_ID', as_index=False)['NODE_NAME'].count()

...but again I would need to create temporary DataFrames keyed by student ID and then merge them back, once per summary column (see the sketch below).
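
The repeated pattern would look something like this for each column (renaming the count column to K_nodes_started is just an example):

# merge the per-student counts back, then rename the count column
df_by_student = df_by_student.merge(df_temp, on='STUDENT_ID', how='left')
df_by_student = df_by_student.rename(columns={'NODE_NAME': 'K_nodes_started'})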

Thanks!


Solution

  • I hope I've understood your question right:

    import pandas as pd

    def my_group_func(x):
        # x is the sub-frame for one student
        mask_is_pretest = x["IS_PRETEST_ACTIVITY"].eq(1)

        # minutes, split by the pretest flag
        min_pretest = x.loc[mask_is_pretest, "MINUTES"].sum()
        min_not_pretest = x.loc[~mask_is_pretest, "MINUTES"].sum()
        total_min = x["MINUTES"].sum()

        # distinct activities and nodes
        unique_pretest = x.loc[mask_is_pretest, "ACTIVITY_NAME"].nunique()
        unique_not_pretest = x.loc[~mask_is_pretest, "ACTIVITY_NAME"].nunique()
        unique_nodes = x["NODE_NAME"].nunique()

        # activity rows started at each grade band
        k_nodes_started = x["min_node_grade"].eq("Kindergarten").sum()
        first_nodes_started = x["min_node_grade"].eq("1st Grade").sum()
        second_and_beyond_nodes_started = (
            ~x["min_node_grade"].isin(["Kindergarten", "1st Grade"])
        ).sum()

        return pd.Series({
            "min_pretest": min_pretest,
            "min_not_pretest": min_not_pretest,
            "total_min": total_min,
            "unique_pretest": unique_pretest,
            "unique_not_pretest": unique_not_pretest,
            "unique_nodes": unique_nodes,
            "k_nodes_started": k_nodes_started,
            "first_nodes_started": first_nodes_started,
            "second_and_beyond_nodes_started": second_and_beyond_nodes_started,
        })


    print(df.groupby("STUDENT_ID").apply(my_group_func).reset_index())
    

    Prints:

      STUDENT_ID  min_pretest  min_not_pretest  total_min  unique_pretest  unique_not_pretest  unique_nodes  k_nodes_started  first_nodes_started  second_and_beyond_nodes_started
    0     FredID            5                4          9               1                   1             2                2                    1                                0
    1   JaniceID            5                6         11               1                   1             1                0                    0                                2
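
    If groupby().apply feels slow on a large frame (it runs Python code per group), here is a vectorized sketch of the same roll-up using named aggregation. It assumes the sample's column names; the helper column names are invented for illustration:

    import pandas as pd

    # Sketch: precompute masked helper columns once, then aggregate everything
    # in a single groupby via named aggregation.
    pretest = df["IS_PRETEST_ACTIVITY"].eq(1)

    out = (
        df.assign(
            pretest_minutes=df["MINUTES"].where(pretest, 0),
            non_pretest_minutes=df["MINUTES"].where(~pretest, 0),
            pretest_activity=df["ACTIVITY_NAME"].where(pretest),       # NaN when not pretest
            non_pretest_activity=df["ACTIVITY_NAME"].where(~pretest),  # NaN when pretest
            is_k=df["min_node_grade"].eq("Kindergarten"),
            is_1st=df["min_node_grade"].eq("1st Grade"),
            is_2nd_plus=~df["min_node_grade"].isin(["Kindergarten", "1st Grade"]),
        )
        .groupby("STUDENT_ID", as_index=False)
        .agg(
            min_pretest=("pretest_minutes", "sum"),
            min_not_pretest=("non_pretest_minutes", "sum"),
            total_min=("MINUTES", "sum"),
            unique_pretest=("pretest_activity", "nunique"),   # nunique skips NaN
            unique_not_pretest=("non_pretest_activity", "nunique"),
            unique_nodes=("NODE_NAME", "nunique"),
            k_nodes_started=("is_k", "sum"),
            first_nodes_started=("is_1st", "sum"),
            second_and_beyond_nodes_started=("is_2nd_plus", "sum"),
        )
    )
    print(out)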