Search code examples
pythonpandasdataframegroup-bycumsum

counting the number of success after certain date every year in Pandas using groupby cumsum


I have a data frame that looks like

Date            Student_ID  Exam_Score
2020-12-24      1           79
2020-12-24      3           100
2020-12-24      4           88
2021-01-19      1           100
2021-01-19      2           100
2021-01-19      3           99
2021-01-19      4           72
2022-09-30      3           100
2022-09-30      2           100
2022-09-30      1           100
2022-09-30      5           46
2023-04-23      3           100
2023-04-23      2           97
2023-04-23      1           100
2024-07-19      2           89  
2024-07-19      1           100
2024-07-19      4           93
2024-07-19      3           100
2024-09-19      1           100 
2024-09-19      2           80
2024-09-19      3           100
2024-09-19      4           80
2024-10-20      1           80
2024-10-20      3           99
2024-10-20      2           80

And I would like to compute a new column called Recent_Full_Marks which uses the following logic: for each Student_ID, compute the number of times that student getting 100 marks on the exam before the current date and after the Sept-1 one year before. So for example, on 2024-11-22, Student 1 has gotten 2 full marks before 2024-11-22 and after 2024-09-01. And the desired column looks like:

Date            Student_ID  Exam_Score      Recent_Full_Marks
2020-12-24      1           79              0
2020-12-24      3           100             0
2020-12-24      4           88              0
2021-01-19      1           100             0
2021-01-19      2           100             0
2021-01-19      3           99              1
2021-01-19      4           72              0
2022-09-30      3           100             0
2022-09-30      2           100             0
2022-09-30      1           100             0
2022-09-30      5           46              0
2023-11-23      3           100             0
2023-11-23      2           97              0
2023-11-23      1           100             0
2024-07-19      2           89              0
2024-07-19      1           100             1
2024-07-19      4           93              0
2024-07-19      3           100             1
2024-09-19      1           100             0
2024-09-19      2           80              0
2024-09-19      3           100             0
2024-09-19      4           80              0
2024-10-20      1           100             1
2024-10-20      3           99              1
2024-10-20      2           80              0
2024-11-22      1           70              2
2024-11-22      3           100             1
2024-11-22      2           78              0

Here is what I have tried:

Date = pd.to_datetime(df['Date'], dayfirst=True)
full = (df.assign(Date=Date)
              .sort_values(['Student_ID','Date'], ascending=[True,True])
              ['Exam_Score'].eq(100))
df['Recent_Full_Marks']=(full.groupby([df['Student_ID'], Date.dt.year], group_keys=False).apply(lambda g: g.shift(1, fill_value=0).cumsum()))

However, the above method only counts the number of full marks after every year start and not every sept-1 and I was unable to modify it to work.


Solution

  • You could use a yearly period that ends on AUG-31 (Y-AUG) and groupby.transform with shift+cumsum:

    df['Recent_Full_Marks'] = (df
     .sort_values('Date')['Exam_Score'].eq(100).astype(int)
     .groupby([df['Student_ID'],
               pd.to_datetime(df['Date']).dt.to_period('Y-AUG')])
     .transform(lambda x: x.shift(fill_value=0).cumsum())
    )
    

    Output:

              Date  Student_ID  Exam_Score  Recent_Full_Marks
    0   2020-12-24           1          79                  0
    1   2020-12-24           3         100                  0
    2   2020-12-24           4          88                  0
    3   2021-01-19           1         100                  0
    4   2021-01-19           2         100                  0
    5   2021-01-19           3          99                  1
    6   2021-01-19           4          72                  0
    7   2022-09-30           3         100                  0
    8   2022-09-30           2         100                  0
    9   2022-09-30           1         100                  0
    10  2022-09-30           5          46                  0
    11  2023-11-23           3         100                  0
    12  2023-11-23           2          97                  0
    13  2023-11-23           1         100                  0
    14  2024-07-19           2          89                  0
    15  2024-07-19           1         100                  1
    16  2024-07-19           4          93                  0
    17  2024-07-19           3         100                  1
    18  2024-09-19           1         100                  0
    19  2024-09-19           2          80                  0
    20  2024-09-19           3         100                  0
    21  2024-09-19           4          80                  0
    22  2024-10-20           1         100                  1
    23  2024-10-20           3          99                  1
    24  2024-10-20           2          80                  0
    25  2024-11-22           1          70                  2
    26  2024-11-22           3         100                  1
    27  2024-11-22           2          78                  0