Search code examples
pythonpandasdataframegroup-by

Pandas .groupby and .mean() based on conditions


I have the following large dataset recording the result of a math competition among students in descending order of date: So for example, student 1 comes third in Race 1 while student 3 won Race 2, etc.

Race_ID   Date           Student_ID      Rank    Studying_hours
1         1/1/2023       1               3       2
1         1/1/2023       2               2       5
1         1/1/2023       3               1       7
1         1/1/2023       4               4       1
2         11/9/2022      1               2       4
2         11/9/2022      2               3       2
2         11/9/2022      3               1       8
3         17/4/2022      5               4       3
3         17/4/2022      2               1       7
3         17/4/2022      3               2       2
3         17/4/2022      4               3       3
4         1/3/2022       1               3       7
4         1/3/2022       2               2       2 
5         1/1/2021       1               2       2
5         1/1/2021       2               3       3
5         1/1/2021       3               1       6

and I want to generate a new column called "winning_past_studying_hours" which is the average studying hours of his past competitions and where he ended up with Rank 1 or 2.

So for example, for student 1:

Race_ID Date    Student_ID  Rank    Studying_hours
1   1/1/2023    1           3       2
2   11/9/2022   1           2       4
4   1/3/2022    1           3       7
5   1/1/2021    1           2       2

the column looks like

Race_ID Date    Student_ID  Rank    Studying_hours    winning_past_studying_hours
1   1/1/2023    1           3       2                 (4+2)/2 = 3
2   11/9/2022   1           2       4                 2/1 = 2
4   1/3/2022    1           3       7                 2/1= 2
5   1/1/2021    1           2       2                 NaN

Similarly, for student 2:

Race_ID Date    Student_ID  Rank    Studying_hours
1   1/1/2023    2           2       5
2   11/9/2022   2           3       2
3   17/4/2022   2           1       7
4   1/3/2022    2           2       2
5   1/1/2021    2           3       3

The column looks like

Race_ID Date    Student_ID  Rank    Studying_hours    winning_past_studying_hours
1   1/1/2023    2           2       5                 (7+2)/2=4.5
2   11/9/2022   2           3       2                 (7+2)/2=4.5
3   17/4/2022   2           1       7                 2/1=2
4   1/3/2022    2           2       2                 NaN
5   1/1/2021    2           3       3                 NaN

I know the basic groupby and mean function but I do not know how to include the condition Rank.isin([1,2]) in the groupby function. Thank you so much.

EDIT: Desired output:

Race_ID   Date           Student_ID      Rank    Studying_hours  winning_past_studying_hours
1         1/1/2023       1               3       2               3
1         1/1/2023       2               2       5               4.5
1         1/1/2023       3               1       7               5.333
1         1/1/2023       4               4       1               NaN
2         11/9/2022      1               2       4               2
2         11/9/2022      2               3       2               4.5
2         11/9/2022      3               1       8               4
3         17/4/2022      5               4       3               NaN
3         17/4/2022      2               1       7               2
3         17/4/2022      3               2       2               6
3         17/4/2022      4               3       3               NaN
4         1/3/2022       1               3       7               2
4         1/3/2022       2               2       2               NaN
5         1/1/2021       1               2       2               NaN
5         1/1/2021       2               3       3               NaN
5         1/1/2021       3               1       6               NaN

Solution

  • We replace the studying hours for every competition a student didn't "win" with np.NaN which has no impact on the calculation of the mean.

    Use a window function rolling with a large number to get an expanding window over the entries and compute the past running mean by specifying closed='left' which discards the most recent entry.

    Then we join back.

    
    
    large_number=100000
    
    df = pd.DataFrame(data)
    df['Date']=pd.to_datetime(df['Date'])
    df['Studying_hours']=((df.Rank<3)*df.Studying_hours).replace({0:np.NaN}) # This is more performant than an apply with a lambda
    winning=df.sort_values('Date').groupby('Student_ID')['Studying_hours'].rolling(large_number,closed='left',min_periods=1).mean()
    df['past_winning_hours_mean']=winning.reset_index(level=0, drop=True)
    

    Test:

    >>> df.sort_values(['Date', 'Student_ID'])
    

    Output:

    Race_ID Date    Student_ID  Rank    Studying_hours  past_winning_hours_mean
    13  5   2021-01-01  1   2   2.0 NaN
    14  5   2021-01-01  2   3   NaN NaN
    15  5   2021-01-01  3   1   6.0 NaN
    11  4   2022-01-03  1   3   NaN 2.000000
    12  4   2022-01-03  2   2   2.0 NaN
    8   3   2022-04-17  2   1   7.0 2.000000
    9   3   2022-04-17  3   2   2.0 6.000000
    10  3   2022-04-17  4   3   NaN NaN
    7   3   2022-04-17  5   4   NaN NaN
    4   2   2022-11-09  1   2   4.0 2.000000
    5   2   2022-11-09  2   3   NaN 4.500000
    6   2   2022-11-09  3   1   8.0 4.000000
    0   1   2023-01-01  1   3   NaN 3.000000
    1   1   2023-01-01  2   2   5.0 4.500000
    2   1   2023-01-01  3   1   7.0 5.333333
    3   1   2023-01-01  4   4   NaN NaN
    

    I profiled this code on a dataset with 30000 rows:

    6.07 ms ± 45.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)