I have the following large dataset recording the results of a math competition among students, in descending order of date. For example, student 1 came third in Race 1, while student 3 won Race 2, etc.
Race_ID Date Student_ID Rank Studying_hours
1 1/1/2023 1 3 2
1 1/1/2023 2 2 5
1 1/1/2023 3 1 7
1 1/1/2023 4 4 1
2 11/9/2022 1 2 4
2 11/9/2022 2 3 2
2 11/9/2022 3 1 8
3 17/4/2022 5 4 3
3 17/4/2022 2 1 7
3 17/4/2022 3 2 2
3 17/4/2022 4 3 3
4 1/3/2022 1 3 7
4 1/3/2022 2 2 2
5 1/1/2021 1 2 2
5 1/1/2021 2 3 3
5 1/1/2021 3 1 6
and I want to generate a new column called "winning_past_studying_hours", which is the average studying hours over that student's past competitions in which they finished with Rank 1 or 2.
So for example, for student 1:
Race_ID Date Student_ID Rank Studying_hours
1 1/1/2023 1 3 2
2 11/9/2022 1 2 4
4 1/3/2022 1 3 7
5 1/1/2021 1 2 2
the column looks like
Race_ID Date Student_ID Rank Studying_hours winning_past_studying_hours
1 1/1/2023 1 3 2 (4+2)/2 = 3
2 11/9/2022 1 2 4 2/1 = 2
4 1/3/2022 1 3 7 2/1= 2
5 1/1/2021 1 2 2 NaN
Similarly, for student 2:
Race_ID Date Student_ID Rank Studying_hours
1 1/1/2023 2 2 5
2 11/9/2022 2 3 2
3 17/4/2022 2 1 7
4 1/3/2022 2 2 2
5 1/1/2021 2 3 3
The column looks like
Race_ID Date Student_ID Rank Studying_hours winning_past_studying_hours
1 1/1/2023 2 2 5 (7+2)/2=4.5
2 11/9/2022 2 3 2 (7+2)/2=4.5
3 17/4/2022 2 1 7 2/1=2
4 1/3/2022 2 2 2 NaN
5 1/1/2021 2 3 3 NaN
I know the basic groupby and mean functions, but I do not know how to include the condition Rank.isin([1, 2]) in the groupby. Thank you so much.
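For reference, the unconditional per-student mean I know how to compute looks like the sketch below (a toy subset of the data above); it averages all of a student's races rather than only past winning ones:

```python
import pandas as pd

df = pd.DataFrame({
    'Student_ID':     [1, 1, 2, 2],
    'Rank':           [3, 2, 2, 1],
    'Studying_hours': [2, 4, 5, 7],
})

# Plain per-student mean -- no "past only" and no "Rank 1 or 2" condition.
naive = df.groupby('Student_ID')['Studying_hours'].transform('mean')
```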
EDIT: Desired output:
Race_ID Date Student_ID Rank Studying_hours winning_past_studying_hours
1 1/1/2023 1 3 2 3
1 1/1/2023 2 2 5 4.5
1 1/1/2023 3 1 7 5.333
1 1/1/2023 4 4 1 NaN
2 11/9/2022 1 2 4 2
2 11/9/2022 2 3 2 4.5
2 11/9/2022 3 1 8 4
3 17/4/2022 5 4 3 NaN
3 17/4/2022 2 1 7 2
3 17/4/2022 3 2 2 6
3 17/4/2022 4 3 3 NaN
4 1/3/2022 1 3 7 2
4 1/3/2022 2 2 2 NaN
5 1/1/2021 1 2 2 NaN
5 1/1/2021 2 3 3 NaN
5 1/1/2021 3 1 6 NaN
We replace the studying hours for every competition the student didn't "win" with np.nan, which has no impact on the calculation of the mean. Then we use the window function rolling with a very large window size to emulate an expanding window over the entries, and compute the past running mean by specifying closed='left', which excludes the current (most recent) entry. Finally, we join the result back.
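To see what closed='left' does on its own, here is a minimal sketch on a toy Series (values chosen arbitrarily):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# With closed='left' the window ends just before the current row,
# so each entry is the mean of all strictly earlier values.
# The first entry has no past values and stays NaN.
past_mean = s.rolling(100, closed='left', min_periods=1).mean()
```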
import numpy as np
import pandas as pd

df = pd.DataFrame(data)  # `data` holds the records shown in the question
df['Date'] = pd.to_datetime(df['Date'])  # d/m vs m/d is ambiguous here; pass dayfirst=True for day-first dates
# Keep hours only for winning races (Rank 1 or 2); the rest become NaN, which mean() skips.
# This is more performant than an apply with a lambda, but assumes no winner studied exactly 0 hours.
df['Studying_hours'] = ((df.Rank < 3) * df.Studying_hours).replace({0: np.nan})
# A window far larger than any student's race count makes rolling behave like an expanding window;
# closed='left' drops the current row, leaving only strictly past races.
large_number = 100000
winning = df.sort_values('Date').groupby('Student_ID')['Studying_hours'].rolling(large_number, closed='left', min_periods=1).mean()
df['past_winning_hours_mean'] = winning.reset_index(level=0, drop=True)
Test:
>>> df.sort_values(['Date', 'Student_ID'])
Output:
Race_ID Date Student_ID Rank Studying_hours past_winning_hours_mean
13 5 2021-01-01 1 2 2.0 NaN
14 5 2021-01-01 2 3 NaN NaN
15 5 2021-01-01 3 1 6.0 NaN
11 4 2022-01-03 1 3 NaN 2.000000
12 4 2022-01-03 2 2 2.0 NaN
8 3 2022-04-17 2 1 7.0 2.000000
9 3 2022-04-17 3 2 2.0 6.000000
10 3 2022-04-17 4 3 NaN NaN
7 3 2022-04-17 5 4 NaN NaN
4 2 2022-11-09 1 2 4.0 2.000000
5 2 2022-11-09 2 3 NaN 4.500000
6 2 2022-11-09 3 1 8.0 4.000000
0 1 2023-01-01 1 3 NaN 3.000000
1 1 2023-01-01 2 2 5.0 4.500000
2 1 2023-01-01 3 1 7.0 5.333333
3 1 2023-01-01 4 4 NaN NaN
I profiled this code on a dataset with 30000 rows:
6.07 ms ± 45.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
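As a side note, the same "strictly past mean" can be written with shift plus an expanding window, avoiding the arbitrary large window size. A sketch on a small frame (one student, dates already in ISO form; names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-01', '2022-11-09', '2023-01-01']),
    'Student_ID':     [1, 1, 1],
    'Rank':           [2, 2, 3],
    'Studying_hours': [2.0, 4.0, 2.0],
})

# Keep hours only for winning races (Rank 1 or 2); others become NaN.
hours = df['Studying_hours'].where(df['Rank'] < 3)

# shift(1) drops the current race; expanding().mean() averages all
# earlier races per student, skipping the NaN (non-winning) entries.
df['past_winning_hours_mean'] = (
    df.sort_values('Date')
      .assign(h=hours)
      .groupby('Student_ID')['h']
      .transform(lambda s: s.shift(1).expanding().mean())
)
```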