Search code examples
pythonpython-3.xpandasdataframegroup-by

New column based on last time row value equals some numbers in Pandas dataframe


I have a dataframe sorted in descending order date that records the Rank of students in class and the predicted score.

Date        Student_ID  Rank    Predicted_Score
4/7/2021    33          2       87
13/6/2021   33          4       88
31/3/2021   33          7       88
28/2/2021   33          2       86
14/2/2021   33          10      86
31/1/2021   33          8       86
23/12/2020  33          1       81
8/11/2020   33          3       80
21/10/2020  33          3       80
23/9/2020   33          4       80
20/5/2020   33          3       80
29/4/2020   33          4       80
15/4/2020   33          2       79
26/2/2020   33          3       79
12/2/2020   33          5       79
29/1/2020   33          1       70

I want to create a column called Recent_Predicted_Score that record the last predicted_score where that student actually ranks top 3. So the desired outcome looks like

Date        Student_ID  Rank    Predicted_Score    Recent_Predicted_Score
4/7/2021    33          2       87                 86
13/6/2021   33          4       88                 86
31/3/2021   33          7       88                 86
28/2/2021   33          2       86                 81
14/2/2021   33          10      86                 81
31/1/2021   33          8       86                 81
23/12/2020  33          1       81                 80
8/11/2020   33          3       80                 80
21/10/2020  33          3       80                 80
23/9/2020   33          4       80                 80
20/5/2020   33          3       80                 79
29/4/2020   33          4       80                 79
15/4/2020   33          2       79                 79
26/2/2020   33          3       79                 70
12/2/2020   33          5       79                 70
29/1/2020   33          1       70

Here's what I have tried but it doesn't quite work, not sure if I am on the right track:

df.sort_values(by = ['Student_ID', 'Date'], ascending = [True, False], inplace = True)
lp1 = df['Predicted_Score'].where(df['Rank'].isin([1,2,3])).groupby(df['Student_ID']).bfill()
lp2 = df.groupby(['Student_ID', 'Rank'])['Predicted_Score'].shift(-1)
df = df.assign(Recent_Predicted_Score=lp1.mask(df['Rank'].isin([1,2,3]), lp2))

Thanks in advance.


Solution

  • Try:

    df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
    
    df = df.sort_values(['Student_ID', 'Date'])
    
    df['Recent_Predicted_Score'] = np.where(df['Rank'].isin([1, 2, 3]), df['Predicted_Score'], np.nan)
    df['Recent_Predicted_Score'] = df.groupby('Student_ID', group_keys=False)['Recent_Predicted_Score'].apply(lambda x: x.ffill().shift().fillna(''))
    
    df = df.sort_values(['Student_ID', 'Date'], ascending = [True, False])
    print(df)
    

    Prints:

             Date  Student_ID  Rank  Predicted_Score Recent_Predicted_Score
    0  2021-07-04          33     2               87                   86.0
    1  2021-06-13          33     4               88                   86.0
    2  2021-03-31          33     7               88                   86.0
    3  2021-02-28          33     2               86                   81.0
    4  2021-02-14          33    10               86                   81.0
    5  2021-01-31          33     8               86                   81.0
    6  2020-12-23          33     1               81                   80.0
    7  2020-11-08          33     3               80                   80.0
    8  2020-10-21          33     3               80                   80.0
    9  2020-09-23          33     4               80                   80.0
    10 2020-05-20          33     3               80                   79.0
    11 2020-04-29          33     4               80                   79.0
    12 2020-04-15          33     2               79                   79.0
    13 2020-02-26          33     3               79                   70.0
    14 2020-02-12          33     5               79                   70.0
    15 2020-01-29          33     1               70