Search code examples
rdataframetime-seriesdata.tablelag

Filter R data.frame column having increasing growth rate ordered by another column


The following dataset is a dummy example of the problem I am having.There are 3 columns in my data viz: Date PlayerName and Score .Thus each player's date-wise score is recorded. The task is to find that player with maximum TotalScore (Over all the observations) from the subset of players that fulfills the two following criteria:

  1. There should be a steady increase in players' yearly performance (meaning in each year total score should be larger than the previous year score for a player)
  2. The growth-rate of performance should also increase(meaning the rate of increase of yearly total scores should also increase over time)

Dataframe looks like :

date <- as.Date(x = c('2010/01/01','2010/02/02',
                      '2011/01/01','2011/02/02',
                      '2012/01/01','2012/02/02',
                      '2013/01/01','2013/02/02',
                      '2014/01/01','2014/02/02'),format = "%Y/%m/%d") #toy date column

PlayerName  <- rep(LETTERS[1:5],each=10) # Name Players as A:E
score <- c(100,150,270,300,400,
           100,120,200,400,900,
           100,80,130,70,300,
           100,120,230,650,870,
           100,90,110,450,342)
df <- data.table(date=date,Name=PlayerName,score=score)

> df
          date Name score
 1: 2010-01-01    A   100
 2: 2010-02-02    A   150
 3: 2011-01-01    A   270
 4: 2011-02-02    A   300
 5: 2012-01-01    A   400
 6: 2012-02-02    A   100
 7: 2013-01-01    A   120
 8: 2013-02-02    A   200
 9: 2014-01-01    A   400
10: 2014-02-02    A   900
11: 2010-01-01    B   100
12: 2010-02-02    B    80
13: 2011-01-01    B   130
14: 2011-02-02    B    70
15: 2012-01-01    B   300
16: 2012-02-02    B   100
17: 2013-01-01    B   120
18: 2013-02-02    B   230
19: 2014-01-01    B   650
20: 2014-02-02    B   870
21: 2010-01-01    C   100
22: 2010-02-02    C    90
23: 2011-01-01    C   110
24: 2011-02-02    C   450
25: 2012-01-01    C   342
26: 2012-02-02    C   100
27: 2013-01-01    C   150
28: 2013-02-02    C   270
29: 2014-01-01    C   300
30: 2014-02-02    C   400
31: 2010-01-01    D   100
32: 2010-02-02    D   120
33: 2011-01-01    D   200
34: 2011-02-02    D   400
35: 2012-01-01    D   900
36: 2012-02-02    D   100
37: 2013-01-01    D    80
38: 2013-02-02    D   130
39: 2014-01-01    D    70
40: 2014-02-02    D   300
41: 2010-01-01    E   100
42: 2010-02-02    E   120
43: 2011-01-01    E   230
44: 2011-02-02    E   650
45: 2012-01-01    E   870
46: 2012-02-02    E   100
47: 2013-01-01    E    90
48: 2013-02-02    E   110
49: 2014-01-01    E   450
50: 2014-02-02    E   342

What I have managed to do so far is below :

df[,year := lubridate::year(date)]  # extract the year 

df1 <- df[,.(total_score =sum(score)),.(Name,year)]  # Yearly Aggregated Scores

df1[,total_score_lag := shift(x=total_score,type = 'lag'),.(Name)]  ## creates a players lagged column of score
df1[,growth_rate := round(total_score/total_score_lag,2)]  ## creates ratio of current and past years scores column
df1[,growth_rate_lag := shift(x=growth_rate,type = 'lag'),.(Name)]  #### Creates a lag column of growth column

> df1
    Name year total_score total_score_lag growth_rate growth_rate_lag
 1:    A 2010         100              NA          NA              NA
 2:    A 2011         150             100        1.50              NA
 3:    A 2012         270             150        1.80            1.50
 4:    A 2013         300             270        1.11            1.80
 5:    A 2014         400             300        1.33            1.11
 6:    B 2010         100              NA          NA              NA
 7:    B 2011         120             100        1.20              NA
 8:    B 2012         200             120        1.67            1.20
 9:    B 2013         400             200        2.00            1.67
10:    B 2014         900             400        2.25            2.00
11:    C 2010         100              NA          NA              NA
12:    C 2011          80             100        0.80              NA
13:    C 2012         130              80        1.62            0.80
14:    C 2013          70             130        0.54            1.62
15:    C 2014         300              70        4.29            0.54
16:    D 2010         100              NA          NA              NA
17:    D 2011         120             100        1.20              NA
18:    D 2012         230             120        1.92            1.20
19:    D 2013         650             230        2.83            1.92
20:    D 2014         870             650        1.34            2.83
21:    E 2010         100              NA          NA              NA
22:    E 2011          90             100        0.90              NA
23:    E 2012         110              90        1.22            0.90
24:    E 2013         450             110        4.09            1.22
25:    E 2014         342             450        0.76            4.09

Now I understand that I need to have two conditions validated as

  • filter growth_rate column player_wise with greater than 1 value throughout.
  • filter growth_rate_lag column for the patients whose consecutive row values greater than previous row.

But I cant code for the said logic. Also there could be an alternative way of looking into it too. I would appreciate if anyone helps. Thanks in advance.

Edit 1 : The example I used was not accurate. So an updated example would be like:

date <- as.Date(x = c('2010/01/01','2010/02/02',
                      '2011/01/01','2011/02/02',
                      '2012/01/01','2012/02/02',
                      '2013/01/01','2013/02/02',
                      '2014/01/01','2014/02/02'),format = "%Y/%m/%d")

PlayerName  <- rep(LETTERS[1:5],each=10) # Name Players as A:E
score <- c(40,60,100,50,70,200,120,180,380,20,
           40,60,20,100,150,50,300,100,800,100,
           10,90,30,50,100,30,10,60,100,200,
           50,50,100,20,200,30,400,60,570,400,
           80,20,70,20,100,10,400,50,142,200)
df <- data.table(date=date,Name=Name,score=score)
df[,year := lubridate::year(date)]  # extract the year 

df1 <- df[,.(total_score =sum(score)),.(Name,year)]  # Yearly Aggregated Scores

df1[,total_score_lag := shift(x=total_score,type = 'lag'),.(Name)]  ## creates a players lagged column of score
df1[,growth_rate := round(total_score/total_score_lag,2)]  ## creates ratio of current and past years scores column
df1[,growth_rate_lag := shift(x=growth_rate,type = 'lag'),.(Name)]  #### Creates a lag column of growth column

  Name year total_score total_score_lag growth_rate growth_rate_lag
 1:    A 2010         100              NA          NA              NA
 2:    A 2011         150             100        1.50              NA
 3:    A 2012         270             150        1.80            1.50
 4:    A 2013         300             270        1.11            1.80
 5:    A 2014         400             300        1.33            1.11
 6:    B 2010         100              NA          NA              NA
 7:    B 2011         120             100        1.20              NA
 8:    B 2012         200             120        1.67            1.20
 9:    B 2013         400             200        2.00            1.67
10:    B 2014         900             400        2.25            2.00
11:    C 2010         100              NA          NA              NA
12:    C 2011          80             100        0.80              NA
13:    C 2012         130              80        1.62            0.80
14:    C 2013          70             130        0.54            1.62
15:    C 2014         300              70        4.29            0.54
16:    D 2010         100              NA          NA              NA
17:    D 2011         120             100        1.20              NA
18:    D 2012         230             120        1.92            1.20
19:    D 2013         460             230        2.00            1.92
20:    D 2014         970             460        2.11            2.00
21:    E 2010         100              NA          NA              NA
22:    E 2011          90             100        0.90              NA
23:    E 2012         110              90        1.22            0.90
24:    E 2013         450             110        4.09            1.22
25:    E 2014         342             450        0.76            4.09
    

Now here clearly Player A,B ,D meets condition1 and but only B and D meets condition 2. And as D has highest total_score answer is D.


Solution

  • With data.table you could use cumsum to select player up to the last year it achieved increasing score growth rate :

    df1[,selected :=cumsum(fifelse(growth_rate>growth_rate_lag|is.na(growth_rate_lag),1L,NA_integer_)),by=Name]
    df1[selected>0]
    
        Name year total_score total_score_lag growth_rate growth_rate_lag selected
     1:    A 2010         250              NA          NA              NA        1
     2:    A 2011         570             250        2.28              NA        2
     3:    B 2010         180              NA          NA              NA        1
     4:    B 2011         200             180        1.11              NA        2
     5:    B 2012         400             200        2.00            1.11        3
     6:    C 2010         190              NA          NA              NA        1
     7:    C 2011         560             190        2.95              NA        2
     8:    D 2010         220              NA          NA              NA        1
     9:    D 2011         600             220        2.73              NA        2
    10:    E 2010         220              NA          NA              NA        1
    11:    E 2011         880             220        4.00              NA        2
    

    As noted in the other answers, no player achieved increasing rate in this dataset.