The following dataset is a dummy example of the problem I am having.There are 3 columns in my data viz: Date
PlayerName
and Score
.Thus each player's date-wise score is recorded.
The task is to find that player with maximum TotalScore (Over all the observations) from the subset of players that fulfills the two following criteria:
Dataframe looks like :
date <- as.Date(x = c('2010/01/01','2010/02/02',
'2011/01/01','2011/02/02',
'2012/01/01','2012/02/02',
'2013/01/01','2013/02/02',
'2014/01/01','2014/02/02'),format = "%Y/%m/%d") #toy date column
PlayerName <- rep(LETTERS[1:5],each=10) # Name Players as A:E
score <- c(100,150,270,300,400,
100,120,200,400,900,
100,80,130,70,300,
100,120,230,650,870,
100,90,110,450,342)
df <- data.table(date=date,Name=PlayerName,score=score)
> df
date Name score
1: 2010-01-01 A 100
2: 2010-02-02 A 150
3: 2011-01-01 A 270
4: 2011-02-02 A 300
5: 2012-01-01 A 400
6: 2012-02-02 A 100
7: 2013-01-01 A 120
8: 2013-02-02 A 200
9: 2014-01-01 A 400
10: 2014-02-02 A 900
11: 2010-01-01 B 100
12: 2010-02-02 B 80
13: 2011-01-01 B 130
14: 2011-02-02 B 70
15: 2012-01-01 B 300
16: 2012-02-02 B 100
17: 2013-01-01 B 120
18: 2013-02-02 B 230
19: 2014-01-01 B 650
20: 2014-02-02 B 870
21: 2010-01-01 C 100
22: 2010-02-02 C 90
23: 2011-01-01 C 110
24: 2011-02-02 C 450
25: 2012-01-01 C 342
26: 2012-02-02 C 100
27: 2013-01-01 C 150
28: 2013-02-02 C 270
29: 2014-01-01 C 300
30: 2014-02-02 C 400
31: 2010-01-01 D 100
32: 2010-02-02 D 120
33: 2011-01-01 D 200
34: 2011-02-02 D 400
35: 2012-01-01 D 900
36: 2012-02-02 D 100
37: 2013-01-01 D 80
38: 2013-02-02 D 130
39: 2014-01-01 D 70
40: 2014-02-02 D 300
41: 2010-01-01 E 100
42: 2010-02-02 E 120
43: 2011-01-01 E 230
44: 2011-02-02 E 650
45: 2012-01-01 E 870
46: 2012-02-02 E 100
47: 2013-01-01 E 90
48: 2013-02-02 E 110
49: 2014-01-01 E 450
50: 2014-02-02 E 342
What I have managed to do so far is below :
df[,year := lubridate::year(date)] # extract the year
df1 <- df[,.(total_score =sum(score)),.(Name,year)] # Yearly Aggregated Scores
df1[,total_score_lag := shift(x=total_score,type = 'lag'),.(Name)] ## creates a players lagged column of score
df1[,growth_rate := round(total_score/total_score_lag,2)] ## creates ratio of current and past years scores column
df1[,growth_rate_lag := shift(x=growth_rate,type = 'lag'),.(Name)] #### Creates a lag column of growth column
> df1
Name year total_score total_score_lag growth_rate growth_rate_lag
1: A 2010 100 NA NA NA
2: A 2011 150 100 1.50 NA
3: A 2012 270 150 1.80 1.50
4: A 2013 300 270 1.11 1.80
5: A 2014 400 300 1.33 1.11
6: B 2010 100 NA NA NA
7: B 2011 120 100 1.20 NA
8: B 2012 200 120 1.67 1.20
9: B 2013 400 200 2.00 1.67
10: B 2014 900 400 2.25 2.00
11: C 2010 100 NA NA NA
12: C 2011 80 100 0.80 NA
13: C 2012 130 80 1.62 0.80
14: C 2013 70 130 0.54 1.62
15: C 2014 300 70 4.29 0.54
16: D 2010 100 NA NA NA
17: D 2011 120 100 1.20 NA
18: D 2012 230 120 1.92 1.20
19: D 2013 650 230 2.83 1.92
20: D 2014 870 650 1.34 2.83
21: E 2010 100 NA NA NA
22: E 2011 90 100 0.90 NA
23: E 2012 110 90 1.22 0.90
24: E 2013 450 110 4.09 1.22
25: E 2014 342 450 0.76 4.09
Now I understand that I need to have two conditions validated as
growth_rate
column player_wise with greater than 1 value throughout.growth_rate_lag
column for the patients whose consecutive row values greater than previous row.But I cant code for the said logic. Also there could be an alternative way of looking into it too. I would appreciate if anyone helps. Thanks in advance.
Edit 1 : The example I used was not accurate. So an updated example would be like:
date <- as.Date(x = c('2010/01/01','2010/02/02',
'2011/01/01','2011/02/02',
'2012/01/01','2012/02/02',
'2013/01/01','2013/02/02',
'2014/01/01','2014/02/02'),format = "%Y/%m/%d")
PlayerName <- rep(LETTERS[1:5],each=10) # Name Players as A:E
score <- c(40,60,100,50,70,200,120,180,380,20,
40,60,20,100,150,50,300,100,800,100,
10,90,30,50,100,30,10,60,100,200,
50,50,100,20,200,30,400,60,570,400,
80,20,70,20,100,10,400,50,142,200)
df <- data.table(date=date,Name=Name,score=score)
df[,year := lubridate::year(date)] # extract the year
df1 <- df[,.(total_score =sum(score)),.(Name,year)] # Yearly Aggregated Scores
df1[,total_score_lag := shift(x=total_score,type = 'lag'),.(Name)] ## creates a players lagged column of score
df1[,growth_rate := round(total_score/total_score_lag,2)] ## creates ratio of current and past years scores column
df1[,growth_rate_lag := shift(x=growth_rate,type = 'lag'),.(Name)] #### Creates a lag column of growth column
Name year total_score total_score_lag growth_rate growth_rate_lag
1: A 2010 100 NA NA NA
2: A 2011 150 100 1.50 NA
3: A 2012 270 150 1.80 1.50
4: A 2013 300 270 1.11 1.80
5: A 2014 400 300 1.33 1.11
6: B 2010 100 NA NA NA
7: B 2011 120 100 1.20 NA
8: B 2012 200 120 1.67 1.20
9: B 2013 400 200 2.00 1.67
10: B 2014 900 400 2.25 2.00
11: C 2010 100 NA NA NA
12: C 2011 80 100 0.80 NA
13: C 2012 130 80 1.62 0.80
14: C 2013 70 130 0.54 1.62
15: C 2014 300 70 4.29 0.54
16: D 2010 100 NA NA NA
17: D 2011 120 100 1.20 NA
18: D 2012 230 120 1.92 1.20
19: D 2013 460 230 2.00 1.92
20: D 2014 970 460 2.11 2.00
21: E 2010 100 NA NA NA
22: E 2011 90 100 0.90 NA
23: E 2012 110 90 1.22 0.90
24: E 2013 450 110 4.09 1.22
25: E 2014 342 450 0.76 4.09
Now here clearly Player A,B ,D meets condition1 and but only B and D meets condition 2. And as D has highest total_score
answer is D.
With data.table
you could use cumsum
to select player up to the last year it achieved increasing score growth rate :
df1[,selected :=cumsum(fifelse(growth_rate>growth_rate_lag|is.na(growth_rate_lag),1L,NA_integer_)),by=Name]
df1[selected>0]
Name year total_score total_score_lag growth_rate growth_rate_lag selected
1: A 2010 250 NA NA NA 1
2: A 2011 570 250 2.28 NA 2
3: B 2010 180 NA NA NA 1
4: B 2011 200 180 1.11 NA 2
5: B 2012 400 200 2.00 1.11 3
6: C 2010 190 NA NA NA 1
7: C 2011 560 190 2.95 NA 2
8: D 2010 220 NA NA NA 1
9: D 2011 600 220 2.73 NA 2
10: E 2010 220 NA NA NA 1
11: E 2011 880 220 4.00 NA 2
As noted in the other answers, no player achieved increasing rate in this dataset.