I'm performing an analysis on basketball data. This is how my dataset looks like (a really exemplified version of it):
df<-data.frame(gmID = 1:20,
H.Team = c("CLE", "MIA", "LAL", "PHI", "CLE", "DET", "CHI", "DAL", "UTA", "PHO", "POR", "WAS", "ORL", "CHA", "BOS", "CHI", "ATL", "DAL", "CLE", "WAS"),
A.Team = c("WAS", "BOS", "DAL", "DEN", "IND", "HOU", "SAC", "WAS", "DAL", "CLE", "LAL", "OKC", "DEN", "IND", "MIL", "CLE", "HOU", "MIA", "UTA", "DEN"),
H.PTS = c(94, 120, 91, 84, 88, 96, 93, 95, 113, 85, 116, 86, 102, 90, 88, 86, 102, 104, 88, 111),
A.PTS = c(84, 107, 99, 75, 90, 105, 87, 99, 94, 87, 106, 84, 89, 89, 99, 115, 109, 84, 86, 88),
H.AST = c(22, 25, 24, 18, 18, 21, 21, 26, 24, 16, 19, 27, 24, 18, 22, 23, 23, 27, 23, 25),
A.AST = c(26, 24, 22, 19, 22, 28, 14, 22, 20, 19, 21, 18, 22, 19, 26, 34, 22, 18, 19, 16))
df
gmID H.Team A.Team H.PTS A.PTS H.AST A.AST
1 1 CLE WAS 94 84 22 26
2 2 MIA BOS 120 107 25 24
3 3 LAL DAL 91 99 24 22
4 4 PHI DEN 84 75 18 19
5 5 CLE IND 88 90 18 22
6 6 DET HOU 96 105 21 28
7 7 CHI SAC 93 87 21 14
8 8 DAL WAS 95 99 26 22
9 9 UTA DAL 113 94 24 20
10 10 PHO CLE 85 87 16 19
11 11 POR LAL 116 106 19 21
12 12 WAS OKC 86 84 27 18
13 13 ORL DEN 102 89 24 22
14 14 CHA IND 90 89 18 19
15 15 BOS MIL 88 99 22 26
16 16 CHI CLE 86 115 23 34
17 17 ATL HOU 102 109 23 22
18 18 DAL MIA 104 84 27 18
19 19 CLE UTA 88 86 23 19
20 20 WAS DEN 111 88 25 16
To simplify the problem, I selected 20 rows from the original dataset and just 2 pair of game stats, points (PTS) and assist (AST) made by home (H.) and away (A.) team (the are 50 more pairs of game stats in my data). Note that rows are sorted by date, so they don't need to get arranged.
The goal is to perform a prediction of the game result (1 if H.Team wins, 0 if A.Team wins), however the form the data is in now is useless, because the stats are reported once the match ended.
So, the idea is to replace each value of the data frame with the rolling mean of the previous n match. I will set n=7
or n=10
in my work, but that doesn't really matter now, so to simplify I will set n=3
.
The output should look like this:
gmID H.Team A.Team H.PTSav A.PTSav H.ASTav A.ASTav
1 1 CLE WAS NA NA NA NA
2 2 MIA BOS NA NA NA NA
3 3 LAL DAL NA NA NA NA
4 4 PHI DEN NA NA NA NA
5 5 CLE IND 94.00000 NA 22.00000 NA
6 6 DET HOU NA NA NA NA
7 7 CHI SAC NA NA NA NA
8 8 DAL WAS 99.00000 84.00000 22.00000 26.00000
9 9 UTA DAL NA 97.00000 NA 24.00000
10 10 PHO CLE NA 91.00000 NA 20.00000
11 11 POR LAL NA 91.00000 NA 24.00000
12 12 WAS OKC 91.50000 NA 24.00000 NA
13 13 ORL DEN NA 75.00000 NA 19.00000
14 14 CHA IND NA 90.00000 NA 22.00000
15 15 BOS MIL 107.00000 NA 24.00000 NA
16 16 CHI CLE 93.00000 89.66667 21.00000 19.66667
17 17 ATL HOU NA 105.00000 NA 28.00000
18 18 DAL MIA 96.00000 120.00000 22.66667 25.00000
19 19 CLE UTA 96.66667 113.00000 23.66667 24.00000
20 20 WAS DEN 89.66667 82.00000 25.00000 20.50000
For example, for team CLE
, that played 5 games, the average PTS
values are the following:
gmID avpts
1 1 NA ---> NA
2 5 94.00000 ---> 94/1
3 10 91.00000 ---> (94+88)/2
4 16 89.66667 ---> (94+88+87)/3
5 19 96.66667 ---> (88+87+115)/3
I used dplyr
and in particular zoo::rollaply
function to get the values displayed above, the code is the following:
library(dplyr)
library(zoo)
sub<- df %>%
filter(H.Team == "CLE" | A.Team == "CLE") %>%
mutate(avpts = lag(rollapply(ifelse(H.Team == "CLE", H.PTS,A.PTS), width=3, FUN=mean, align="right", fill=NA, partial=1))) %>%
select(gmID,avpts)
sub
I did this just for a team and a variable, but I could easily do it for more variables specifying it in mutate()
, like this:
mutate(avpts = lag(rollapply(ifelse(H.Team == "CLE", H.PTS,A.PTS), width=3, FUN=mean, align="right", fill=NA, partial=1)),
avast = lag(rollapply(ifelse(H.Team == "CLE", H.AST,A.AST), width=3, FUN=mean, align="right", fill=NA, partial=1)))
The problem is that I should do this for other 50 variables, and above all I need to calculate values for all the teams, not just for one. Furthermore I obtained a column with the right values, but I don't know how to replace them in their "correct" position.
My idea to (partially) solve the problem is to wrap the code above in a function and then use another function from the apply
family to get values for all the teams, without using a for loop.
I wrote the following function:
avstats<- function(team) {
sub <- df %>%
filter(.data$H.Team == !!team | .data$A.Team == !!team) %>%
mutate(avpts = lag(rollapply(ifelse(H.Team == !!team, .data$H.PTS, .data$A.PTS),3,mean,align="right",fill=NA,partial=1))) %>%
select(.data$gmID, .data$avpts)
}
Finally i used lapply()
through a list of the teams that are in this small dataset.
teams <- c("CLE", "MIA", "LAL", "PHI", "DET", "CHI", "DAL", "UTA", "PHO", "POR", "WAS", "ORL", "CHA", "BOS", "ATL", "DEN", "IND", "HOU", "SAC", "OKC","MIL")
lapply(teams,avstats)
And everything seems working good with the two functions.
But there are still two main question that I'd like to get answered:
How can I get the values for all the others variable without writing a row for each stat? (i.e. how to get average assist, turnovers, ecc..)
How to "combine" and "put in the right place" the new average values generated in such a way that the original data structure are still the same?
Maybe I should modify my function avstats
adding some arguments and as consequence use another apply() function such as mapply()
, but i really don't know how to do.
Do you want this? (mean_run
from library(runner)
used).
.cols
argument of mutate(across...
k
in mean_run
as per choice.df %>% pivot_longer(!gmID, names_to = c("H_T", ".value"),
names_pattern = "(.+)\\.(.+)") %>%
group_by(Team) %>%
mutate(across(.cols = c(PTS, AST),
~ runner::mean_run(x = ., k = 3, lag = 1),
.names = '{.col}_av')) %>%
pivot_wider(id_cols = gmID,
names_from = H_T,
names_glue = "{H_T}_{.value}",
values_from = -c(gmID, H_T))
# A tibble: 20 x 11
gmID H_Team A_Team H_PTS A_PTS H_AST A_AST H_PTS_av A_PTS_av H_AST_av A_AST_av
<int> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 CLE WAS 94 84 22 26 NA NA NA NA
2 2 MIA BOS 120 107 25 24 NA NA NA NA
3 3 LAL DAL 91 99 24 22 NA NA NA NA
4 4 PHI DEN 84 75 18 19 NA NA NA NA
5 5 CLE IND 88 90 18 22 94 NA 22 NA
6 6 DET HOU 96 105 21 28 NA NA NA NA
7 7 CHI SAC 93 87 21 14 NA NA NA NA
8 8 DAL WAS 95 99 26 22 99 84 22 26
9 9 UTA DAL 113 94 24 20 NA 97 NA 24
10 10 PHO CLE 85 87 16 19 NA 91 NA 20
11 11 POR LAL 116 106 19 21 NA 91 NA 24
12 12 WAS OKC 86 84 27 18 91.5 NA 24 NA
13 13 ORL DEN 102 89 24 22 NA 75 NA 19
14 14 CHA IND 90 89 18 19 NA 90 NA 22
15 15 BOS MIL 88 99 22 26 107 NA 24 NA
16 16 CHI CLE 86 115 23 34 93 89.7 21 19.7
17 17 ATL HOU 102 109 23 22 NA 105 NA 28
18 18 DAL MIA 104 84 27 18 96 120 22.7 25
19 19 CLE UTA 88 86 23 19 96.7 113 23.7 24
20 20 WAS DEN 111 88 25 16 89.7 82 25 20.5