I've found NHL shift data and would like to evaluate a model where goals scored would follow be a poisson distribution depending on who is on the ice for both teams.
My point is that we already have a good idea of who can score points (goals and assists), but maybe someone is really good at helping his team score without getting on the score sheet (maybe by generating turnovers?) or is just super good at preventing the other team from scoring.
I am able create a dataset that looks like "data" below. There are normally 5 players on the ice for each team, but I only put 2 to make the example digestible.
Basically, I have a line per shift, I know the outcome of the shift (goal_for), the shift_duration and I have a list of the ID's of the players playing for the team (for_players) and in the opposing team (against_players).
What I would like to do is take the "data" dataset and create the "model_data" with one dummy variable indicating whether a player is on the ice for a given shift. Then I would create a formula for my poisson model that would include all the dummies and pass it to the model. I could also drop one dummy for and one dummy against, but I can also let mgcv:gam do it for me.
I suspect this will involve some !! and quos(), but I am not sure how to do it.
data <- tibble(
shift_id = c(1, 2, 3, 4, 5, 6, 7, 8,9,10),
shift_duration = c(12, 7, 30, 11, 14, 16, 19, 32,11,12),
goal_for = c(1, 1, 0, 0, 1, 1, 0, 0,0,0),
for_players = list(
c("A", "B"),
c("A", "C"),
c("B", "C"),
c("A", "C"),
c("B", "C"),
c("A", "B"),
c("B", "C"),
c("A", "B"),
c("B", "C"),
c("A", "B")
),
against_players = list(
c("X", "Z"),
c("Y", "Z"),
c("X", "Y"),
c("X", "Y"),
c("X", "Z"),
c("Y", "Z"),
c("X", "Y"),
c("Y", "Z"),
c("X", "Y"),
c("Y", "Z")
)
)
(black magic goes here)
model_data <- tibble(
shift_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
shift_duration = c(12, 7, 30, 11, 14, 16, 19, 32, 11, 12),
goal_for = c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0),
for_player_A = c(1, 1, 0, 1, 0, 1, 0, 1, 0, 1),
for_player_B = c(1, 0, 1, 0, 1, 1, 1, 1, 1, 1),
for_player_C = c(0, 1, 1, 1, 1, 0, 1, 0, 1, 0),
against_player_X = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 0),
against_player_Y = c(0, 1, 1, 1, 0, 1, 1, 1, 1, 1),
against_player_Z = c(1, 1, 0, 0, 1, 1, 0, 1, 0, 1)
)
mod.gam <- mgcv::gam(
data = model_data,
formula = goal_for ~ offset(log(shift_duration)) + for_player_A + for_player_B + for_player_C +
against_player_X + against_player_Y + against_player_Z,
family = poisson(link = log)
)
data looks like this:
> data
# A tibble: 10 x 5
shift_id shift_duration goal_for for_players against_players
<dbl> <dbl> <dbl> <list> <list>
1 1.00 12.0 1.00 <chr [2]> <chr [2]>
2 2.00 7.00 1.00 <chr [2]> <chr [2]>
3 3.00 30.0 0 <chr [2]> <chr [2]>
4 4.00 11.0 0 <chr [2]> <chr [2]>
5 5.00 14.0 1.00 <chr [2]> <chr [2]>
6 6.00 16.0 1.00 <chr [2]> <chr [2]>
7 7.00 19.0 0 <chr [2]> <chr [2]>
8 8.00 32.0 0 <chr [2]> <chr [2]>
9 9.00 11.0 0 <chr [2]> <chr [2]>
10 10.0 12.0 0 <chr [2]> <chr [2]>
Model data looks like this:
> model_data
# A tibble: 10 x 9
shift_id shift_duration goal_for for_player_A for_player_B for_player_C against_player_X against_player_Y against_player_Z
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.00 12.0 1.00 1.00 1.00 0 1.00 0 1.00
2 2.00 7.00 1.00 1.00 0 1.00 0 1.00 1.00
3 3.00 30.0 0 0 1.00 1.00 1.00 1.00 0
4 4.00 11.0 0 1.00 0 1.00 1.00 1.00 0
5 5.00 14.0 1.00 0 1.00 1.00 1.00 0 1.00
6 6.00 16.0 1.00 1.00 1.00 0 0 1.00 1.00
7 7.00 19.0 0 0 1.00 1.00 1.00 1.00 0
8 8.00 32.0 0 1.00 1.00 0 0 1.00 1.00
9 9.00 11.0 0 0 1.00 1.00 1.00 1.00 0
10 10.0 12.0 0 1.00 1.00 0 0 1.00 1.00
Results from the model:
Family: poisson
Link function: log
Formula:
goal_for ~ offset(log(shift_duration)) + for_player_A + for_player_B +
for_player_C + against_player_X + against_player_Y + against_player_Z
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.0296 4317.9341 -0.005 0.996
for_player_A 0.0000 0.0000 NA NA
for_player_B -2.3026 2.0000 -1.151 0.250
for_player_C -0.1542 1.4142 -0.109 0.913
against_player_X 1.6094 1.4142 1.138 0.255
against_player_Y 0.0000 0.0000 NA NA
against_player_Z 20.2378 4317.9339 0.005 0.996
Rank: 5/7
R-sq.(adj) = 0.353 Deviance explained = 73.6%
UBRE = 0.26435 Scale est. = 1 n = 10
You can convert your data
data frame to your model_data
data frame using functions from tidyr
...
library(dplyr)
library(tidyr)
model_data <-
data %>%
unnest(for_players, .drop = F) %>%
spread(for_players, for_players, sep = '_') %>%
unnest(against_players, .drop = F) %>%
spread(against_players, against_players, sep = '_') %>%
mutate_at(vars(-(1:3)), funs(as.numeric(!is.na(.))))