Search code examples
rdplyrtidyrgam

programmatically create a dummy for each possible value of a variable, and pass these dummies to a formula


I've found NHL shift data and would like to evaluate a model where goals scored would follow be a poisson distribution depending on who is on the ice for both teams.

My point is that we already have a good idea of who can score points (goals and assists), but maybe someone is really good at helping his team score without getting on the score sheet (maybe by generating turnovers?) or is just super good at preventing the other team from scoring.

I am able create a dataset that looks like "data" below. There are normally 5 players on the ice for each team, but I only put 2 to make the example digestible.

Basically, I have a line per shift, I know the outcome of the shift (goal_for), the shift_duration and I have a list of the ID's of the players playing for the team (for_players) and in the opposing team (against_players).

What I would like to do is take the "data" dataset and create the "model_data" with one dummy variable indicating whether a player is on the ice for a given shift. Then I would create a formula for my poisson model that would include all the dummies and pass it to the model. I could also drop one dummy for and one dummy against, but I can also let mgcv:gam do it for me.

I suspect this will involve some !! and quos(), but I am not sure how to do it.

data <- tibble(
  shift_id = c(1, 2, 3, 4, 5, 6, 7, 8,9,10),
  shift_duration = c(12, 7, 30, 11, 14, 16, 19, 32,11,12),
  goal_for = c(1, 1, 0, 0, 1, 1, 0, 0,0,0),
  for_players = list(
    c("A", "B"),
    c("A", "C"),
    c("B", "C"),
    c("A", "C"),
    c("B", "C"),
    c("A", "B"),
    c("B", "C"),
    c("A", "B"),
    c("B", "C"),
    c("A", "B")
  ),
  against_players = list(
    c("X", "Z"),
    c("Y", "Z"),
    c("X", "Y"),
    c("X", "Y"),
    c("X", "Z"),
    c("Y", "Z"),
    c("X", "Y"),
    c("Y", "Z"),
    c("X", "Y"),
    c("Y", "Z")
  )
)


(black magic goes here)

model_data <- tibble(
  shift_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  shift_duration = c(12, 7, 30, 11, 14, 16, 19, 32, 11, 12),
  goal_for = c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0),
  for_player_A = c(1, 1, 0, 1, 0, 1, 0, 1, 0, 1),
  for_player_B = c(1, 0, 1, 0, 1, 1, 1, 1, 1, 1),
  for_player_C = c(0, 1, 1, 1, 1, 0, 1, 0, 1, 0),
  against_player_X = c(1, 0, 1, 1, 1, 0, 1, 0, 1, 0),
  against_player_Y = c(0, 1, 1, 1, 0, 1, 1, 1, 1, 1),
  against_player_Z = c(1, 1, 0, 0, 1, 1, 0, 1, 0, 1)
)



mod.gam <- mgcv::gam(
  data = model_data,
  formula =  goal_for ~ offset(log(shift_duration)) + for_player_A + for_player_B  + for_player_C +
    against_player_X + against_player_Y + against_player_Z,
  family = poisson(link = log)
)

data looks like this:

> data
# A tibble: 10 x 5
   shift_id shift_duration goal_for for_players against_players
      <dbl>          <dbl>    <dbl> <list>      <list>         
 1     1.00          12.0      1.00 <chr [2]>   <chr [2]>      
 2     2.00           7.00     1.00 <chr [2]>   <chr [2]>      
 3     3.00          30.0      0    <chr [2]>   <chr [2]>      
 4     4.00          11.0      0    <chr [2]>   <chr [2]>      
 5     5.00          14.0      1.00 <chr [2]>   <chr [2]>      
 6     6.00          16.0      1.00 <chr [2]>   <chr [2]>      
 7     7.00          19.0      0    <chr [2]>   <chr [2]>      
 8     8.00          32.0      0    <chr [2]>   <chr [2]>      
 9     9.00          11.0      0    <chr [2]>   <chr [2]>      
10    10.0           12.0      0    <chr [2]>   <chr [2]>

Model data looks like this:

> model_data
# A tibble: 10 x 9
   shift_id shift_duration goal_for for_player_A for_player_B for_player_C against_player_X against_player_Y against_player_Z
      <dbl>          <dbl>    <dbl>        <dbl>        <dbl>        <dbl>            <dbl>            <dbl>            <dbl>
 1     1.00          12.0      1.00         1.00         1.00         0                1.00             0                1.00
 2     2.00           7.00     1.00         1.00         0            1.00             0                1.00             1.00
 3     3.00          30.0      0            0            1.00         1.00             1.00             1.00             0   
 4     4.00          11.0      0            1.00         0            1.00             1.00             1.00             0   
 5     5.00          14.0      1.00         0            1.00         1.00             1.00             0                1.00
 6     6.00          16.0      1.00         1.00         1.00         0                0                1.00             1.00
 7     7.00          19.0      0            0            1.00         1.00             1.00             1.00             0   
 8     8.00          32.0      0            1.00         1.00         0                0                1.00             1.00
 9     9.00          11.0      0            0            1.00         1.00             1.00             1.00             0   
10    10.0           12.0      0            1.00         1.00         0                0                1.00             1.00

Results from the model:

Family: poisson 
Link function: log 

Formula:
goal_for ~ offset(log(shift_duration)) + for_player_A + for_player_B + 
    for_player_C + against_player_X + against_player_Y + against_player_Z

Parametric coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)       -22.0296  4317.9341  -0.005    0.996
for_player_A        0.0000     0.0000      NA       NA
for_player_B       -2.3026     2.0000  -1.151    0.250
for_player_C       -0.1542     1.4142  -0.109    0.913
against_player_X    1.6094     1.4142   1.138    0.255
against_player_Y    0.0000     0.0000      NA       NA
against_player_Z   20.2378  4317.9339   0.005    0.996


Rank: 5/7
R-sq.(adj) =  0.353   Deviance explained = 73.6%
UBRE = 0.26435  Scale est. = 1         n = 10

Solution

  • You can convert your data data frame to your model_data data frame using functions from tidyr...

    library(dplyr)
    library(tidyr)
    
    model_data <-
      data %>% 
      unnest(for_players, .drop = F) %>% 
      spread(for_players, for_players, sep = '_') %>% 
      unnest(against_players, .drop = F) %>% 
      spread(against_players, against_players, sep = '_') %>% 
      mutate_at(vars(-(1:3)), funs(as.numeric(!is.na(.))))