Search code examples
rbayesianjags

Failed to set trace monitor for X. Variable X not found


I'm trying to implement a Bayesian model for football score prediction (English Premier League). I assume the number of scores follows the Poisson distribution with parameter lambda and this is the general formula:

log(home_lambda) = mu + hfa + home_attack - away_def

log(away_lambda) = mu + away_attack - home_def

where mu and hfa are the intercept and home field advantage respectively. And this is my data (380 x 8) enter image description here

So basically I have 380x2 equations and 42 parameters (mu, hfa, attack ability of 20 teams and defence ability of 20 teams). Take the first match as an example:

FTHG[1] ~ dpois(home_lambda[1]) # FTHG is full time home goal
log(home_lambda[1]) = mu + hfa + att_Crystal_Palace - att_Arsenal

FTAG[1] ~ dpois(lam_a[1]) # FTAG is full time away goal
log(away_lambda[1]) = mu + att_Arsenal - def_Crystal_Palace

Below is my entire code

bayes_mod_string = " model{
  
  for (i in 1:380){
  
    FTHG[i] ~ dpois(lam_h[i])
    log(lam_h[i]) = mu + hfa + h_att[i] - a_def[i]
    FTAG[i] ~ dpois(lam_a[i])
    log(lam_a[i]) = mu + a_att[i] - h_def[i]
    
    h_att[i] ~ dnorm(0, 1/1e6) # the attack and defence ability of each team follows a Normal distribution
    a_def[i] ~ dnorm(0, 1/1e6)
    a_att[i] ~ dnorm(0, 1/1e6)
    h_def[i] ~ dnorm(0, 1/1e6)

  }
  
  mu ~ dnorm(0, 1/1e6)
  hfa ~ dnorm(0, 1/1e6)
  
}"

data_jags = as.list(test_set)

params = c('mu', 'hfa', unique(test_set$h_att), unique(test_set$a_def)) # 42 in total

bayes_mod_string = jags.model(textConnection(bayes_mod_string), 
                              data = data_jags, n.chains = 3)
update(bayes_mod_string, 1e3)

bayes_mod_sim = coda.samples(model = bayes_mod_string,
                             variable.names = params, n.iter = 1e3)
bayes_mod_csim = as.mcmc(do.call(rbind, bayes_mod_sim))

The error message is

Warning: Failed to set trace monitor for att_Crystal_Palace Variable att_Crystal_Palace not found

Not only with the "att_Crystal_Palace" parameter but also with all parameters.

I tried typing out each team attack and defence ability separately, putting them in and out the for loop

att_Crystal_Palace  ~ dnorm(0, 1/1e6)
  att_Fulham  ~ dnorm(0, 1/1e6)
  att_Bournemouth  ~ dnorm(0, 1/1e6)
  att_Leeds  ~ dnorm(0, 1/1e6)
  att_Newcastle  ~ dnorm(0, 1/1e6)
  att_Tottenham  ~ dnorm(0, 1/1e6)
  att_Everton  ~ dnorm(0, 1/1e6)
  att_Leicester  ~ dnorm(0, 1/1e6)  
  att_Man_United  ~ dnorm(0, 1/1e6)
  att_West_Ham  ~ dnorm(0, 1/1e6)
  att_Aston_Villa  ~ dnorm(0, 1/1e6)
  att_Arsenal  ~ dnorm(0, 1/1e6)
  att_Brighton  ~ dnorm(0, 1/1e6)
  att_Man_City  ~ dnorm(0, 1/1e6)
  att_Southampton  ~ dnorm(0, 1/1e6)
  att_Wolves  ~ dnorm(0, 1/1e6)
  att_Brentford  ~ dnorm(0, 1/1e6)
  att_Nottm_Forest  ~ dnorm(0, 1/1e6)
  att_Chelsea  ~ dnorm(0, 1/1e6)
  att_Liverpool ~ dnorm(0, 1/1e6)
  
  def_Crystal_Palace  ~ dnorm(0, 1/1e6)
  def_Fulham  ~ dnorm(0, 1/1e6)
  def_Bournemouth  ~ dnorm(0, 1/1e6)
  def_Leeds  ~ dnorm(0, 1/1e6)
  def_Newcastle  ~ dnorm(0, 1/1e6)
  def_Tottenham  ~ dnorm(0, 1/1e6)
  def_Everton  ~ dnorm(0, 1/1e6)
  def_Leicester  ~ dnorm(0, 1/1e6)  
  def_Man_United  ~ dnorm(0, 1/1e6)
  def_West_Ham  ~ dnorm(0, 1/1e6)
  def_Aston_Villa  ~ dnorm(0, 1/1e6)
  def_Arsenal  ~ dnorm(0, 1/1e6)
  def_Brighton  ~ dnorm(0, 1/1e6)
  def_Man_City  ~ dnorm(0, 1/1e6)
  def_Southampton  ~ dnorm(0, 1/1e6)
  def_Wolves  ~ dnorm(0, 1/1e6)
  def_Brentford  ~ dnorm(0, 1/1e6)
  def_Nottm_Forest  ~ dnorm(0, 1/1e6)
  def_Chelsea  ~ dnorm(0, 1/1e6)
  def_Liverpool ~ dnorm(0, 1/1e6)

This time the error is

RUNTIME ERROR: Compilation error on line 50. Attempt to redefine node def_Liverpool[1]

What the problem actually is and how can I tackle it? If you'd like to get the data, it's here https://www.football-data.co.uk/englandm.php, season 2022/2023


Solution

  • Your primary problem is here

    log(lam_h[i]) = mu + hfa + h_att[i] - a_def[i]
    

    h_att[i] resolves to, say, att_Arsenal, as you say, but you don't define a node (variable) called att_Arsenal. You have similar problems for h_def, a_att and a_def.

    The easiest way to deal with this number of predictor variables is to use arrays rather than stand-alone scalars. I've used data from the 2023/24 season from the link you provided. My input data looks like this:

    df
    # A tibble: 39 × 7
       Div   Date       Time  HomeTeam AwayTeam  FTHG  FTAG
       <chr> <chr>      <chr>    <int>    <int> <int> <int>
     1 E0    11/08/2023 20:00        6       13     0     3
     2 E0    12/08/2023 12:30        1       16     2     1
     3 E0    12/08/2023 15:00        3       19     1     1
     4 E0    12/08/2023 15:00        5       12     4     1
     5 E0    12/08/2023 15:00        9       10     0     1
     6 E0    12/08/2023 15:00       17        8     0     1
     7 E0    12/08/2023 17:30       15        2     5     1
     8 E0    13/08/2023 14:00        4       18     2     2
     9 E0    13/08/2023 16:30        7       11     1     1
    10 E0    14/08/2023 20:00       14       20     1     0
    # … with 29 more rows
    

    where I have indexed both HomeTeam and AwayTeam alphabetically. So, Arsenal are 1 and Wolves are 20.

    Then I can write my model string as

    bayes_mod_string = " 
    data {
      mu ~ dnorm(0, 1/1e6)
      hfa ~ dnorm(0, 1/1e6)
      for (j in 1:nTeams) {
        h_att[j] ~ dnorm(0, 1/1e6) 
        a_def[j] ~ dnorm(0, 1/1e6)
        a_att[j] ~ dnorm(0, 1/1e6)
        h_def[j] ~ dnorm(0, 1/1e6)
        log(lam_h[j]) = mu + hfa + h_att[j] - a_def[j]
        log(lam_a[j]) = mu + a_att[j] - h_def[j]
      }
    }
    
    model {
      for (i in 1:nMatches) {
        FTHG[i] ~ dpois(lam_h[home[i]])
        FTAG[i] ~ dpois(lam_a[away[i]])
      }
    }
    "
    

    Note the double-indexing of lam_a and lam_h in the model block.

    I update the model and obtain samples with

    library(rjags)
    
    bayes_model = jags.model(textConnection(bayes_mod_string), 
                                  data = data_jags, n.chains = 3)
    data_jags <- list(
      nTeams = 20,
      nMatches = nrow(df),
      FTHG = df$FTHG,
      FTAG = df$FTAG,
      home = df$HomeTeam,
      away = df$AwayTeam
    )
    
    update(bayes_model, 1e3)
    
    samples <- jags.samples(
      model = bayes_model,
      variable.names = c('mu', 'hfa', paste0("h_att[", 1:20, "]"), paste0("a_def[", 1:20, "]")), 
      n.iter = 1e3
    )
    

    Note that I've changed coda.samples to jags.samples.

    I omit printing samples because the output is rather long.

    Personally, I prefer runjags to rjags. You can use the model string directly rather than passing through a text connection or writing to a file, the model fitting code is more compact and I feel the output is both more compact and more informative.

    library(runjags)
    
    run.jags(
      bayes_mod_string, 
      data = data_jags,
      monitor = c('mu', 'hfa', paste0("h_att[", 1:20, "]"), paste0("a_def[", 1:20, "]"))
    )
    Compiling rjags model...
    Calling the simulation using the rjags method...
    Note: the model did not require adaptation
    Burning in the model for 4000 iterations...
      |**************************************************| 100%
    Running the model for 10000 iterations...
      |**************************************************| 100%
    Simulation complete
    Calculating summary statistics...
    Note: The monitored variables 'mu', 'hfa', 'h_att[1]', 'h_att[2]', 'h_att[3]', 'h_att[4]', 'h_att[5]', 'h_att[6]', 'h_att[7]', 'h_att[8]', 'h_att[9]',
    'h_att[10]', 'h_att[11]', 'h_att[12]', 'h_att[13]', 'h_att[14]', 'h_att[15]', 'h_att[16]', 'h_att[17]', 'h_att[18]', 'h_att[19]', 'h_att[20]', 'a_def[1]',
    'a_def[2]', 'a_def[3]', 'a_def[4]', 'a_def[5]', 'a_def[6]', 'a_def[7]', 'a_def[8]', 'a_def[9]', 'a_def[10]', 'a_def[11]', 'a_def[12]', 'a_def[13]',
    'a_def[14]', 'a_def[15]', 'a_def[16]', 'a_def[17]', 'a_def[18]', 'a_def[19]' and 'a_def[20]' appear to be non-stochastic; they will not be included in the
    convergence diagnostic
    Finished running the simulation
    
    JAGS model summary statistics from 20000 samples (chains = 2; adapt+burnin = 5000):
                                                                             
              Lower95 Upper95    Mean SD Mode MCerr MC%ofSD SSeff Lag.10 psrf
    mu         827.99  827.99  827.99  0   --    --      --    --     --   --
    hfa        1122.6  1122.6  1122.6  0   --    --      --    --     --   --
    h_att[1]   287.62  287.62  287.62  0   --    --      --    --     --   --
    h_att[2]  -696.89 -696.89 -696.89  0   --    --      --    --     --   --
    h_att[3]   332.89  332.89  332.89  0   --    --      --    --     --   --
    h_att[4]   941.39  941.39  941.39  0   --    --      --    --     --   --
    h_att[5]   795.37  795.37  795.37  0   --    --      --    --     --   --
    <further output omitted>
    

    IMPORTANT The lack of variability in the output indicates a problem with either the model specification or my implementation. Personally, I'd have fitted a hierarchical model. I don't have time to explore further right now, but at least I've given you clues about how to proceed.