Search code examples
remmeansbinomial-coefficientsmarginal-effectspscl

Interpretation of hurdle models with estimated marginal means


My goal is to interpret the coefficients of a hurdle model through estimated marginal means. I prefer to interpret probabilities (back-transformed from the logit scale), rather than log-odds (model coefficients) or odd-ratio (exp(log-odds)). I would like to use emmeans() for this goal, as it is compatible with many models, and I have experience using it in linear models and binomial GLMs.

The hurdle part of the hurdle model yields the same coefficients as a binomial GLM, which has been commented elsewhere (here or here).

However, I don't fully understand why, depending on the setting, the estimated means from the hurdle do not match those from the binomial GLM. In particular

  • lin.pred = FALSE. This yields different probabilites from a binomial GLM. Following the emmeans documentation, I think they are averaged at the probability scale.
  • lin.pred = TRUE, type = "response". This setting yields the same probabilites as binomial GLM. Following the emmeans documentation, I think they are averaged at the logit scale.

I have two questions:

    1. Which setting would be preferrable, if one's goal is to interpret probabilities, rather than logits?
    1. What is the meaning of Inf degrees of freedom?

See below a reproducible example, using the number of papers produced by Biochemstry students during the least 3 years of PhD, focusing on a pairwise comparison between two factor levels (single vs married).

My preferred approach would be the one that matches the estimated probabilites from the binomial GLM. I would interpret that, married PhD students are as likely than single students, to publish at least one paper (= overcome the hurdle). Married students have a 74% chance of producing at least one paper, whereas single ones have a 67% change: this 7% change is a small difference, and not significant.

Many thanks in advance!

library(emmeans)
library(pscl)
#> Warning: package 'pscl' was built under R version 4.2.3
#> Classes and Methods for R developed in the
#> Political Science Computational Laboratory
#> Department of Political Science
#> Stanford University
#> Simon Jackman
#> hurdle and zeroinfl functions by Achim Zeileis
data("bioChemists", package = "pscl")
# str(bioChemists)
hurdle <- hurdle(art ~ ., data = bioChemists, dist = "poisson", zero.dist = "binomial", link = "logit")
glm <- glm(art > 0 ~ ., data = bioChemists, family = binomial(link = "logit"))

# Same coefficients
coef(summary(hurdle))$zero
#>                Estimate Std. Error    z value     Pr(>|z|)
#> (Intercept)  0.23679601 0.29551883  0.8012891 4.229643e-01
#> femWomen    -0.25115113 0.15910522 -1.5785222 1.144457e-01
#> marMarried   0.32623358 0.18081823  1.8042074 7.119880e-02
#> kid5        -0.28524872 0.11113043 -2.5667921 1.026441e-02
#> phd          0.02221940 0.07955710  0.2792887 7.800233e-01
#> ment         0.08012135 0.01301763  6.1548321 7.515710e-10
coef(glm)
#> (Intercept)    femWomen  marMarried        kid5         phd        ment 
#>  0.23679601 -0.25115113  0.32623358 -0.28524872  0.02221940  0.08012135

####### 
## lin.pred = FALSE
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = FALSE)
#> $emmeans
#>  mar     emmean    SE  df lower.CL upper.CL
#>  Single   0.806 0.167 903    0.478     1.13
#>  Married  0.858 0.122 903    0.619     1.10
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate    SE  df t.ratio p.value
#>  Single - Married  -0.0522 0.213 903  -0.245  0.8066
#> 
#> Results are averaged over the levels of: fem


####### 
## lin.pred = TRUE, type = "response"
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = TRUE, type = "response")
#> $emmeans
#>  mar      prob     SE  df lower.CL upper.CL
#>  Single  0.677 0.0306 903    0.615    0.734
#>  Married 0.744 0.0200 903    0.703    0.781
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the logit scale 
#> 
#> $contrasts
#>  contrast         odds.ratio   SE  df null t.ratio p.value
#>  Single / Married      0.722 0.13 903    1  -1.804  0.0715
#> 
#> Results are averaged over the levels of: fem 
#> Tests are performed on the log odds ratio scale


emmeans::emmeans(glm, specs = pairwise ~ mar, regrid = "response")
#> $emmeans
#>  mar     response     SE  df asymp.LCL asymp.UCL
#>  Single     0.677 0.0305 Inf     0.617     0.736
#>  Married    0.743 0.0201 Inf     0.704     0.783
#> 
#> Results are averaged over the levels of: fem 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate     SE  df z.ratio p.value
#>  Single - Married  -0.0667 0.0377 Inf  -1.772  0.0764
#> 
#> Results are averaged over the levels of: fem

#######
# Estimates and contrasts at the logit scale appear to match
emmeans::emmeans(hurdle, specs = pairwise ~ mar, mode = "zero", lin.pred = TRUE)
#> $emmeans
#>  mar     emmean    SE  df lower.CL upper.CL
#>  Single   0.741 0.140 903    0.466     1.02
#>  Married  1.068 0.105 903    0.862     1.27
#> 
#> Results are averaged over the levels of: fem 
#> Results are given on the logit (not the response) scale. 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate    SE  df t.ratio p.value
#>  Single - Married   -0.326 0.181 903  -1.804  0.0715
#> 
#> Results are averaged over the levels of: fem 
#> Results are given on the log odds ratio (not the response) scale.


emmeans::emmeans(glm, specs = pairwise ~ mar)
#> $emmeans
#>  mar     emmean    SE  df asymp.LCL asymp.UCL
#>  Single   0.741 0.140 Inf     0.467      1.02
#>  Married  1.068 0.105 Inf     0.862      1.27
#> 
#> Results are averaged over the levels of: fem 
#> Results are given on the logit (not the response) scale. 
#> Confidence level used: 0.95 
#> 
#> $contrasts
#>  contrast         estimate    SE  df z.ratio p.value
#>  Single - Married   -0.326 0.181 Inf  -1.804  0.0712
#> 
#> Results are averaged over the levels of: fem 
#> Results are given on the log odds ratio (not the response) scale.


sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] pscl_1.5.5.1          emmeans_1.8.5-9000004
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_4.2.0      highr_0.9           tools_4.2.0        
#>  [4] digest_0.6.29       evaluate_0.15       lifecycle_1.0.3    
#>  [7] lattice_0.20-45     rlang_1.1.1         reprex_2.0.2       
#> [10] Matrix_1.4-1        cli_3.6.1           rstudioapi_0.13    
#> [13] yaml_2.3.5          mvtnorm_1.1-3       xfun_0.40          
#> [16] fastmap_1.1.0       coda_0.19-4         withr_2.5.0        
#> [19] stringr_1.5.0       knitr_1.39          fs_1.5.2           
#> [22] vctrs_0.6.3         grid_4.2.0          glue_1.6.2         
#> [25] survival_3.3-1      rmarkdown_2.14      multcomp_1.4-19    
#> [28] TH.data_1.1-1       magrittr_2.0.3      codetools_0.2-18   
#> [31] htmltools_0.5.6     splines_4.2.0       MASS_7.3-57        
#> [34] xtable_1.8-4        numDeriv_2016.8-1.1 sandwich_3.0-1     
#> [37] estimability_1.4.1  stringi_1.7.6       zoo_1.8-10

Created on 2023-09-06 with reprex v2.0.2


Solution

  • Russell Lenth (developper of the emmeans package), provided an answer over at GitHub. I paste it here, with a comparison between a hurdle model fitted with emmeans and glmmTMB, which show consistent results.

    library(emmeans)
    #> Warning: package 'emmeans' was built under R version 4.2.3
    library(glmmTMB)
    #> Warning: package 'glmmTMB' was built under R version 4.2.3
    #> Warning in checkMatrixPackageVersion(): Package version inconsistency detected.
    #> TMB was built with Matrix version 1.6.1
    #> Current Matrix version is 1.4.1
    #> Please re-install 'TMB' from source using install.packages('TMB', type = 'source') or ask CRAN for a binary version of 'TMB' matching CRAN's 'Matrix' package
    library(pscl)
    #> Warning: package 'pscl' was built under R version 4.2.3
    #> Classes and Methods for R developed in the
    #> Political Science Computational Laboratory
    #> Department of Political Science
    #> Stanford University
    #> Simon Jackman
    #> hurdle and zeroinfl functions by Achim Zeileis
    
    # Create datasets: number of papers during a Phd
    data("bioChemists", package = "pscl")
    
    # Declare models
    pscl.hurdle <- pscl::hurdle(art ~ fem + mar, data = bioChemists, dist = "poisson", zero.dist = "binomial", link = "logit")
    glmmtmb.hurdle <- glmmTMB::glmmTMB(art ~ fem + mar, data = bioChemists, family = truncated_poisson(link = "log"), zi = ~ fem + mar)
    
    # Truncated count
    emmeans(pscl.hurdle, ~ mar, mode = "count", lin.pred = TRUE, type = "response")
    #>  mar     count     SE  df lower.CL upper.CL
    #>  Single   2.07 0.1122 909     1.86     2.31
    #>  Married  2.09 0.0807 909     1.94     2.26
    #> 
    #> Results are averaged over the levels of: fem 
    #> Confidence level used: 0.95 
    #> Intervals are back-transformed from the log scale
    emmeans(glmmtmb.hurdle, ~ mar, comp = "cond", type = "response")
    #>  mar     rate     SE  df asymp.LCL asymp.UCL
    #>  Single  2.07 0.1122 Inf      1.86      2.30
    #>  Married 2.09 0.0807 Inf      1.94      2.26
    #> 
    #> Results are averaged over the levels of: fem 
    #> Confidence level used: 0.95 
    #> Intervals are back-transformed from the log scale
    
    # Binomial
    emmeans(pscl.hurdle, ~ mar, mode = "prob0")
    #>  mar     emmean     SE  df lower.CL upper.CL
    #>  Single   0.313 0.0265 909    0.261    0.365
    #>  Married  0.297 0.0191 909    0.259    0.334
    #> 
    #> Results are averaged over the levels of: fem 
    #> Confidence level used: 0.95
    emmeans(glmmtmb.hurdle, ~ mar, comp = "zi", type = "response")
    #>  mar     response     SE  df asymp.LCL asymp.UCL
    #>  Single     0.313 0.0267 Inf     0.263     0.367
    #>  Married    0.296 0.0190 Inf     0.261     0.335
    #> 
    #> Results are averaged over the levels of: fem 
    #> Confidence level used: 0.95 
    #> Intervals are back-transformed from the logit scale
    
    # Response (binomial * truncated count)
    emmeans(pscl.hurdle, ~ mar, mode = "response")
    #>  mar     emmean     SE  df lower.CL upper.CL
    #>  Single    1.64 0.0900 909     1.47     1.82
    #>  Married   1.69 0.0633 909     1.57     1.82
    #> 
    #> Results are averaged over the levels of: fem 
    #> Confidence level used: 0.95
    emmeans(glmmtmb.hurdle, ~ mar, comp = "response")
    #>  mar     emmean     SE  df asymp.LCL asymp.UCL
    #>  Single    1.64 0.0900 Inf      1.47      1.82
    #>  Married   1.69 0.0633 Inf      1.57      1.82
    #> 
    #> Results are averaged over the levels of: fem 
    #> Confidence level used: 0.95
    
    sessionInfo()
    #> R version 4.2.0 (2022-04-22 ucrt)
    #> Platform: x86_64-w64-mingw32/x64 (64-bit)
    #> Running under: Windows 10 x64 (build 22000)
    #> 
    #> Matrix products: default
    #> 
    #> locale:
    #> [1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
    #> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
    #> [5] LC_TIME=Spanish_Spain.utf8    
    #> 
    #> attached base packages:
    #> [1] stats     graphics  grDevices utils     datasets  methods   base     
    #> 
    #> other attached packages:
    #> [1] pscl_1.5.5.1  glmmTMB_1.1.7 emmeans_1.8.8
    #> 
    #> loaded via a namespace (and not attached):
    #>  [1] Rcpp_1.0.10         nloptr_2.0.1        compiler_4.2.0     
    #>  [4] highr_0.9           TMB_1.9.6           tools_4.2.0        
    #>  [7] boot_1.3-28         lme4_1.1-29         digest_0.6.29      
    #> [10] nlme_3.1-157        evaluate_0.15       lifecycle_1.0.3    
    #> [13] lattice_0.20-45     rlang_1.1.1         reprex_2.0.2       
    #> [16] Matrix_1.4-1        cli_3.6.1           rstudioapi_0.13    
    #> [19] yaml_2.3.5          mvtnorm_1.1-3       xfun_0.40          
    #> [22] fastmap_1.1.0       coda_0.19-4         withr_2.5.0        
    #> [25] stringr_1.5.0       knitr_1.39          fs_1.5.2           
    #> [28] vctrs_0.6.3         grid_4.2.0          glue_1.6.2         
    #> [31] survival_3.3-1      rmarkdown_2.14      multcomp_1.4-19    
    #> [34] minqa_1.2.4         TH.data_1.1-1       magrittr_2.0.3     
    #> [37] codetools_0.2-18    htmltools_0.5.6     splines_4.2.0      
    #> [40] MASS_7.3-57         xtable_1.8-4        numDeriv_2016.8-1.1
    #> [43] sandwich_3.0-1      estimability_1.4.1  stringi_1.7.6      
    #> [46] zoo_1.8-10
    

    Created on 2023-09-14 with reprex v2.0.2