Tags: r, glm, bayesian, mcmc, rstan

Inefficient sampling of simple binomial GLM using rstan MCMC


I'm trying to fit a binomial GLM using the rethinking package (draws on rstan MCMC).

The model fits, but the sampling is inefficient and Rhat indicates something went wrong. I don't understand the reason for this fitting problem.

This is the data:

d <- read_csv("https://gist.githubusercontent.com/sebastiansauer/a2519de39da49d70c4a9e7a10191cb97/raw/election.csv")
d <- as.data.frame(d)

This is the model:

m1_stan <- map2stan( 
 alist(
    afd_votes ~ dbinom(votes_total, p),
    logit(p) <- beta0 + beta1*foreigner_n,
    beta0 ~ dnorm(0, 10),
    beta1 ~ dnorm(0, 10)
  ),
  data = d, 
  WAIC = FALSE,
  iter = 1000)

Fit diagnostics (Rhat, number of effective samples) indicate that something went wrong:

      Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
beta0 -3.75      0      -3.75      -3.75     3 2.21
beta1  0.00      0       0.00       0.00    10 1.25

The traceplot does not show a "fat hairy caterpillar":

[traceplot of m1_stan]

The Stan output suggested increasing two parameters, adapt_delta and max_treedepth, which I did. That improved the sampling somewhat:

      Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
beta0 18.1   0.09      18.11      18.16    28 1.06
beta1  0.0   0.00       0.00       0.00    28 1.06

But as the traceplot shows, there's still something wrong:

[second traceplot]

The pairs plot looks strange, too:

[pairs plot]

What else I tried:

  • I centered/z-standardized the predictor (which produced this error: "Error in sampler$call_sampler(args_list[[i]]) : Initialization failed.")
  • I tried a Normal model (but it's count data)
  • I checked that there are no missings (there are none)
  • I increased the number of iterations to 4000; no improvement
  • I increased the sd of the priors (the model then takes ages to fit)

But nothing has helped so far. What could be the reason for the inefficient fitting? What else could I try?

Could the large counts in each observation be a problem?

 mean(d$afd_votes)
[1] 19655.83

Data excerpt:

 head(d)
  afd_votes votes_total foreigner_n
1     11647      170396       16100
2      9023      138075       12600
3     11176      130875       11000
4     11578      156268        9299
5     10390      150173       25099
6     11161      130514       13000

session info:

sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] viridis_0.5.1      viridisLite_0.3.0  sjmisc_2.7.3       pradadata_0.1.3    rethinking_1.59    rstan_2.17.3       StanHeaders_2.17.2 forcats_0.3.0      stringr_1.3.1     
[10] dplyr_0.7.6        purrr_0.2.5        readr_1.1.1        tidyr_0.8.1        tibble_1.4.2       ggplot2_3.0.0      tidyverse_1.2.1   

loaded via a namespace (and not attached):
 [1] httr_1.3.1         jsonlite_1.5       modelr_0.1.2       assertthat_0.2.0   stats4_3.5.0       cellranger_1.1.0   yaml_2.1.19        pillar_1.3.0       backports_1.1.2   
[10] lattice_0.20-35    glue_1.3.0         digest_0.6.15      rvest_0.3.2        snakecase_0.9.1    colorspace_1.3-2   htmltools_0.3.6    plyr_1.8.4         pkgconfig_2.0.1   
[19] broom_0.5.0        haven_1.1.2        bookdown_0.7       mvtnorm_1.0-8      scales_0.5.0       stringdist_0.9.5.1 sjlabelled_1.0.12  withr_2.1.2        RcppTOML_0.1.3    
[28] lazyeval_0.2.1     cli_1.0.0          magrittr_1.5       crayon_1.3.4       readxl_1.1.0       evaluate_0.11      nlme_3.1-137       MASS_7.3-50        xml2_1.2.0        
[37] blogdown_0.8       tools_3.5.0        loo_2.0.0          data.table_1.11.4  hms_0.4.2          matrixStats_0.54.0 munsell_0.5.0      prediction_0.3.6   bindrcpp_0.2.2    
[46] compiler_3.5.0     rlang_0.2.1        grid_3.5.0         rstudioapi_0.7     labeling_0.3       rmarkdown_1.10     gtable_0.2.0       codetools_0.2-15   inline_0.3.15     
[55] curl_3.2           R6_2.2.2           gridExtra_2.3      lubridate_1.7.4    knitr_1.20         bindr_0.1.1        rprojroot_1.3-2    KernSmooth_2.23-15 stringi_1.2.4     
[64] Rcpp_0.12.18       tidyselect_0.2.4   xfun_0.3           coda_0.19-1       

Solution

  • Stan does better with unit-scale, uncorrelated parameters. From the Stan manual, §28.4 Model Conditioning and Curvature:

    Ideally, all parameters should be programmed so that they have unit scale and so that posterior correlation is reduced; together, these properties mean that there is no rotation or scaling required for optimal performance of Stan’s algorithms. For Hamiltonian Monte Carlo, this implies a unit mass matrix, which requires no adaptation as it is where the algorithm initializes. Riemannian Hamiltonian Monte Carlo performs this conditioning on the fly at every step, but such conditioning is very expensive computationally.

    In your case, beta1 is tied to foreigner_n, which is far from unit scale, so it is imbalanced relative to beta0. Additionally, since foreigner_n is not centered, both betas shift the location of p during sampling, hence the posterior correlation.
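A quick back-of-the-envelope check makes the imbalance concrete. This is a sketch: the typical foreigner_n value is read off the data excerpt above, not computed from the full data.

```r
# Rough scale check (foreigner_n magnitude taken from the data excerpt).
# With an unscaled predictor near 16,000, a one-prior-sd move in beta1
# swings the linear predictor by ~160,000 logits.
foreigner_typical <- 16000
beta1_prior_sd <- 10
logit_swing <- beta1_prior_sd * foreigner_typical
logit_swing   # 160000 logits per prior sd of beta1

# The logit scale saturates long before that:
plogis(10)    # already > 0.9999, so p is pinned at 0 or 1 almost everywhere
```

Under the dnorm(0, 10) prior, essentially all of the prior mass on beta1 maps p to degenerate values, which is exactly the kind of geometry HMC struggles to adapt to.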

    Standardizing yields a more tractable model. Transforming foreigner_n to be centered and on unit scale lets the model converge rapidly and yields high effective sample sizes. I'd also argue that the betas in this form are more interpretable: beta0 controls only the location of p, while beta1 concerns only how variation in foreigner_n explains variation in afd_votes/votes_total.

    library(readr)
    library(rethinking)
    
    d <- read_csv("https://gist.githubusercontent.com/sebastiansauer/a2519de39da49d70c4a9e7a10191cb97/raw/election.csv")
    d <- as.data.frame(d)
    d$foreigner_z <- scale(d$foreigner_n)
    
    m1 <- alist(
      afd_votes ~ dbinom(votes_total, p),
      logit(p) <- beta0 + beta1*foreigner_z,
      c(beta0, beta1) ~ dnorm(0, 1)
    )
    
    m1_stan <- map2stan(m1, data = d, WAIC = FALSE,
                        iter = 10000, chains = 4, cores = 4)
    

    Examining the sampling, we have

    > summary(m1_stan)
    Inference for Stan model: afd_votes ~ dbinom(votes_total, p). 
    4 chains, each with iter=10000; warmup=5000; thin=1;  
    post-warmup draws per chain=5000, total post-warmup draws=20000.
    
                  mean se_mean   sd         2.5%          25%          50%          75%        97.5% n_eff Rhat 
    beta0        -1.95    0.00 0.00        -1.95        -1.95        -1.95        -1.95        -1.95 16352    1 
    beta1        -0.24    0.00 0.00        -0.24        -0.24        -0.24        -0.24        -0.24 13456    1 
    dev      861952.93    0.02 1.97    861950.98    861951.50    861952.32    861953.73    861958.26  9348    1 
    lp__  -17523871.11    0.01 0.99 -17523873.77 -17523871.51 -17523870.80 -17523870.39 -17523870.13  9348    1
    
    Samples were drawn using NUTS(diag_e) at Sat Sep  1 11:48:55 2018.
    For each parameter, n_eff is a crude measure of effective sample size, 
    and Rhat is the potential scale reduction factor on split chains (at
    convergence, Rhat=1).
    

    And looking at the pairs plot, we see the correlation between the betas is reduced to 0.15:

    [pairs plot of the standardized model]
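With the real fit, that correlation can be checked directly from the posterior draws via `post <- extract.samples(m1_stan); cor(post$beta0, post$beta1)`. The snippet below illustrates the check on synthetic draws built to have the reported moments (the sds are made-up stand-ins):

```r
# Synthetic draws mimicking the standardized fit's posterior
# (means from the summary above; sds are illustrative placeholders).
set.seed(2)
n <- 20000
z1 <- rnorm(n)
z2 <- rnorm(n)
rho <- 0.15
beta0_draws <- -1.95 + 0.002 * z1
beta1_draws <- -0.24 + 0.002 * (rho * z1 + sqrt(1 - rho^2) * z2)
cor(beta0_draws, beta1_draws)  # close to 0.15, as in the pairs plot
```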


    Additional Analysis

    I originally intuited that the uncentered foreigner_n was the main issue. At the same time, I was slightly confused, because Stan uses HMC/NUTS, which I had understood to be fairly robust to correlated latent variables. However, the Stan manual notes practical problems with scale invariance due to numerical instability, which Michael Betancourt also comments on in a CrossValidated answer (though it is a rather old post). So I wanted to test whether centering or scaling was more impactful in improving the sampling.

    Centering Alone

    Centering still results in quite poor performance. Note the effective sample size is literally one effective sample per chain.

    library(readr)
    library(rethinking)
    
    d <- read_csv("https://gist.githubusercontent.com/sebastiansauer/a2519de39da49d70c4a9e7a10191cb97/raw/election.csv")
    d <- as.data.frame(d)
    d$foreigner_c <- d$foreigner_n - mean(d$foreigner_n)
    
    m2 <- alist(
      afd_votes ~ dbinom(votes_total, p),
      logit(p) <- beta0 + beta1*foreigner_c,
      c(beta0, beta1) ~ dnorm(0, 1)
    )
    
    m2_stan <- map2stan(m2, data = d, WAIC = FALSE,
                        iter = 10000, chains = 4, cores = 4)
    

    which yields

    Inference for Stan model: afd_votes ~ dbinom(votes_total, p).
    4 chains, each with iter=10000; warmup=5000; thin=1; 
    post-warmup draws per chain=5000, total post-warmup draws=20000.
    
                  mean   se_mean          sd         2.5%          25%          50%         75%        97.5% n_eff Rhat
    beta0        -0.64       0.4        0.75        -1.95        -1.29        -0.54         0.2         0.42     4 2.34
    beta1         0.00       0.0        0.00         0.00         0.00         0.00         0.0         0.00     4 2.35
    dev    18311608.99 8859262.1 17270228.21    861951.86   3379501.84  14661443.24  37563992.4  46468786.08     4 1.75
    lp__  -26248697.70 4429630.9  8635113.76 -40327285.85 -35874888.93 -24423614.49 -18782644.5 -17523870.54     4 1.75
    
    Samples were drawn using NUTS(diag_e) at Sun Sep  2 18:59:52 2018.
    For each parameter, n_eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor on split chains (at 
    convergence, Rhat=1).
    

    And still appears to have a problematic pairs plot:

    [pairs plot of the centered-only model]
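Why doesn't centering alone help more? Centering does orthogonalize the predictor against the intercept column of the design, as this toy check shows (the mean and sd of the synthetic x are made up to mimic foreigner_n's magnitude); what remains is purely a scale problem:

```r
# Toy design-matrix check with synthetic x mimicking foreigner_n's magnitude.
set.seed(1)
x <- rnorm(100, mean = 15000, sd = 12000)
X_raw <- cbind(1, x)
X_c   <- cbind(1, x - mean(x))

# The (1,2) entry of t(X) %*% X couples intercept and slope:
crossprod(X_raw)[1, 2]  # huge: beta0 and beta1 are entangled
crossprod(X_c)[1, 2]    # ~0: centered x is orthogonal to the intercept

# But the column scales remain wildly different (0 vs ~12,000), so the
# posterior is still badly conditioned after centering alone:
apply(X_c, 2, sd)
```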

    Scaling Alone

    Scaling vastly improves the sampling! Even though the resulting posteriors still have rather high correlation, the effective sample sizes are in acceptable ranges, although well below those of full standardization.

    library(readr)
    library(rethinking)
    
    d <- read_csv("https://gist.githubusercontent.com/sebastiansauer/a2519de39da49d70c4a9e7a10191cb97/raw/election.csv")
    d <- as.data.frame(d)
    d$foreigner_s <- d$foreigner_n / sd(d$foreigner_n)
    
    m3 <- alist(
      afd_votes ~ dbinom(votes_total, p),
      logit(p) <- beta0 + beta1*foreigner_s,
      c(beta0, beta1) ~ dnorm(0, 1)
    )
    
    m3_stan <- map2stan(m3, data = d, WAIC = FALSE,
                        iter = 10000, chains = 4, cores = 4)
    

    yielding

    Inference for Stan model: afd_votes ~ dbinom(votes_total, p).
    4 chains, each with iter=10000; warmup=5000; thin=1; 
    post-warmup draws per chain=5000, total post-warmup draws=20000.
    
                  mean se_mean   sd         2.5%          25%          50%          75%        97.5% n_eff Rhat
    beta0        -1.58    0.00 0.00        -1.58        -1.58        -1.58        -1.58        -1.57  5147    1
    beta1        -0.24    0.00 0.00        -0.24        -0.24        -0.24        -0.24        -0.24  5615    1
    dev      861952.93    0.03 2.01    861950.98    861951.50    861952.31    861953.69    861958.31  5593    1
    lp__  -17523870.45    0.01 1.00 -17523873.15 -17523870.83 -17523870.14 -17523869.74 -17523869.48  5591    1
    
    Samples were drawn using NUTS(diag_e) at Sun Sep  2 19:02:00 2018.
    For each parameter, n_eff is a crude measure of effective sample size,
    and Rhat is the potential scale reduction factor on split chains (at 
    convergence, Rhat=1).
    

    The pairs plot shows that there is still significant correlation:

    [pairs plot of the scaled-only model]

    Hence, while decorrelating the parameters does improve sampling, eliminating the scale imbalance is what matters most in this model.
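As a practical postscript, the standardized coefficients can be mapped back to the original foreigner_n scale after fitting. The betas below are the posterior means from the standardized fit above; mean_f and sd_f are placeholder values (in practice use mean(d$foreigner_n) and sd(d$foreigner_n)):

```r
# Back-transforming the standardized fit to the original scale.
# beta0_z/beta1_z: posterior means from the standardized model above.
# mean_f/sd_f: placeholders for mean(d$foreigner_n) and sd(d$foreigner_n).
beta0_z <- -1.95
beta1_z <- -0.24
mean_f  <- 15000
sd_f    <- 12000

# logit(p) = beta0_z + beta1_z * (x - mean_f) / sd_f
#          = (beta0_z - beta1_z * mean_f / sd_f) + (beta1_z / sd_f) * x
beta1_orig <- beta1_z / sd_f                 # slope per single foreigner
beta0_orig <- beta0_z - beta1_orig * mean_f  # intercept at x = 0
c(beta0_orig, beta1_orig)
```

So the same posterior can be reported on whichever scale is most interpretable, while the sampling itself happens in the well-conditioned standardized parameterization.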