I am trying to build a binomial regression model with rstan.
The aim is to get the effect size bt between two conditions X in a group t in total and the effect size btg for subgroups g.
library(rstan)
df <- data.frame(hits=c(36,1261,36,1261,49,1248,17,7670,25,759,29,755),trials=c(118,53850,184,53784,209,53759,118,53850,184,53784,209,53759)
,X=rep(c(1,0),6),g=rep(rep(1:3, each=2),2),t=rep(1:2,each=6),tg=rep(1:6,each=2) )
stanIn <- list(Nt=length(unique(df$t)), #number of groups t
Nc=length(df$t), #number of rows
Ng=length(unique(df$g)), #number of subgroups g
Ntg=length(unique(df$tg)), #number of t and g combinations
N=df$trials,
n=df$hits,
X=df$X, #condition 1 or 0
t=df$t, #index of groups
g=df$g, #index of subgroups
tg=df$tg) #index of combinations between t and g
model <- stan(data=stanIn, file="minimal.stan", chains = 4)
with minimal.stan as below.
data {
int<lower=1> Nt;
int<lower=1> Nc;
int<lower=1> Ng;
int<lower=1> Ntg;
int<lower=1> N[Nc];
int<lower=0> n[Nc];
int<lower=0,upper=1> X[Nc];
int<lower=1> t[Nc];
int<lower=0> g[Nc];
int<lower=1> tg[Nc];
}
parameters {
real at[Nt]; // group intercepts
real bt[Nt]; // group slopes
real btg[Ntg]; // subgroup slopes
real atg[Ntg]; // subgroup intercept
}
transformed parameters {
vector[Nc] theta; // binomial probabilities
for (i in 1:Nc) { // linear model
theta[i] = inv_logit( (atg[tg[i]]+at[t[i]] ) + (bt[t[i]]+btg[tg[i]]) * X[i]);
//theta[i] = inv_logit(at[t[i]] + bt[t[i]] * X[i]); //group effect
//theta[i] = inv_logit(atg[tg[i]] + btg[tg[i]] * X[i]); //subgroup effects
}
}
model {
at ~ normal(0.0, 20.0);
bt ~ normal(0.0, 20.0);
atg ~ normal(0.0, 20.0);
btg ~ normal(0.0, 20.0);
n ~ binomial(N, theta);
}
I can model the overall group effect with the first commented line (in transformed parameters) and the subgroup effects with the second commented line. The idea was to combine both to get a group effect and the deviation from it for the individual group (first line).
However, this gives really weird results for bt and btg (A), and I was expecting something more like (B) (I can not recreate the behavior seen in A with a minimal example, this only occurs in the full dataset.)
If its not apparent from the type of question, I am completely new to statistic modeling and suspect that I have a conceptional error. So I would be grateful for any hint on this issue or a source to read up on those things (Feels like a common thing but I did not find anything).
I see several issues. Most directly, the model is not identified as the parameters atg
and btg
includes at
and bt
. I've changed those to ag
and bg
as they are your subgroup parameters, and indexed them as such, below:
library(rstan)
df <- data.frame(hits = c(36, 1261, 36, 1261, 49, 1248,
17, 7670, 25, 759, 29, 755),
trials = c(118, 53850, 184, 53784, 209, 53759,
118, 53850, 184, 53784, 209, 53759),
X = rep(c(1,0), 6),
g = rep(rep(1:3, each=2), 2),
t = rep(1:2, each=6),
tg = rep(1:6, each=2) )
stanIn <- list(Nt = length(unique(df$t)), #number of groups t
Nc = length(df$t), #number of rows
Ng = length(unique(df$g)), #number of subgroups g
N = df$trials,
n = df$hits,
X = df$X, #condition 1 or 0
t = df$t, #index of groups
g = df$g) #index of subgroups
model <- stan(data = stanIn, file = "minimal.stan",
cores = 4, chains = 4,
control = list(max_treedepth = 14))
with this modified Stan model samples without issues:
data {
int<lower=1> Nt;
int<lower=1> Nc;
int<lower=1> Ng;
int<lower=1> N[Nc];
int<lower=0> n[Nc];
vector<lower=0,upper=1>[Nc] X;
int<lower=1> t[Nc];
int<lower=0> g[Nc];
}
parameters {
vector<offset=0, multiplier=20>[Nt] at; // group intercepts
vector<offset=0, multiplier=20>[Nt] bt; // group slopes
vector<offset=0, multiplier=20>[Ng] ag; // subgroup intercepts
vector<offset=0, multiplier=20>[Ng] bg; // subgroup slopes
}
model {
at ~ normal(0.0, 20.0);
bt ~ normal(0.0, 20.0);
ag ~ normal(0.0, 20.0);
bg ~ normal(0.0, 20.0);
n ~ binomial_logit(N, ag[g] + at[t] + (bt[t] + bg[g]) .* X);
}
generated quantities {
vector[Nc] theta = inv_logit(ag[g] + at[t] + (bt[t] + bg[g]) .* X);
}
Of note, I had to use a high max_treedepth
to fit the model, but it's hard for me to comment on that without understanding the data. I've also moved theta
to generated quantities
in case you want those calculations, but the binomial_logit
handles this directly.
I've also setup noncentered parameters using offset
and multiplier
so that the Stan sampler has a better parameter space to sample from. Finally, I've re-coded the loops as vectors.