Search code examples
bayesianrstan

binomial regression with group effects


I am trying to build a binomial regression model with rstan.

The aim is to get the effect size bt between two conditions X in a group t in total and the effect size btg for subgroups g.

library(rstan)

df <- data.frame(hits=c(36,1261,36,1261,49,1248,17,7670,25,759,29,755),trials=c(118,53850,184,53784,209,53759,118,53850,184,53784,209,53759)
                ,X=rep(c(1,0),6),g=rep(rep(1:3, each=2),2),t=rep(1:2,each=6),tg=rep(1:6,each=2) )

stanIn <- list(Nt=length(unique(df$t)), #number of groups t
            Nc=length(df$t), #number of rows
            Ng=length(unique(df$g)), #number of subgroups g
            Ntg=length(unique(df$tg)), #number of t and g combinations
            N=df$trials, 
            n=df$hits,
            X=df$X, #condition 1 or 0
            t=df$t, #index of groups
            g=df$g, #index of subgroups
            tg=df$tg) #index of combinations between t and g

model <- stan(data=stanIn, file="minimal.stan", chains = 4)

with minimal.stan as below.

data {
  int<lower=1> Nt; 
  int<lower=1> Nc; 
  int<lower=1> Ng; 
  int<lower=1> Ntg; 
  int<lower=1> N[Nc];
  int<lower=0> n[Nc]; 
  int<lower=0,upper=1> X[Nc]; 
  int<lower=1> t[Nc]; 
  int<lower=0> g[Nc]; 
  int<lower=1> tg[Nc]; 
}

parameters {
  real at[Nt]; // group intercepts 
  real bt[Nt]; // group slopes  
  real btg[Ntg]; // subgroup slopes 
  real atg[Ntg]; // subgroup intercept
}

transformed parameters {
  vector[Nc] theta; // binomial probabilities

  for (i in 1:Nc) { // linear model
    theta[i] = inv_logit( (atg[tg[i]]+at[t[i]] ) + (bt[t[i]]+btg[tg[i]]) * X[i]);
    //theta[i] = inv_logit(at[t[i]] + bt[t[i]] * X[i]); //group effect
    //theta[i] = inv_logit(atg[tg[i]] + btg[tg[i]] * X[i]); //subgroup effects
  }
}

model {
  at ~ normal(0.0, 20.0);
  bt ~ normal(0.0, 20.0);  
  atg ~ normal(0.0, 20.0);
  btg ~ normal(0.0, 20.0);
  n ~ binomial(N, theta);
}

I can model the overall group effect with the first commented line (in transformed parameters) and the subgroup effects with the second commented line. The idea was to combine both to get a group effect and the deviation from it for the individual group (first line).

However, this gives really weird results for bt and btg (A), and I was expecting something more like (B) (I can not recreate the behavior seen in A with a minimal example, this only occurs in the full dataset.)

enter image description here

If its not apparent from the type of question, I am completely new to statistic modeling and suspect that I have a conceptional error. So I would be grateful for any hint on this issue or a source to read up on those things (Feels like a common thing but I did not find anything).


Solution

  • I see several issues. Most directly, the model is not identified as the parameters atg and btg includes at and bt. I've changed those to ag and bg as they are your subgroup parameters, and indexed them as such, below:

    library(rstan)
    
    df <- data.frame(hits   = c(36, 1261, 36, 1261, 49, 1248, 
                                17, 7670, 25, 759, 29, 755),
                     trials = c(118, 53850, 184, 53784, 209, 53759, 
                                118, 53850, 184, 53784, 209, 53759),
                     X  = rep(c(1,0), 6),
                     g  = rep(rep(1:3, each=2), 2),
                     t  = rep(1:2, each=6),
                     tg = rep(1:6, each=2) )
    
    stanIn <- list(Nt = length(unique(df$t)), #number of groups t
                   Nc = length(df$t),         #number of rows
                   Ng = length(unique(df$g)), #number of subgroups g
                   N  = df$trials, 
                   n  = df$hits,
                   X  = df$X,                 #condition 1 or 0
                   t  = df$t,                 #index of groups
                   g  = df$g)                 #index of subgroups
    
    model <- stan(data = stanIn, file = "minimal.stan", 
                  cores = 4, chains = 4,
                  control = list(max_treedepth = 14))
    

    with this modified Stan model samples without issues:

    data {
      int<lower=1> Nt; 
      int<lower=1> Nc; 
      int<lower=1> Ng; 
      int<lower=1> N[Nc];
      int<lower=0> n[Nc]; 
      vector<lower=0,upper=1>[Nc] X; 
      int<lower=1> t[Nc]; 
      int<lower=0> g[Nc]; 
    }
    
    parameters {
      vector<offset=0, multiplier=20>[Nt] at; // group intercepts 
      vector<offset=0, multiplier=20>[Nt] bt; // group slopes  
      vector<offset=0, multiplier=20>[Ng] ag; // subgroup intercepts
      vector<offset=0, multiplier=20>[Ng] bg; // subgroup slopes
    }
    
    model {
      at ~ normal(0.0, 20.0);
      bt ~ normal(0.0, 20.0);  
      ag ~ normal(0.0, 20.0);
      bg ~ normal(0.0, 20.0);
      n  ~ binomial_logit(N, ag[g] + at[t] + (bt[t] + bg[g]) .* X);
    }
    
    generated quantities {
      vector[Nc] theta = inv_logit(ag[g] + at[t] + (bt[t] + bg[g]) .* X);
    }
    

    Of note, I had to use a high max_treedepth to fit the model, but it's hard for me to comment on that without understanding the data. I've also moved theta to generated quantities in case you want those calculations, but the binomial_logit handles this directly.

    I've also setup noncentered parameters using offset and multiplier so that the Stan sampler has a better parameter space to sample from. Finally, I've re-coded the loops as vectors.