Search code examples
rstanrstan

cmdstanR: inference on a bernouli parameter


I built a simple model using a bernouli distribution in R using cmdstanR.

The stan file:


data {
  int<lower=0> N;
  int<lower=0, upper=1> obs_data[N];
}

parameters {
  real<lower=0, upper=1> lambda;
}

model {
  target += beta_lpdf(lambda | 1,1);
  for (n in 1:N) {
    target += bernoulli_logit_lpmf(obs_data[n] | lambda);
  }
}

Then I created 4 bernouli draws, with number of samples as 10, 100, 1000 and 10000. I wanted to observe that with increasing number of data points, the uncertainty associated with the parameter goes down.

The r code is as follows:

extract_lambda_draws <- function(mod, obs_data, iter = 1) {

  dl <- list(N = length(obs_data), obs_data = obs_data)
  print(paste("Model build iteration: ", iter))

  fit <- mod$sample(data = dl, num_chains = 4, num_cores = 4)

  print("Model build competed ...")
  draws <- fit$draws()[,,1] %>% as_tibble() 
  return(round(draws,3))
}

num_tosses <- c(10, 100, 1000, 10000)

results <- tibble()

m <- cmdstan_model("coin-flip.stan")

for (i in num_tosses) {
  coin_tosses <- sample(c(0,1), i, replace = T, prob = c(0.4, 0.6))
  d <- extract_lambda_draws(m, coin_tosses, i)
  d <- d %>% mutate(iter = i)
  results <- rbind(results, d)
}

results %>%
  pivot_longer(cols = c(ends_with("lambda")), names_to = "chains", values_to = "lambda" ) %>% 
  mutate(chains = gsub(".lambda", "", chains)) %>% 
  ggplot(aes(x = lambda)) + geom_density() + facet_wrap(iter~., nrow = 4, ncol = 5)

I get the following density distribution on the parameter

bernouli draws expt1

When I reverse the probability for 0 and 1 to c(0.6, 0.4), I get the following

bernouli daraws expt2

I have 2 questions:

  1. When I create samples from c(0,1) with probability c(0.4, 0.6). I expect lambda to be around 0.6, atleast for the dataset with 10,000 samples. However the posterior mode is ~0.4.

  2. When I create samples from c(0,1) with probability c(0.6, 0.4). I expect lambda to be around 0.4, atleast for the dataset with 10,000 samples. The posterior mode is close to 0.


Solution

  • That's because you use a logit-Bernoulli distribution.

    Then, in the first situation, the posterior concentrates about:

    > car::logit(0.6)
    [1] 0.4054651
    

    In the second situation, one has:

    > car::logit(0.4)
    [1] -0.4054651
    

    But your prior distribution on logit(p) is restricted to the range (0,1). So the posterior is also restricted to this range, and then it concentrates at 0.

    I don't know whether there is a function for the Bernoulli distribution parametrized by p in Stan. But you could do something like that (I'm not sure of the syntax):

    parameters {
      real<lower=0, upper=1> p;
    }
    transformed_parameters {
      lambda = log(p/(1-p)) // not sure of the syntax here
    }
    model {
      target += beta_lpdf(p | 1,1);
      for (n in 1:N) {
        target += bernoulli_logit_lpmf(obs_data[n] | lambda);
      }
    }