Search code examples
juliabayesianprobabilistic-programming

How to implement a most basic missing data model in Turing?


I would like to write in Julia the following toy Bernoulli model written in JAGS:

model{
  p0 ~ dbeta(1,1)
  for(i in 1:n){
    y[i] ~ dbern(p0)
  }     
}

Here's the complete driver

true_p0 <- 0.2
n <- 600
y <- sample(c(0,1), size=n, prob = c(1-true_p0, true_p0), replace=TRUE)
y[500:600] = NA # Some observations are missing!
data=list(y=y, n=n)
n_iter <- 1000
out<-jags.parallel(data=data, inits=NULL, n.iter=1000,
                   parameters.to.save = c('p0', 'y'), 
                   model.file = 'basic_missing.bug', 
                   n.chains = chain_count)

# The missing values are treated as parameters, so we can extract them: 
print(sum(out$BUGSoutput$sims.list$y[,550])/nrow(out$BUGSoutput$sims.list$y))
# > [1] 0.232375
# The non-missing values are treated as constants, so even if we extract them, there is no variability:
print(sum(out$BUGSoutput$sims.list$y[,200])/nrow(out$BUGSoutput$sims.list$y))
# [1] 0

Here's my Julia model:

using Random
using Distributions
using Turing
using StatsPlots
using Plots

Random.seed!(3)
true_p0=0.2
v=cat(rand(Bernoulli(true_p0), 500), [missing for _ in 1:100], dims=1)

@model function bernoulli_model(v) 
    p0 ~ Beta(1,1)
    v .~ Bernoulli(p0)
end
model=bernoulli_model(v)

nsamples = 1000
nchains = 8
sampler = MH()
chains = sample(model, sampler, MCMCThreads(), nsamples, nchains, discard_initial = 200, thinning=50);
> TaskFailedException
>
>    nested task error: TaskFailedException
>    
>        nested task error: DomainError with 0.0:
>        Beta: the condition α > zero(α) is not satisfied.

What do I do wrong?


Solution

  • I figured out the answer myself.

    There are two issues here:

    1. Bernoulli(p0) expects the values to be integers, and throws a cryptic error when they are booleans. So the first thing to do is to convert the input vector to int by e.g. multiplying the bool values by 1.
    2. v .~ Bernoulli(p0), a vectorized (aka broadcasting) operator "~" is a wrong construct here, probably because the function dispatching happens only once, but we are talking about two completely different operations, that are serviced by two completely different overloads of operator ~ over the Bernoulli distribution. I suspect that one variant is a simple sampling variant that just edits the target log-likelihood, which is applied when "v[i]" is known. The other one is the variant that defines a latent random variable that represents the missing data, which is applied when the "v[i]" is missing. In order to force Julia to dispatch a different operator ~ overload dynamically based on the type of the argument, we need to use the actual loop.

    Here's the corrected code:

    using Random
    using Distributions
    using Turing
    
    Random.seed!(3)
    true_p0=0.2
    v=cat(rand(Bernoulli(true_p0), 500), [missing for _ in 1:100], dims=1)
    
    @model function bernoulli_model(v) 
        p0 ~ Beta(1,1)
        for i in eachindex(v)
            v[i] ~ Bernoulli(p0)
        end
    end
    
    model=bernoulli_model(v)model=bernoulli_model(v)
    
    nsamples = 1000
    nchains = 8
    sampler = MH()
    chains = sample(model, sampler, MCMCThreads(), nsamples, nchains, discard_initial = 200, thinning=50);