r, na, bayesian, stan, rstan

Dealing with NAs in Bayesian models


I built a Bayesian model with the usual components (data, parameters, model, likelihood).

This model is a linear regression:

library(ggplot2)
#library (ggedit)
library(plyr)
library(StanHeaders)
library(rstan)

# Equation (1)

for(i in 1:N){
    alphaC_P[i] ~ normal ((alphaC_A[Date[i]]) * (1- (F_T[Date[i]])) +
                            alphaC_T[i] * (F_T[Date[i]]), sigma_C);
    }

Because of its memory requirements, I run this analysis on a cluster.
I first prepare the list of data elements (Equation (2): mylist <- list()).

Finally, I run the Bayesian analysis on the cluster.

# Equation (3)

rstan::stan(file=args[2], data= mylist, cores=12, warmup= 48000, 
                          iter= 50000, chains= 4, seed = 14)

# args[2] is the path to the Stan model file

My data contain NAs, so my question is:
Where should I put the instruction to omit/ignore/exclude the NAs?
For example, should it go in Equation (1), (2) or (3)?

Finally, which should I do: omit, ignore, or exclude them?

Thanks in advance


Solution

  • Your code example is not very clear. For example, in your first code snippet, you're mixing R with Stan syntax.

    From a Stan perspective it's very simple: Stan does not accept NAs in data. You can do two things:

    • Either replace the NAs with some arbitrary sentinel value and have your Stan model check for that magic number and handle it in a suitable way (e.g. skip, replace or impute the affected observations),
    • Or deal with the NAs at the R level (e.g. remove, replace or impute them) before sending the data to the Stan model.
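
The second option (handling NAs in R, i.e. at the step the question calls Equation (2)) can be sketched as below. This is only a minimal illustration that drops incomplete rows: the data frame `df` and its values are made up, the column names are borrowed from the question, and whether dropping rows is appropriate at all depends on your data.

```r
# Hypothetical data; column names follow the question, values are invented.
df <- data.frame(
  alphaC_P = c(1.2, NA, 0.8, 1.5),
  alphaC_T = c(0.5, 0.7, NA, 0.9),
  Date     = c(1L, 1L, 2L, 2L)
)

# Keep only rows with no NA in any column that will be passed to Stan.
keep <- complete.cases(df[, c("alphaC_P", "alphaC_T", "Date")])
df_complete <- df[keep, ]

# Build the data list from the complete rows only.
mylist <- list(
  N        = nrow(df_complete),
  alphaC_P = df_complete$alphaC_P,
  alphaC_T = df_complete$alphaC_T,
  Date     = df_complete$Date
)

# Stan rejects NA in data, so verify before calling rstan::stan().
stopifnot(!any(vapply(mylist, anyNA, logical(1))))
```

If `Date` is used as an index into per-date parameters, as in the question, check after filtering that every date index you declare in the model still makes sense, since removing rows can leave some dates without observations.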

    As to how to deal with NAs, that really depends on your data and data collection process, which only you know the details of (in other words, this needs domain-specific knowledge).
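
Before choosing between removing, replacing and imputing, it can help to look at where the NAs actually are. A rough sketch in base R, again using a made-up data frame `df` with column names borrowed from the question:

```r
# Hypothetical data; column names follow the question, values are invented.
df <- data.frame(
  alphaC_P = c(1.2, NA, 0.8, 1.5),
  alphaC_T = c(0.5, 0.7, NA, 0.9),
  Date     = c(1L, 1L, 2L, 2L)
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(df))

# Number of incomplete rows within each date.
na_per_date <- tapply(!complete.cases(df), df$Date, sum)

print(na_per_column)
print(na_per_date)
```

If the missing values are concentrated in one column or in a few dates, that is a hint that simply dropping rows may bias the result, which is exactly where the domain knowledge comes in.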

    Lastly, a lot of operations in Stan are vectorised. So instead of writing e.g.

    for (i in 1:N)
        y[i] ~ normal(mu[i], sigma);
    

    you can (and should) write

    y ~ normal(mu, sigma);