Search code examples
rpredictbrms

predict values from brm


I have a brm model that I'd like to generate predicted values for a fixed value of one parameter (length). I think I need to do this with the posterior_predict function (or possibly the posterior_epred function), but these functions both generate large matrices with no labels so it's impossible to know what is actually being generated.

library(tidyverse)
library(brms)
set.seed(42)
df <- data.frame(mass   = rnorm(100, mean = 50, sd = 5),
                 length = rnorm(100, mean = 20, sd = 2),
                 sex    = sample(c("f", "m"), 100, replace = TRUE),
                 site   = sample(LETTERS[1:3], 100, replace = TRUE))

m <- brm(mass ~ length * sex + site, data = df, prior = prior(normal(0, 10)))

newdata <- data.frame(site   = rep(LETTERS[1:3], each = 30),
                      sex    = rep(c("f", "m"), 45),
                      length = 20)
pred1 <- bind_cols(newdata, predict(m, newdata)) # results in a usable dataframe
pred2 <- posterior_predict(m, newdata)           # results in a large matrix with no labels
pred3 <- posterior_epred(m, newdata)             # results in a large matrix with no labels

ggplot(data = pred1, aes(x = Estimate, y = site, col = sex)) + geom_violin()

enter image description here


Solution

  • Explanation of posterior_predict output

    Both posterior_predict() and posterior_epred() produce a 4000x90 matrix. Why 4000 rows? It has 4000 rows because the model fit by brm() uses an MCMC algorithm to sample the joint posterior distribution of the parameters. By default it does this with four Markov chains that return 1000 samples each, for a total of 4000. Why 90 columns? Because newdata in this case has 90 rows. The columns of pred1 and pred2 correspond to the rows of newdata, in the same order.

    Incidentally the difference between posterior_predict() and posterior_epred() is similar to the difference between a prediction interval and a confidence interval around the mean in a frequentist analysis. posterior_predict() gives you draws from the posterior predictive distribution, which incorporates the residual error and thus has higher variance. In contrast posterior_epred() gives you the distribution of the expected value of the posterior.

    Solution without any additional packages

    To get a median and a 95% equal-tailed credible interval of the posterior predictive distribution for each row in newdata you could do this:

    posteriorpredictive_CI <- bind_cols(
      newdata,
      t(apply(pred2, 2L, quantile, probs = c(.5, .025, .975)))
    )
    

    Additional solution using tidybayes package

    You may find the tidybayes package makes this easier. The functions add_predicted_draws() (for the posterior predictive distribution) and add_epred_draws() (for the expected value of the posterior) output long-form "tidy" data frames. These have 360,000 rows because each row of newdata is repeated 4000 times, one for each of the posterior draws. Then the function median_qi(), imported from ggdist which is a dependency of tidybayes, can be used to summarize with equal-tailed credible intervals. This will give you similar results to the previous code, with some extra ID columns identifying the width of the interval and which row of newdata each row of the summary comes from.

    library(tidybayes)
    
    posteriorpredictive_CI <- newdata |>
      add_predicted_draws(m) |>
      median_qi()
    

    The width of the interval is 0.95 by default but can be changed with the .width argument of median_qi().