
mgcv: obtain predictive distribution of response given new data (negative binomial example)


In GAM (and GLM, for that matter), we're fitting a conditional likelihood model. So after fitting the model, for a new input x and response y, I should be able to compute the predictive probability or density of a specific value of y given x. I might want to do this to compare the fit of various models on validation data, for example. Is there a convenient way to do this with a fitted GAM in mgcv? Otherwise, how do I figure out the exact form of the density that is used so I can plug in the parameters appropriately?

As a specific example, consider a negative binomial GAM:

## From ?negbin
library(mgcv)
set.seed(3)
n <- 400
dat <- gamSim(1, n = n)
g <- exp(dat$f / 5)

## negative binomial data...
dat$y <- rnbinom(g, size = 3, mu = g)

## fit with theta estimation...
b <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), family = nb(), data = dat)

And now I want to compute the predictive probability of, say, y = 7, given x = (0.1, 0.2, 0.3, 0.4).


Solution

  • Yes. mgcv performs (empirical) Bayesian estimation, so you can obtain a predictive distribution. For your example, here is how.

    # prediction on the link scale (with standard error)
    l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), se.fit = TRUE)
    
    # By large-sample theory for GLMs, the linear predictor is normally distributed,
    # so with the `log` link, `mu = exp(eta)` is log-normally distributed
    p.mu <- function (mu) dlnorm(mu, l[[1]], l[[2]])
    
    # joint density of `y` and `mu`; `size = 3` is the true theta used to simulate
    # the data (in practice you might plug in the estimate, e.g. `b$family$getTheta(TRUE)`)
    p.y.mu <- function (y, mu) dnbinom(y, size = 3, mu = mu) * p.mu(mu)
    
    # marginal probability (not density, as the negative binomial is discrete) of `y`,
    # integrating out `mu`; written so that it can take vector input
    p.y <- function (y) {
      scalar.p.y <- function (scalar.y) integrate(p.y.mu, lower = 0, upper = Inf, y = scalar.y)[[1]]
      sapply(y, scalar.p.y)
    }
    

    Now, since you want the probability of y = 7 conditional on the specified new data, use

    p.y(7)
    # 0.07810065
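
    To compare models on validation data, as the question mentions, this extends to the mean log predictive probability over held-out rows. A minimal sketch, assuming the `log` link as above (the helper `log.pred.prob` and its `theta` argument are my own):

    log.pred.prob <- function (fit, newdata, y, theta) {
      l <- predict(fit, newdata = newdata, se.fit = TRUE)
      # predictive probability of observation `i`, integrating out `mu`
      p1 <- function (i) integrate(function (mu)
          dnbinom(y[i], size = theta, mu = mu) * dlnorm(mu, l$fit[i], l$se.fit[i]),
        lower = 0, upper = Inf)$value
      mean(log(sapply(seq_along(y), p1)))
    }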
    

    In general, this numerical-integration approach is not always this easy. For example, if another link function such as sqrt() is used for the negative binomial, the distribution of mu on the response scale is no longer log-normal (though it is not difficult to derive).
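
    One way to sidestep the change of variables is to integrate over the normally distributed linear predictor `eta` instead of `mu`, which works for any link. A sketch (the names `p.y.eta` and `p.y.general` are mine; `l` and `b` are as above):

    ## joint density of `y` and the linear predictor `eta`
    p.y.eta <- function (y, eta)
      dnbinom(y, size = 3, mu = b$family$linkinv(eta)) * dnorm(eta, l[[1]], l[[2]])
    
    ## marginal probability of `y`, integrating out `eta`
    p.y.general <- function (y) {
      scalar.p.y <- function (scalar.y)
        integrate(p.y.eta, lower = -Inf, upper = Inf, y = scalar.y)[[1]]
      sapply(y, scalar.p.y)
    }
    
    p.y.general(7)  # should agree with p.y(7) under the `log` link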

    Next, I offer a sampling-based, or Monte Carlo, approach. This is most similar to a fully Bayesian procedure.

    N <- 1000  # sample size
    
    set.seed(0)
    
    ## draw N samples from the posterior of `mu`
    sample.mu <- b$family$linkinv(rnorm(N, l[[1]], l[[2]]))
    
    ## draw N samples from the likelihood `Pr(y | mu)`
    sample.y <- rnbinom(N, size = 3, mu = sample.mu)
    
    ## Monte Carlo estimate of `Pr(y = 7)`
    mean(sample.y == 7)
    # 0.076
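
    The same draws also give an estimate of the whole predictive pmf, not just Pr(y = 7); for example:

    table(sample.y) / N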
    

    Remark 1

    Note that, being empirical Bayes, all of the above is conditional on the estimated smoothing parameters. If you want something closer to a "full Bayes" treatment that also accounts for smoothing-parameter uncertainty, set unconditional = TRUE in predict().
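
    That is, the earlier predict() call becomes (assuming the smoothing-parameter-corrected covariance matrix is available for this fit):

    l <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4),
                 se.fit = TRUE, unconditional = TRUE)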

    Remark 2

    Perhaps some people assume the solution is as simple as this:

    mu <- predict(b, newdata = data.frame(x0 = 0.1, x1 = 0.2, x2 = 0.3, x3 = 0.4), type = "response")
    dnbinom(7, size = 3, mu = mu)
    

    Such a result is conditional on the estimated regression coefficients (treated as fixed, with no uncertainty), so mu is a fixed value rather than a random variable. This is not a predictive distribution: a predictive distribution integrates out the uncertainty of the model estimates.
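
    To see the difference, compare the plug-in value with the predictive value from p.y() above; they will generally differ, since the plug-in version ignores estimation uncertainty:

    c(plug.in = dnbinom(7, size = 3, mu = unname(mu)), predictive = p.y(7))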