Search code examples
rggplot2plotbayesianmcmc

ggplot equivalent of gelman.plot MCMC diagnostic in r


I am creating a series of MCMC diagnostic plots in r using ggplot. I realize there is already a package available in gg for MCMC plotting, but much of this is for my own education as well as practical use. One thing I can't seem to figure out is how to generate the gelman.plot in a ggplot framework.

The gelman.diag function only returns a simple data point and I would like to recreate the complete running chart as shown in gelman.plot.

Is anyone familiar with the algorithmic structure of the gelman potential scale reduction factor and/or a means to port its output to ggplot?

Thank you!


Solution

  • You haven't provided a reproducible example, so I've used the example here. We need the object called combinedchains from that example. In order to avoid cluttering the answer, I've put the code for that at the end of this post.

    Now we can run gelman.plot on combined.chains. This is the plot we want to duplicate:

    library(coda)
    gelman.plot(combined.chains)
    

    enter image description here

    To create a ggplot version, we need to get the data for the plot. I haven't done MCMC before, so I'm going to let gelman.plot generate the data for me. For your actual use case, you can probably just generate the appropriate data directly.

    Let's look at what gelman.plot is doing: We can see the code for that function by typing the bare function name in the console. A portion of the function code is below. The ... show where I've removed sections of the original code for brevity. Note the call to gelman.preplot, with the output of that function stored in y. Note also that y is returned invisibly at the end. y is a list containing the data we need to create a gelman.plot in ggplot.

    gelman.plot = function (x, bin.width = 10, max.bins = 50, confidence = 0.95, 
              transform = FALSE, autoburnin = TRUE, auto.layout = TRUE, 
              ask, col = 1:2, lty = 1:2, xlab = "last iteration in chain", 
              ylab = "shrink factor", type = "l", ...) 
    { 
      ...
      y <- gelman.preplot(x, bin.width = bin.width, max.bins = max.bins, 
                          confidence = confidence, transform = transform, autoburnin = autoburnin)
    ...
      return(invisible(y))
    }
    

    So, let's get the data that gelman.plot returns invisibly and store it in an object:

    gp.dat = gelman.plot(combinedchains)
    

    Now for the ggplot version. First, gp.dat is a list and we need to convert the various parts of that list into a single data frame that ggplot can use.

    library(ggplot2)
    library(dplyr)
    library(reshape2)
    
    df = data.frame(bind_rows(as.data.frame(gp.dat[["shrink"]][,,1]), 
                              as.data.frame(gp.dat[["shrink"]][,,2])), 
                    q=rep(dimnames(gp.dat[["shrink"]])[[3]], each=nrow(gp.dat[["shrink"]][,,1])),
                    last.iter=rep(gp.dat[["last.iter"]], length(gp.dat)))
    

    For the plot, we'll melt df into long format, so that we can have each chain in a separate facet.

    ggplot(melt(df, c("q","last.iter"), value.name="shrink_factor"), 
           aes(last.iter, shrink_factor, colour=q, linetype=q)) + 
      geom_hline(yintercept=1, colour="grey30", lwd=0.2) +
      geom_line() +
      facet_wrap(~variable, labeller= labeller(.cols=function(x) gsub("V", "Chain ", x))) +
      labs(x="Last Iteration in Chain", y="Shrink Factor",
           colour="Quantile", linetype="Quantile") +
      scale_linetype_manual(values=c(2,1))
    

    enter image description here


    MCMC example code to create the combinedchains object (code copied from here):

    trueA = 5
    trueB = 0
    trueSd = 10
    sampleSize = 31
    
    x = (-(sampleSize-1)/2):((sampleSize-1)/2)
    y =  trueA * x + trueB + rnorm(n=sampleSize,mean=0,sd=trueSd)
    
    likelihood = function(param){
      a = param[1]
      b = param[2]
      sd = param[3]
    
      pred = a*x + b
      singlelikelihoods = dnorm(y, mean = pred, sd = sd, log = T)
      sumll = sum(singlelikelihoods)
      return(sumll)
    }
    
    prior = function(param){
      a = param[1]
      b = param[2]
      sd = param[3]
      aprior = dunif(a, min=0, max=10, log = T)
      bprior = dnorm(b, sd = 5, log = T)
      sdprior = dunif(sd, min=0, max=30, log = T)
      return(aprior+bprior+sdprior)
    }
    
    proposalfunction = function(param){
      return(rnorm(3,mean = param, sd= c(0.1,0.5,0.3)))
    }
    
    run_metropolis_MCMC = function(startvalue, iterations) {
      chain = array(dim = c(iterations+1,3))
      chain[1,] = startvalue
      for (i in 1:iterations) {
        proposal = proposalfunction(chain[i,])
    
        probab = exp(likelihood(proposal) + prior(proposal) - likelihood(chain[i,]) - prior(chain[i,]))
    
        if (runif(1)  <  probab){
          chain[i+1,] = proposal
        }else{
          chain[i+1,] = chain[i,]
        }
      }
      return(mcmc(chain))
    }
    
    startvalue = c(4,2,8)
    chain = run_metropolis_MCMC(startvalue, 10000)
    
    chain2 = run_metropolis_MCMC(startvalue, 10000)
    combinedchains = mcmc.list(chain, chain2)
    

    UPDATE: gelman.preplot is an internal coda function that's not directly visible to users. To get the function code, in the console type getAnywhere(gelman.preplot). Then you can see what the function is doing and, if you wish, construct your own function to return the appropriate diagnostic data in a form more suitable for ggplot.