Tags: r, plot, ggplot2, data-visualization, hexagonal-tiles

Using hex binning for downsampling QQ plots


My datasets are fairly large, so rendering the generated QQ plots is slow and sometimes even freezes my browser. I know that one option is simply to downsample the data vector. However, I wanted to try hex binning instead of downsampling. Unfortunately, I couldn't make it work (two of my several attempts are shown below). If the same reduction can be achieved with hex binning (which I suspect it can, since it's similar to a histogram), I'd appreciate it if someone could show me how to do it. I use ggplot2. Thanks!

g <- ggplot(df, aes(x=var)) + stat_qq(aes(x = var), geom = "hex")

g <- ggplot(df, aes(x = var, y = ..density..)) + 
    geom_hex(aes(sample = var), stat = "qq")

print (g)

The first call results in the following error message:

Error: stat_qq requires the following missing aesthetics: sample

The second call results in this message:

Error in eval(expr, envir, enclos) : object 'density' not found

UPDATE: I think a more correct variant is the following, but I'm not sure what the arguments should be:

g <- ggplot(df, aes(??, ??)) +  stat_binhex()

Solution

  • Not sure if this is exactly what you are looking for, but here are a couple of ways to do 2D binning. The first uses ggplot2, which you are trying to work with (note that bin2d gives rectangular bins); the second uses the hexbin package, which does true hexagonal binning and looks better to me, though that's just my preference.

        library(ggplot2)
    
        x <- rgamma(1000,8,2)
        y <- rnorm(1000,4,1.5)
        binFrame <- data.frame(x,y)
    
        qplot(x,y,data=binFrame, geom='bin2d') # with ggplot...rectangular binning actually
    
        library(hexbin)
        hexbinplot(y~x, data=binFrame) # with hexbin...actually hexagonal binning
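
    As the comment above notes, `bin2d` gives rectangular bins. If hexagons are wanted in ggplot2 itself, a minimal sketch could use `geom_hex()`, which relies on the hexbin package being installed. The `bins = 30` value, the fixed seed, and the `hexPlot` name are arbitrary choices for illustration, not part of the original answer.

    ```r
    library(ggplot2)  # geom_hex() also needs the hexbin package installed

    set.seed(1)  # assumption: fixed seed so the sketch is reproducible
    x <- rgamma(1000, 8, 2)
    y <- rnorm(1000, 4, 1.5)
    binFrame <- data.frame(x, y)

    # hexagonal counterpart of the bin2d plot; fill shows the count per hexagon
    hexPlot <- ggplot(binFrame, aes(x, y)) + geom_hex(bins = 30)
    print(hexPlot)
    ```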
    

    Edit:

    So I was thinking about this a bit at lunch, and I think the fundamental issue is that hex binning is a multidimensional (two-variable) data reduction technique, whereas you are trying to draw univariate QQ plots for a really large sample, but with hex bins in ggplot. At any rate, I can think of a way to do hex bin plots with ggplot: the best I came up with is to start from scratch and manually construct both the theoretical quantiles (x) and the sample quantiles (y). Here is what I came up with.

    Basic QQ-Plot Manually

    # set up a manual QQ plot, used for plotting both with and without hex bins
    
        xSamp <- rgamma(1000,8,.5) # sample data
        len <- length(xSamp) # sample size, taken from the data so the two cannot drift apart
        i <- seq(1,len,by=1)
        probSeq <- (i-.5)/len # probability grid
        invCDF <- qnorm(probSeq,0,1) # theoretical quantiles for standard normal, but you could compare your sample to any distribution
        orderGam <- xSamp[order(xSamp)] # ordered sample
        df <- data.frame(invCDF,orderGam)
    
        plot(invCDF,orderGam,xlab="Standard Normal Theoretical Quantiles",ylab="Sample Quantiles",main="QQ-Plot")
        abline(lm(orderGam~invCDF),col="red",lwd=2) # least-squares reference line

    Regular QQ Plot
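
    To apply the same construction to your own vector (`df$var` in the question), the manual steps can be wrapped in a small helper. This is only a sketch: `qq_frame` and `myVar` are hypothetical names, and it uses `ppoints()`, which for more than 10 points gives the same `(i - 0.5)/n` probability grid as the manual code.

    ```r
    # hypothetical helper: theoretical vs. sample quantiles for any numeric vector
    qq_frame <- function(x, qfun = qnorm, ...) {
      x <- sort(x[!is.na(x)])      # ordered sample, missing values dropped
      p <- ppoints(length(x))      # (i - 0.5)/n when n > 10
      data.frame(theoretical = qfun(p, ...), sample = x)
    }

    set.seed(1)
    myVar <- rgamma(1e5, 8, 0.5)   # stand-in for df$var from the question
    qq <- qq_frame(myVar)
    head(qq)
    ```

    The resulting data frame has one row per observation, so it still needs binning before plotting; its two columns are the x and y that `aes()` expects.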

    QQ Plot With Hexbins in ggplot:

     ggplot(df, aes(invCDF, orderGam)) + stat_binhex() + geom_smooth(method="lm")
    [Figure: QQ plot with hex bins in ggplot]
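
    If the goal is specifically to send fewer points to the renderer, one possible route is to do the hex binning yourself with the hexbin package and keep only the cell centers and counts. The sketch below assumes `hexbin()` and `hcell2xy()` from hexbin; the sample size, `xbins = 60`, and the `reduced` data frame are illustrative choices rather than anything from the original answer.

    ```r
    library(hexbin)

    set.seed(1)
    n <- 1e5
    xSamp <- rgamma(n, 8, 0.5)         # large sample, as in the question
    invCDF <- qnorm(ppoints(n))        # theoretical quantiles
    orderGam <- sort(xSamp)            # sample quantiles

    hb <- hexbin(invCDF, orderGam, xbins = 60)  # bin the QQ points into hexagons
    centers <- hcell2xy(hb)                     # centers of the non-empty cells
    reduced <- data.frame(invCDF = centers$x,
                          orderGam = centers$y,
                          count = hb@count)     # number of points in each cell

    nrow(reduced)  # number of non-empty cells, far fewer than n
    plot(reduced$invCDF, reduced$orderGam, pch = 16,
         xlab = "Standard Normal Theoretical Quantiles",
         ylab = "Sample Quantiles")
    ```

    Because QQ points lie along a curve, only the cells that the curve passes through are non-empty, so the reduced data frame stays small while the sparse tails are still represented.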

    At the end of the day, this might not scale up readily. If you are looking for true multidimensional tests of normality, though, you might consider chi-square plots for multivariate normality. Cheers.
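
    For reference, a chi-square plot compares the ordered squared Mahalanobis distances of the observations with chi-square quantiles (degrees of freedom equal to the number of variables); points falling roughly along the identity line are consistent with multivariate normality. A minimal base R sketch on simulated data (the data and variable names are made up for illustration):

    ```r
    set.seed(42)
    n <- 500   # observations
    p <- 3     # variables
    X <- matrix(rnorm(n * p), ncol = p)   # simulated multivariate normal data

    # squared Mahalanobis distance of each row from the sample mean
    d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
    chiQ <- qchisq(ppoints(n), df = p)    # theoretical chi-square quantiles

    plot(chiQ, sort(d2),
         xlab = "Chi-square Quantiles", ylab = "Ordered Squared Mahalanobis Distances",
         main = "Chi-square Plot")
    abline(0, 1, col = "red", lwd = 2)    # identity line as the reference
    ```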