Search code examples
rggplot2lmggallycoefficient-of-determination

Tweaking ggpairs() or a better solution to a correlation matrix


I am trying to create a correlation matrix between my X and Y variables and display this information in a nice figure. I am currently using ggpairs() from the GGally package, but if there's a better way to do this then I am happy to try something new. The figure should:

-Fit linear regression models (using lm) between X and Y variables

-Display scatterplots with a regression line

-Display the Coefficient of the determination (R2)

-Map the colour of points/lines/R2 values by group

I have been able to do most of this, but ggpairs only displays the correlation coefficient (r) and not the coefficient of determination (R2). I was able to use the suggestion from this post, but unfortunately the solution does not display R2 values by group.

So far:

library(GGally)
library(ggplot2)

cars <- mtcars
cars$group <- factor(c(rep("A", 16), rep("B", 16))) #adding grouping variable

#function to return R2 (coefficient of determination) and not just r (Coefficient of correlation) in the top portion of the figure
upper_fn <- function(data, mapping, ndp=2, ...){
  
  # Extract the relevant columns as data
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  
  # Calculate the r^2 & format output
  m <- summary(lm(y ~ x))
  lbl <- paste("r^2: ", formatC(m$r.squared, digits=ndp, format="f"))
  
  # Write out label which is centered at x&y position
  ggplot(data=data, mapping=mapping) + 
    annotate("text", x=mean(x, na.rm=TRUE), y=mean(y, na.rm=TRUE), label=lbl, parse=TRUE, ...)+
    theme(panel.grid = element_blank()) 
}


#lower function basically fits a linear model and displays points 
lower_fn <- function(data, mapping, ...){
  p <- ggplot(data = data, mapping = mapping) + 
    geom_point(alpha = 0.7) + 
    geom_smooth(method=lm, fill="blue", se = F, ...)
  p
}

#The actual figure
  ggpairs(cars,
    columns = c(1:11),
    mapping = ggplot2::aes(color = group),
    upper = list(continuous = "cor", size = 15),
    diag = list(continuous = "densityDiag", alpha=0.5),
    lower = list(continuous = lower_fn))

Solution

  • Based on Is it possible to split correlation box to show correlation values for two different treatments in pairplot?, below is a little code to get you started.

    The idea is that you need to 1. split the data over the aesthetic variable (which is assumed to be colour), 2. run a regression over each data subset and extract the r^2, 3. quick calculation of where to place the r^2 labels, 4. plot. Some features are left to do.

    upper_fn <- function(data, mapping, ndp=2, ...){
      
      # Extract the relevant columns as data
      x <- eval_data_col(data, mapping$x)
      y <- eval_data_col(data, mapping$y)
      col <- eval_data_col(data, mapping$colour)
    
      # if no colour mapping run over full data
      if(is.null(col)) {
            ## add something here
        }
    
      # if colour aesthetic, split data and run `lm` over each group
      if(!is.null(col)) {
        idx <- split(seq_len(nrow(data)), col)
        r2 <- unlist(lapply(idx, function(i) summary(lm(y[i] ~ x[i]))$r.squared))
    
        lvs <- if(is.character(col)) sort(unique(col)) else levels(col)
        cuts <- seq(min(y, na.rm=TRUE), max(y, na.rm=TRUE), length=length(idx)+1L)
        pos <- (head(cuts, -1) + tail(cuts, -1))/2 
    
        p <- ggplot(data=data, mapping=mapping, ...) + 
                geom_blank() + 
                theme_void() + 
                # you could map colours to each level 
                annotate("text", x=mean(x), y=pos, label=paste(lvs, ": ", formatC(r2, digits=ndp, format="f")))
        }
      
      return(p)
    }