Search code examples
rscatterkernel-density

Can someone explain what these lines of code mean?


I have been trying to find a way to make a scatter plot with colour intensity that is indicative of the density of points plotted in the area (it's a big data set with lots of overlap). I found these lines of code which allow me to do this but I want to make sure I actually understand what each line is actually doing. Thanks in advance :)

get_density <- function(x, y, ...){
  dens <- MASS::kde2d(x, y, ...)
  ix <- findInterval(x, dens$x)
  iy <- findInterval(y, dens$y)
  ii <- cbind(ix, iy)
  return(dens$z[ii])
}

set.seed(1)

dat <- data.frame(x = subset2$conservation.phyloP, y = subset2$gene.expression.RPKM)
dat$density <- get_density(dat$x, dat$y, n = 100)  

Solution

  • Below is the function with some explanatory comments, let me know if anything is still confusing:

    # The function "get_density" takes two arguments, called x and y
    # The "..." allows you to pass other arguments 
    get_density <- function(x, y, ...){ 
    
    # The "MASS::" means it comes from the MASS package, but makes it so you don't have to load the whole MASS package and can just pull out this one function to use. 
    # This is where the arguments passed as "..." (above) would get passed along to the kde2d function
    dens <- MASS::kde2d(x, y, ...)
    # These lines use the base R function "findInterval" to get the density values of x and y
    ix <- findInterval(x, dens$x)
    iy <- findInterval(y, dens$y)
    # This command "cbind" pastes the two sets of values together, each as one column
    ii <- cbind(ix, iy)
    # This line takes a subset of the "density" output, subsetted by the intervals above
    return(dens$z[ii])
    }  
    
    # The "set.seed()" function makes sure that any randomness used by a function is the same if it is re-run (as long as the same number is used), so it makes code more reproducible
    set.seed(1)
    
    dat <- data.frame(x = subset2$conservation.phyloP, y = subset2$gene.expression.RPKM)
    dat$density <- get_density(dat$x, dat$y, n = 100)  
    

    If your question is about the MASS::kde2d function itself, it might be better to rewrite this StackOverflow question to reflect that!

    It looks like the same function is wrapped into a ggplot2 method described here, so if you switch to making your plot with ggplot2 you could give it a try.