How to select data from a range within a density in R


I'm not sure how to tackle this. I have a data distribution where selecting by standard deviation does not capture all the points I want, because the data is more variable at one end than at the other. In a density plot, however, I can see that the points outside the 8th blue ring are exactly the ones I want to select.

Example code:

library(ggplot2)

x <- sort(rnorm(1300, mean = 0, sd = 1))
y <- rnorm(1300, mean = 0, sd = 1)
x <- c(x, rnorm(300, mean = 4, sd = 2), rnorm(600, mean = -2, sd = 2))
y <- c(y, rnorm(300, mean = 3, sd = 4), rnorm(600, mean = -2, sd = 2))

mydata <- data.frame(x,y)

ggplot(data = mydata, aes(x = x, y = y)) +
  geom_point(cex = 0.5) +
  geom_density_2d()

Solution

  • I adapted this from http://slowkow.com/notes/ggplot2-color-by-density/. Under the hood, geom_density_2d uses the MASS::kde2d function, so we can apply the same function to the underlying data, estimate the density at each point, and subset on that value.

    library(ggplot2)
    library(dplyr)  # provides %>% and filter(), used below
    
    set.seed(42)
    x <- sort(rnorm(1300, mean = 0, sd = 1))
    y <- rnorm(1300, mean = 0, sd = 1)
    x <- c(x, rnorm(300, mean = 4, sd = 2), rnorm(600, mean = -2, sd = 2))
    y <- c(y, rnorm(300, mean = 3, sd = 4), rnorm(600, mean = -2, sd = 2))
    
    mydata <- data.frame(x,y) 
    
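    # Copied from http://slowkow.com/notes/ggplot2-color-by-density/
    # Returns the 2D kernel density estimate at each (x, y) point.
    get_density <- function(x, y, n = 100) {
      # Estimate the density on an n x n grid spanning the data range
      dens <- MASS::kde2d(x = x, y = y, n = n)
      # Find the grid cell that each point falls into
      ix <- findInterval(x, dens$x)
      iy <- findInterval(y, dens$y)
      # Index dens$z with a two-column matrix to get one value per point
      ii <- cbind(ix, iy)
      return(dens$z[ii])
    }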
    mydata$density <- get_density(mydata$x, mydata$y)
    
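To see why the lookup inside get_density works, here is a minimal sketch on a toy grid. The names grid_x, grid_y, grid_z, px and py are made up for illustration; they stand in for dens$x, dens$y, dens$z and the data points. Only base R is used.

```r
# Toy 3 x 3 grid standing in for the output of MASS::kde2d (hypothetical values)
grid_x <- c(0, 1, 2)
grid_y <- c(0, 10, 20)
grid_z <- matrix(1:9, nrow = 3)  # grid_z[i, j] belongs to (grid_x[i], grid_y[j])

# Two illustrative points
px <- c(0.5, 2.0)
py <- c(15, 0)

# findInterval() gives, for each point, the index of the grid value at or below it
ix <- findInterval(px, grid_x)  # 1, 3
iy <- findInterval(py, grid_y)  # 2, 1

# A two-column matrix index picks one cell per row: grid_z[1, 2] and grid_z[3, 1]
looked_up <- grid_z[cbind(ix, iy)]
looked_up  # 4 3
```

Each point gets the density of the grid cell it falls in, so the result is an approximation whose resolution depends on n.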

    Select points based on arbitrary contour

    EDIT: Changed to allow selection based on contour levels

    # First create plot with geom_density
    gg <- ggplot(data = mydata, aes(x = x, y = y)) +
      geom_point(cex = 0.5) +
      geom_density_2d(size = 1, n = 100)
    gg
    
    # Extract the levels drawn as contours from the ggplot build
    #   object. I found these indices by examining the object in
    #   RStudio. Note that the layer index (2) would change if the
    #   layer order were altered.
    gb <- ggplot_build(gg)
    contour_levels <- unique(gb[["data"]][[2]][["level"]])
    # contour_levels
    # [1] 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08
    
    # Highlight in red the points whose density is at or below the
    #   lowest contour level, i.e. those outside the outermost ring
    gg2 <- gg +
      geom_point(data = mydata %>% 
                   filter(density <= contour_levels[1]), 
                 color = "red", size = 0.5)
    gg2
    
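The plot above only highlights the points. If the goal is to pull them out as a data frame, a sketch along the following lines may help. It repeats the grid lookup from get_density without ggplot2, and it assumes a threshold of 0.01, taken from the lowest contour level printed above; the names threshold, outer_points and inner_points are illustrative, and the threshold would need adjusting for other data.

```r
set.seed(42)
x <- sort(rnorm(1300, mean = 0, sd = 1))
y <- rnorm(1300, mean = 0, sd = 1)
x <- c(x, rnorm(300, mean = 4, sd = 2), rnorm(600, mean = -2, sd = 2))
y <- c(y, rnorm(300, mean = 3, sd = 4), rnorm(600, mean = -2, sd = 2))
mydata <- data.frame(x, y)

# Same grid lookup as get_density() in the answer
dens <- MASS::kde2d(mydata$x, mydata$y, n = 100)
mydata$density <- dens$z[cbind(findInterval(mydata$x, dens$x),
                               findInterval(mydata$y, dens$y))]

# Assumption: 0.01 is the lowest contour level reported above
threshold <- 0.01

# Split the data into points outside and inside the outermost contour
outer_points <- mydata[mydata$density <= threshold, ]
inner_points <- mydata[mydata$density > threshold, ]

nrow(outer_points)
```

Replacing threshold with another element of contour_levels would move the cut to a different ring.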

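Because the hard-coded layer index 2 breaks if layers are reordered, one possible alternative is to locate the density layer by the class of its geom. This is a sketch, not the answer's method: it assumes the layer was added with geom_density_2d (whose geom class is GeomDensity2d) and uses smaller made-up data; df, p, is_density and density_idx are illustrative names.

```r
library(ggplot2)

set.seed(42)
df <- data.frame(x = rnorm(500), y = rnorm(500))  # illustrative data

p <- ggplot(df, aes(x = x, y = y)) +
  geom_point(size = 0.5) +
  geom_density_2d()

# Find the density layer by its geom class instead of assuming it is layer 2
is_density <- vapply(p$layers,
                     function(l) inherits(l$geom, "GeomDensity2d"),
                     logical(1))
density_idx <- which(is_density)[1]

# layer_data() returns the computed data for one layer of the plot
contour_levels <- sort(unique(layer_data(p, density_idx)$level))
contour_levels
```

With this lookup, adding or reordering other layers should not change which layer the levels are read from.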