r subset density-plot

R - How to subset data based on density distribution?


I have a lot of data and would like to exclude values that occur rarely, in order to remove background signal from my data.

Here is an example illustrating the question:

library(ggplot2)
data <- rnorm(10000)
ggplot() +
  geom_density(aes(data)) +
  geom_hline(yintercept = 0.1)

I know how to subset data frames in general, but could not figure out how to do so based on the density distribution. density(data)$y gives me the values needed to reconstruct the curve, but these have no one-to-one counterpart in the dataset (there is not one density value per entry) and therefore can't be used directly to subset the data.

How can I extract the values from the data that correspond to the part of the curve above the line?

Any help would be much appreciated. Thanks and greetings!


Solution

  • library(ggplot2)
    data <- rnorm(10000)
    ggplot() +
      geom_density(aes(data)) +
      geom_hline(yintercept = 0.1)
      
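    # kernel density estimate, evaluated on a grid (512 points by default)
    d = density(data)
    # grid points where the estimated density exceeds the threshold
    above = d$y > 0.1
    y = d$y[above]
    x = d$x[above]
    # mark the part of the curve that lies above the line
    ggplot() +
      geom_density(aes(data)) +
      geom_hline(yintercept = 0.1) +
      geom_point(aes(x, y), shape = "X", color = "red")
    # keep the observations that fall between the outermost grid points
    # above the line (this assumes a single region above the threshold)
    x_max = max(x)
    x_min = min(x)
    subset_data = data[(data < x_max) & (data > x_min)]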
    

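    The range-based subset keeps everything between the outermost points where the curve is above the line, which is fine for a single peak. If the density could cross the threshold more than once (for example with two peaks), one possible alternative is to interpolate the estimated density at each observation with approx() and filter on that directly. This is only a sketch: the seed and the names threshold, dens_at_data and keep are assumptions made for illustration and are not part of the original answer.

```r
# Sketch: keep only observations whose estimated density exceeds a threshold.
set.seed(42)                 # assumed seed, only so the example is reproducible
data <- rnorm(10000)
threshold <- 0.1             # same cutoff as the horizontal line in the plot

# Kernel density estimate on a grid (512 points by default)
d <- density(data)

# Linearly interpolate the density curve at every observed value,
# giving one density value per entry of `data`
dens_at_data <- approx(d$x, d$y, xout = data)$y

# Logical index: TRUE where the observation sits above the threshold
keep <- dens_at_data > threshold
subset_data <- data[keep]
```

    Because keep is a logical vector with one element per observation, the same index can be used to filter the rows of a data frame that contains data as a column.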