Search code examples
rstatisticskernel-densityprobability-densitydensity-plot

1D kernel estimation to compare PDF ratios: how to set tails?


Trying to create 1D kernels from data observations (weekly time points). I have been using the density() function from the stats R package. I am doing this so I can take the ratio of a treatment's pdf and the control pdf along the x-axis, and then perform permutation tests with it.

Long story short, I know in my observations that at weeks 4 and 16 (the to and from values in the function), that there is no observations. Yet, my tails often go beyond these points.

From a statistical standpoint, I am not sure how to "force" the function to estimate the pdf with 0s at these points. Having 0s at the correct points (and same between all treatments) is critical for me being able to do the downstream analysis.

I have played around the bw, adjust, n, and to/from parameters, and also attempted to manually set the density values to 0 at known ends. I'm just not sure what to do from here, but I know I'm doing something very naive...

An example of the code I have is:

common_x = seq(4,16,by=0.02)

pdf_ctrl = density(control$week, bw=0.7, from=min(common_x),to=max(common_x), n=500)
pdf1 = density(treatment1$week, bw=0.7, from=min(common_x),to=max(common_x), n=500)

Then take the ratios at each point of x (since they have the same common x values)

Not sure if a better question for the stats exchange...unclear if I have a statistics problem or an R code issue.

No dummy code since it's so straightforward, but below is a plot example of what I am talking about, and happy to provide more code/data if helpful.

example estimated pdf


Solution

  • To understand why the density is not at zero even though you know the counts are zero at a certain point, you need to understand how a 1-D kernel estimate is generated.

    Essentially, density places a kernel at each point on the x axis where you have data. By default, this kernel is a normal distribution, with a standard deviation controlled by the bw parameter.

    The height of the kernel at each point is scaled by the total number of data points, so that if you have 100 data points, the bell curve will be 1/100 the height of a normal distribution.

    density then adds all those bell curves up, giving the final result.

    Let's take a short vector of integers spread between 5 and 15 to demonstrate:

    df <- data.frame(x = c(5, 5, 5, 7, 8, 8, 8, 9, 9, 10, 10, 
                           10, 10, 11, 11, 13, 13, 14, 15))
    

    Now let's plot the density curve of this vector:

    library(tidyverse)
    
    dens <- density(df$x)
    
    p <- ggplot(as.data.frame(dens[c("x", "y")]), aes(x, y)) +
      geom_line() +
      geom_vline(xintercept = 4, linetype = 2) +
      scale_x_continuous(breaks = 1:20)  +
      theme_bw(base_size = 16) +
      scale_y_continuous(limits = c(0, 0.15), expand = c(0, 0)) +
      guides(fill = "none")
    
    p
    

    Note that even though we had no values at x = 4, the density is not 0 here.

    To find out why, let us recreate the algorithm that density uses. We will get a little normal distribution at each point where we have data. The heights are scaled according to the count at each point:

    df2 <- df %>% 
      reframe(xval = seq(0, 20, 0.1),
              y = dnorm(seq(0, 20, 0.1), first(x), bw.nrd0(df$x)) * n()/nrow(df), 
              .by = x)
    
    p + geom_area(aes(x = xval, fill = factor(x)), 
                  data = df2, position = "identity")
    

    Notice that if we stack all these bell curves on top of each other we get our kernel density estimate:

    p + geom_area(aes(x = xval, fill = factor(x)), data = df2, position = "stack")
    

    We can see immediately that the reason for the non-zero density at x = 4 is that the kernel at x = 5 has "spilled over" onto 4. In theory, we could simply narrow the bandwidth so that this doesn't happen:

    plot(density(df$x, bw = 0.3))
    

    enter image description here

    But now of course we have a new problem: our nice smooth density kernel looks very spikey and is no longer smooth.

    So the size of the tails is inextricably linked to how smooth your density is. If you want the tail to hit zero at 4, you need a lower bandwidth, but then you have a lumpy density plot. If you want a smooth density plot, you can't expect it to magically tail off where your counts are 0. If you manually flatten the tails pist-hoc, then the area under the density curve is no longer 1 and therefore it is not a true p.d.f unless you manually normalize it.

    Of course the purists here would point out that a simple plot of counts would be best here; simple, complete, no assumptions and no wasted ink:

     plot(table(df$x))
    

    enter image description here

    Created on 2024-02-06 with reprex v2.0.2