Search code examples
rkernel-densityprobability-densitymapply

Efficiently find empirical density() for many arbitrary sample values (like dnorm(), but for empirical distribution)


Say you've defined an empirical density (sample.density) for a sample of x.sample as in the following:

set.seed(1)
x.sample <- rnorm(100)
sample.density <- density(x.sample)

Now say that we have a gradient, G, for which we would like to know the expected densities

G <- seq(-2,2, length.out=20)

Based on the empirical distribution sample.density, what is the density of each value in G?

If I use a for() loop, I can get the answers like this:

G.dens <- c()
for(i in 1:length(G)){
    t.G <- G[i]
    G.dens[i] <- sample.density$y[which.min(abs(sample.density$x-t.G))]
}

The overarching idea is to do something like dnorm(), but instead of assuming that x is normally distributed with specified mean and s.d., I'd like to use a distribution determined empirically from an arbitrary sample (that isn't necessarily normal or any other well-described distribution that would be in the stats package).


Solution

  • I think the comment by @MrFlick pointed me in the right direction. In addition to the suggested approxfun approach and my example for loop approach, I also realized I could use mapply. Note that my use of approxfun won't exactly match the result by the 2 other approaches which use which.min, but I'm not concerned with that difference in output too much, although others might be.

    First, reproducing the sample data from the question:
    set.seed(1)
    x.sample <- rnorm(100)
    sample.density <- density(x.sample)
    G <- seq(-2,2, length.out=20)
    

    Now, create a function for the loop version

    loop()

    loop <- function(x, ref){
        if(class(ref)!="density"){
            ref <- density(ref)
        }
    
        ref.y <- ref$y
        ref.x <- ref$x
    
        G.dens <- c()
        for(i in 1:length(x)){
            t.G <- x[i]
            G.dens[i] <- ref.y[which.min(abs(ref.x-t.G))]
        }
    
        G.dens
    }
    

    Next, I'll use the approach I came up with that uses mapply

    dsample()

    dsample <- function(x, ref){
    
        if(class(ref)!="density"){
            ref <- density(ref)
        }
    
        XisY <- function(x,y){ # which of several X values most closely matches a single Y value?
            which.min(abs(y-x))
        }
    
        ref.y <- ref$y
        ref.x <- ref$x
    
        # ds <- approxfun(ref)
    
        # ds(x)
    
        ref.y[mapply(XisY, x, MoreArgs=list(y=ref.x))]
    }
    

    Finally, the approach harnessing approxfun as suggested by @MrFlick:

    af()

    af <- function(x, ref){
        if(class(ref)!="density"){
            ref <- density(ref)
        }
    
        # XisY <- function(x,y){ # which of several X values most closely matches a single Y value?
        #   which.min(abs(y-x))
        # }
    
        ref.y <- ref$y
        ref.x <- ref$x
    
        ds <- approxfun(ref)
    
        ds(x)
    
        # ref.y[mapply(XisY, x, MoreArgs=list(y=ref.x))]
    }
    

    Now to compare their speed:

    microbenchmark(
        loop(G, sample.density),
        dsample(G, sample.density),
        af(G, sample.density)
    )
    # Unit: microseconds
    #                        expr     min       lq     mean  median       uq      max neval
    #     loop(G, sample.density) 221.801 286.6675 360.3698 348.065 409.9785  942.071   100
    #  dsample(G, sample.density) 252.641 290.7900 359.0432 368.388 417.1510  520.960   100
    #       af(G, sample.density) 201.331 227.8740 276.0425 253.339 273.6225 2545.081   100
    

    Now compare speed as the size of G increases:

    speed.loop <- c()
    speed.dsample <- c()
    speed.af <- c()
    lengths <- seq(20, 5E3, by=200)
    for(i in 1:length(lengths)){
        G <- seq(-2,2, length.out=lengths[i])
        bm <- microbenchmark(
            loop(G, sample.density),
            dsample(G, sample.density),
            af(G, sample.density), times=10
        )
    
        means <- aggregate(bm$time, by=list(bm$expr), FUN=mean)[,"x"]/1E6 # in milliseconds
        speed.loop[i] <- means[1]
        speed.dsample[i] <- means[2]
        speed.af[i] <- means[3]
    
    
    }
    
    speed.ylim <- range(c(speed.loop, speed.dsample, speed.af))
    plot(lengths, (speed.loop), ylim=(speed.ylim), type="l", ylab="Time (milliseconds)", xlab="# Elements in G")
    lines(lengths, (speed.dsample), col="red")
    lines(lengths, (speed.af), col="blue")
    

    enter image description here