r statistics mean variance kernel-density

Calculating mean and variance from a density kernel using Monte Carlo methods

Assuming a probability distribution has a density kernel of , what Monte Carlo methods can I use to estimate the mean and variance of the distribution?

Solution

We can use numerical methods here. First of all, we create a function to represent your probability density function (though this is not yet scaled so that its integral is 1 over its full domain):

pdf <- function(x) x^2 * exp(-x^2/4)

plot(pdf, xlim = c(0, 10))

To get the scaling factor to turn this into a genuine pdf, we can integrate this function over its domain of c(0, Inf).

integrate(pdf, 0, Inf)$value
#> [1] 3.544908

So now we can generate a genuine pdf by dividing our original pdf by this amount:

pdf <- function(x)  x^2 * exp(-x^2/4) / 3.544908

plot(pdf, xlim = c(0, 10))

Now that we have a pdf, we can create a cdf with numerical integration:

cdf <- function(x) sapply(x, \(i) integrate(pdf, 0, i)$value)

plot(cdf, xlim = c(0, 10))

The inverse of the cdf is what we need, to be able to convert a sample taken from a uniform distribution between 0 and 1 into a sample drawn from our new distribution. We can create this inverse function using uniroot to find where the output of our cdf matches an arbitrary number between 0 and 1:

inverse_cdf <- function(p) 
{
  sapply(p, function(i) {
              uniroot(function(a) {cdf(a) - i}, c(0, 100))$root
  })
}

The inverse cdf looks like this:

plot(inverse_cdf, xlim = c(0, 0.99))

We are now ready to draw a sample from our distribution:

set.seed(1) # Makes this draw reproducible

x_sample <- inverse_cdf(runif(1000))

Now we can plot a histogram of our sample and ensure it matches the pdf:

hist(x_sample, freq = FALSE)

plot(function(x) pdf(x), add = TRUE, xlim = c(0, 6))

Now that we have a sample drawn from x, we can use the sample mean and variance as estimates for the distribution's mean and variance:

mean(x_sample)
#> [1] 2.264438

var(x_sample)
#> [1] 0.9265678

We can increase the accuracy of these estimates by taking larger and larger samples in our call to inverse_cdf(runif(1000)), by increasing the 1000 to a larger number.

^{Created on 2021-11-06 by the reprex package (v2.0.0)}