Search code examples
rnormal-distributionprobability-density

Estimating probability density in a range between two x values on simulated data


I want to evaluate the probability density of data I've simulated.

  1. I know that if I simply want to find the probability density for a single x value on the normal distribution, I could use dnorm() in the following way:
dist_mean <- 10
dist_sd <- 0.2
prob_density_on_x_val <- dnorm(x = 9.9,
                               mean = dist_mean,
                               sd = dist_sd)

prob_density_on_x_val

[1] 1.760327
  1. But what if I want to evaluate the probability density for a range between two x values in a simulated data?
dist_mean <- 10
dist_sd <- 0.2

## simulate 100,000 values from the normal distribution, 
## given specific mean and standard deviation.
set.seed(123)
random_vals <- rnorm(n = 100000,
                     mean = dist_mean,
                     sd = dist_sd)


hist(random_vals)

histogram

  1. My 100,000 generated values are raw values, and they do take a normal shape. However, this is not a probability density function, as the area under the curve isn't equal to 1.
library("pracma")
trapz(random_vals)

random_vals
[1] 1000009

My questions:

  1. Given my simulated data, how can I create a probability density function for it?
  2. Once created, how could I estimate the: (1) probability under the curve, and (2) probability density on the curve, for a range between two x values? For example, the probability and probability density between x=9.7 and 10.2. Or any other range.

My attempt to figure this out:

In this comment, @Glen_b says using ecdf() is the way to calculate probability in a range between two x values "a" and "b": ecdf(b)-ecdf(a). However, something doesn't make sense, because:

cdf <- ecdf(random_vals)
range_density <- cdf(10.2)-cdf(9.7)

range_density
[1] 0.77358

How is it possible that the probability density on a point value (x=9.9) was 1.76, but for a range 9.7<x<10.2 it's smaller (0.77)? Both distributions (both the one defined with dnorm and the one simulated with rnorm) have the same mean and sd.

So I think I'm missing something fundamental, and would be grateful for any help. Overall, it seems like a very simple question, but I can't find a straightforward solution despite a lot of reading and digging.

Thanks!

Edit

The thing I was missing was the distinction between:

  • probability for a range of x values: the area under the curve of the pdf
  • probability density for a specific x value: the function's value for a given x value (this is what dnorm() is useful for)
  • probability density for a range along the pdf curve, between two x values (chosen answer + comments addresses that)

Solution

  • You can get the probability density function using the functions density and approxfun.

    DensityFunction = approxfun(density(random_vals), rule=2)
    DensityFunction(9.7)
    [1] 0.6410087
    plot(DensityFunction, xlim=c(9,11))
    

    Probability density function

    You can get the area under the curve using integrate

    AreaUnderCurve = function(lower, upper) {
        integrate(DensityFunction, lower=lower, upper=upper) }
    
    AreaUnderCurve(10,11)
    0.5006116 with absolute error < 6.4e-05
    AreaUnderCurve(9.5,10.5)
    0.9882601 with absolute error < 0.00011
    

    You also ask:

    How is it possible that that the probability density on point value (x=9.9) was 1.76, but for a range 9.7

    The value of the pdf (1.76) is the height of the curve. The value you are getting for the range is the area under the curve. Since the width of the interval is 0.5, it is not a surprise that the area under the curve is less than the height.