Search code examples
rdistributionprobabilitymodeling

Calculate probability of point on 2d density surface


If I calculate the 2d density surface of two vectors like in this example:

library(MASS)
a <- rnorm(1000)
b <- rnorm(1000, sd=2)
f1 <- kde2d(a, b, n = 100)

I get the following surface

filled.contour(f1)

enter image description here

The z-value is the estimated density.

My question now is: Is it possible to calculate the probability of a single point, e.g. a = 1, b = -4

[as I'm not a statistician this is maybe the wrong wording. Sorry for that. I would like to know - if this is possible at all - with which probability a point occurs.]

Thanks for every comment!


Solution

  • If you specify an area, then that area has a probability with respect to your density function. Of course a single point does not have a probability different from zero. But it does have a non-zero density at that point. What is that then?

    The density is the limit of integral of that probability density integrated over the area divided by the normal area measure as the normal area measure goes to zero. (It was actual rather hard to state that correctly, needed a few tries and it is still not optimal).

    All this is really basic calculus. It is also fairly easy to write a routine to calculate the integral of that density over the area, although I imagine MASS has standard ways to do it that use more sophisticated integration techniques. Here is a quick routine that I threw together based on your example:

    library(MASS)
    n <- 100
    a <- rnorm(1000)
    b <- rnorm(1000, sd=2)
    f1 <- kde2d(a, b, n = 100)
    lims <- c(min(a),max(a),min(b),max(b))
    
    filled.contour(f1)
    
    prob <- function(f,xmin,xmax,ymin,ymax,n,lims){
      ixmin <- max( 1, n*(xmin-lims[1])/(lims[2]-lims[1]) )
      ixmax <- min( n, n*(xmax-lims[1])/(lims[2]-lims[1]) )
      iymin <- max( 1, n*(ymin-lims[3])/(lims[4]-lims[3]) ) 
      iymax <- min( n, n*(ymax-lims[3])/(lims[4]-lims[3]) )
      avg <- mean(f$z[ixmin:ixmax,iymin:iymax])
      probval <- (xmax-xmin)*(ymax-ymin)*avg
      return(probval)
    }
    prob(f1,0.5,1.5,-4.5,-3.5,n,lims)
    # [1] 0.004788993
    prob(f1,-1,1,-1,1,n,lims)
    # [1] 0.2224353
    prob(f1,-2,2,-2,2,n,lims)
    # [1] 0.5916984
    prob(f1,0,1,-1,1,n,lims)
    # [1] 0.119455
    prob(f1,1,2,-1,1,n,lims)
    # [1] 0.05093696
    prob(f1,-3,3,-3,3,n,lims)
    # [1] 0.8080565
    lims
    # [1] -3.081773  4.767588 -5.496468  7.040882
    

    Caveat, the routine seems right and is giving reasonable answers, but it has not undergone anywhere near the scrutiny I would give it for a production function.