Search code examples
rnormal-distribution

How to calculate multivariate normal distribution function in R


Here's what I tried, making use of the mvtnorm package

Sample Dataset

library(mvtnorm)

set.seed(2357)
df <- data.frame(
  x = rnorm(1000, mean=80, sd=20),
  y = rnorm(1000, mean=0, sd=5),
  z = rnorm(1000, mean=0, sd=5)
)

head(df)
      x      y       z
1 70.38  1.307  0.2005
2 59.76  5.781 -3.5095
3 54.14 -1.313 -1.9022
4 79.91  7.754 -6.2076
5 87.07  1.389  1.1065
6 75.89  1.684  6.2979

Fit multivariate normal dist and check P(x <= 80) ~ 0.5

# Get the dimension means and correlation matrix
means <- c(x=mean(df$x), y=mean(df$y), z=mean(df$z))
corr <- cor(df)

# Check P(x <= 80)
sum(df$x <= 80)/nrow(df)  # 0.498
pmvnorm(lower=-Inf, upper=c(80, Inf, Inf), mean=means, corr=corr)  # 0.8232

Why is the fitted result 0.82? Where did I go wrong?


Solution

  • First, you don't need to simulate anything to study the pmvnorm function:

    pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=c(80,0,0), corr=diag(rep(1,3)))
    

    The result is 0.5, as you expected.

    Your means vector is approximately (79, 0, 0), so let's try it:

    pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=c(79,0,0), corr=diag(rep(1,3)))
    

    The result now is 0.8413447. There's nothing the matter. By specifying only the correlation matrix, you told the software to assume that all variances were unity. In your simulation, the variances were 400, 25, and 25: very different from what you specified in the arguments!

    The correct calculation uses the covariance matrix of the data, not its correlation matrix:

    pmvnorm(lower=rep(-Inf, 3), upper=c(80, Inf, Inf), mean=means, sigma=cov(df))
    

    The result is 0.5178412, quite in keeping with the data.