Search code examples
rpackagestatisticsprobability-distribution

What causes this bug in mvtnorm package in R?


I am conducting some nonlinear optimization using multivariate probabilities as my objective function. I've spent hours thinking I had issue with the optimization algorithm, but actually I've tracked the bug down to the use of the mvtnorm package.

Code

library(mvtnorm)
pmvnorm(
  lower = c(-1.281552 , 7.089083, 0.5193308),
  upper = c(-1.200552, Inf, Inf),
  corr = diag(1, nrow =3)
) 

If you excecute this code, every 10-20 times it will return NaN as opposed to 3.047822e-15.

Why is this the case? Can anyone enlighten me? Also, is there an alternative to mvtnorm::pmvnorm() that will prevent this type of instability?

Edit:

  • according to @LMc in comments you can reproduce the error using seed = 10L as an example
  • @PBull gives a great answer below in the comments
  • @Gregor Thomas in comments describes that using Miwa() fixes it (though Miwa I believe approximates -Inf/+Inf to -1e4/1e4

Session info

R version 4.3.1 (2023-06-16) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux

Matrix products: default BLAS: /usr/lib64/libblas.so.3.4.2 LAPACK: /usr/lib64/liblapack.so.3.4.2

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: America/New_York tzcode source: system (glibc)

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] mvtnorm_1.2-4

loaded via a namespace (and not attached): [1] compiler_4.3.1 tools_4.3.1 rstudioapi_0.16.0


Solution

  • I can't claim to know much about the Genz-Bretz algorithm, but I recently ported mvt.f to C/SAS so I've been staring at its code for quite some time.

    The result of this routine depends on pseudo-random number generation through Monte Carlo lattice scrambling -- whatever that means -- hence your issue only occurs for certain seeds. Very specifically this is the offending line:

    Y(ND) = MVPHNV( DI + W(ND)*( EI - DI ) )
    

    MVPHNV is actually a wrapper for R's qnorm inverse normal CDF function. Because you start the integration in one of your variates from a pretty extreme value the remaining density is already dangerously close to machine epsilon:

    prt <- function(x, d=30) sprintf(paste0("%.", d, "f"), x)
    
    p <- pnorm(7.089083)
    #> 1
    
    ## Not *exactly* 1 however [see also pnorm(..., lower.tail = FALSE)]
    prt(p)
    #> "0.999999999999324984401027904823"
    

    The Genz-Bretz algorithm perturbs these variates through W(ND), DI and EI above, which in some cases causes it to veer even closer to 1. Eventually the following happens:

    qnorm(1)    ## Or within machine epsilon of 1
    #> Inf
    

    This isn't handled downstream, and a few additions/multiplications later the result becomes NaN from which it never recovers.

    Now, this is indeed a bug in mvtnorm since it doesn't address this edge case correctly, but the underlying issue is that you're trying to do calculations for which the accuracy might be questionable to start with. The "true" answer is within an order of magnitude above machine epsilon, and it having an error of zero is most likely an underflow in the first place. You really can't go much further out into the tails either before you run into problems in general:

    prt(pnorm(8.17))              ## Still not exactly 1
    #> 0.999999999999999888977697537484
    
    pnorm(8.17) == pnorm(8.29)    ## But 8.17 == 8.29?!
    #> TRUE
    
    prt(pnorm(8.30))
    #> 1.000000000000000000000000000000
    

    The calculation with lower.tail = FALSE is a little bit more accurate here but will be worse in the other direction. This is also the reason why mvtnorm::Miwa won't integrate out to infinity in your case; computers simply don't have infinite precision.