Search code examples
rstatisticsnormal-distribution

Same code different outputs for fitting Truncated Normal Distribution


I am running the same piece of code, with the same seed, same package versions, same R version, on 3 different systems: 1) my computer 2) a linux cluster and 3) R snippets

packageVersion("truncnorm")
packageVersion("MASS")
set.seed(42)
fit<-NULL
x <- c(0.0916, 0.0084, 0.0442, 0.6254, 0.2021, 0.0135, 0.0259,
       0.1557,0.0191, 0.3575, 0.1843, 0.1792, 0.0476, 0.0765, 
       0.0356, 0.0039, 0.1714, 0.1222, 0.2872, 0.395, 0.3334,
       0.2223, 0.0096, 0.0436, 0.207)
mu0 <- mean(x)
sigma0 <- stats::sd(x)
fit <- MASS::fitdistr(x, densfun = function(xx, mu, sigma) {
    truncnorm::dtruncnorm(xx, a = 0, b = 1, mean = mu, sd = sigma)
}, 
   start = list(mu = mu0, sigma = sigma0), 
   lower = list(mu = -Inf, sigma = 0.05), 
   upper = list(mu = Inf, sigma = Inf))
print(fit)

On my computer, fit is showing as NULL, whereas in the other 2 systems, the model does fit successfully. Any ideas how that is possible?

P.S.: The issue on my system is

Error in MASS::fitdistr(x, densfun = function(xx, mu, sigma) { : optimization failed

If I change the data a little bit, for example remove 0.0084 from the data (this is the second number in the data), the model fits. Giving me the same outputs across all 3 systems.

Here is the sessionInfo() from my own system:

R version 3.6.0 (2019-04-26) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 [4] LC_NUMERIC=C
LC_TIME=English_United States.1252

attached base packages: [1] stats graphics grDevices utils
datasets methods base

other attached packages: [1] truncnorm_1.0-8 MASS_7.3-51.4

loaded via a namespace (and not attached): [1] ks_1.13.2
compiler_3.6.0 Matrix_1.2-17 mclust_5.4.7 tools_3.6.0
simIReff_1.0 [7] mvtnorm_1.1-3 KernSmooth_2.23-15 grid_3.6.0 pracma_2.3.3 lattice_0.20-38


Solution

  • This turns out to be a numerically unstable/sensitive problem.

    If you turn on debug(MASS::fitdistr) and step through, eventually you'll get to a line

     if (res$convergence > 0L) stop("optimization failed")
    

    If you print out the value of res at this point you get (slightly abbreviated):

    $par
           mu     sigma
    -6.411168  1.022651
    $value
    [1] -21.72969
    $counts
    function gradient
          81       81
    $convergence
    [1] 52
    $message
    [1] "ERROR: ABNORMAL_TERMINATION_IN_LNSRCH"
    

    in other words, the L-BFGS-B optimizer (which is used because you specified bounds - it's quite finicky) thinks there is a problem, and fitdistr accordingly throws an error. As far as I can tell there is no way to tell fitdistr "just give me the answer anyway".

    I tried a bunch of different methods (perturbing the starting conditions a little bit, i.e. mu + 1e-3, sigma0 + 1e-3; removing the bounds so that fitdistr uses the more robust Nelder-Mead optimizer instead). Plotting the log10(1e-4 + neg log likelihood) (so that we can see small differences from the minimum negative log-likelihood) gives the following image (code below):

    likelihood surface

    [red: Linux, green: Windows/convergence failure, blue: Windows/perturbed start, cyan: Nelder-Mead]

    These are the corresponding fits to the distribution:

    histogram plus curves

    As you can see (or probably can't!), all the fits are basically identical. If you calculate the negative log-likelihood, you'll see that they differ by less than 0.001 units [i.e., negligibly]. (You can also tell this because all the points in the first image lie within the log10(difference) = -3 contour.)

    So the differences between answers don't really matter, just the annoyance of getting an error. You could (1) use a while loop + try() to perturb the starting value a little bit until you get an answer; (2) drop the bounds to allow Nelder-Mead to work: (3) use bbmle or some other tool that lets you be a little bit more robust/defensive about the optimization procedure ...


    nllfun <- function(mu, sigma) {
      -sum(log(dtruncnorm(x, a = 0, b = 1, mean = mu, sd = sigma)))
    }
    library(emdbook)
    library(truncnorm)
    p1 <- c(-7.02938981, 1.06779942) ## Linux
    p2 <- c(-6.411, 1.022651)  ## Windows (convergence error)
    p3 <- c(-6.587645, 1.0359466) ## Windows (perturbed start)
    p4 <- c(-5.9937989, 0.9901366) ## Windows (Nelder-Mead/no bounds)
    cc <- curve3d(nllfun(x,y), xlim = c(-7.1, -5.98), ylim = c(0.98, 1.07),
                  n = c(101, 101), sys3d = "none")
    
    image(cc$x, cc$y, log10(cc$z-min(cc$z) + 1e-4))
    contour(cc$x, cc$y, log10(cc$z-min(cc$z) + 1e-4), add = TRUE)
    points(p1[1], p1[2], pch = 16, col = 2)
    points(p2[1], p2[2], pch = 17, col = 3)
    points(p3[1], p3[2], pch = 18, col = 4)
    points(p4[1], p4[2], pch = 18, col = 5)
    
    hist(x, freq=FALSE)
    curve(dtruncnorm(x, a=0, b=1, mean=p1[1], sd = p1[2]), col = 2, add=TRUE)
    curve(dtruncnorm(x, a=0, b=1, mean=p2[1], sd = p2[2]), col = 3, add=TRUE)
    curve(dtruncnorm(x, a=0, b=1, mean=p3[1], sd = p3[2]), col = 4, add=TRUE)
    curve(dtruncnorm(x, a=0, b=1, mean=p4[1], sd = p4[2]), col = 5, add=TRUE)