Search code examples
rcurve-fittingnls

Why do nls and nlsLM work correctly for fitting a Poisson distribution but fail for negative binomial?


I have two artificially generated probability mass distributions, very similar to one another except that one is a Poisson distribution while the other is a negative binomial, with slightly larger variance than the Poisson. I generate them using the R code sample below, then I attempt to re-estimate the initial input parameter values, using either the nls or nlsLM functions:

library(minpack.lm)
library(ggplot2)

# Number of samples (identical for both distributions)
n <- 10000
# Distribution mean (identical for both distributions)
mn <- 10
# Size variable: relevant to negative binomial only; sets the level of
# over-dispersion relative to the Poisson distribution.  Reproduces
# Poisson in the limit that size --> Inf
sz <- 5

# Generate n random samples
psx <- rpois(n, lambda=mn)                    # Poisson
nbx <- rnbinom(n, size=sz, mu=mn)             # negative binomial
# Sort into sample quantiles
psqnt <- unique(sort(psx))                    # Poisson
nbqnt <- unique(sort(nbx))                    # negative binomial
# Generate empirical cdf functions
pscdf <- ecdf(psx)                            # Poisson
pscumdist <- pscdf(psqnt)
nbcdf <- ecdf(nbx)                            # negative binomial
nbcumdist <- nbcdf(nbqnt)
# Place quantiles and cdf into data frame
psdata <- data.frame(q=psqnt, cdf=pscumdist)  # Poisson
nbdata <- data.frame(q=nbqnt, cdf=nbcumdist)  # negative binomial
# Generate estimated starting values that are modified from true values by
# modest amounts
psstart <- list(lambda=0.8*mn)                # Poisson
nbstart <- list(size=0.8*sz, mu=0.8*mn)       # negative binomial

# Plot the sample density functions
pldata <- rbind(data.frame(x=psx, type="Poisson"),
                data.frame(x=nbx, type="Negative Binomial"))
pldata$type <- factor(pldata$type, levels=c("Poisson", "Negative Binomial"))
hst <- ggplot(pldata, aes(x)) +
       geom_histogram(binwidth=1) +
       facet_grid(type ~ .) +
       theme_gray(base_size=18)
print(hst)

# Re-estimate the Poisson distribution parameter, lambda, using either
# nls or nlsLM
print("Starting Poisson fit now...")
#psfit <- nls(cdf ~ ppois(q, lambda), data=psdata, start=psstart, trace=TRUE)
psfit <- nlsLM(cdf ~ ppois(q, lambda), data=psdata, start=psstart, trace=TRUE)
print(coef(psfit))

# Re-estimate the two negative binomial distribution parameters, size and mu,
# using the same technique
print("Starting negative binomial fit now...")
#nbfit <- nls(cdf ~ pnbinom(q, size, mu), data=nbdata, start=nbstart, trace=TRUE)
nbfit <- nlsLM(cdf ~ pnbinom(q, size, mu), data=nbdata, start=nbstart, trace=TRUE)
print(coef(nbfit))

The call to ggplot generates a pair of histograms showing two discrete probability mass distributions that are obviously quite similar: Poisson distribution and negative binomial distribution

Here are the results of running nlsLM (nls also gives quite similar results except that the trace offers slightly less information):

> source('~/Desktop/nls_error.R')
[1] "Starting Poisson fit now..."
It.    0, RSS =   0.369437, Par. =          8
It.    1, RSS = 0.00130718, Par. =     9.8698
It.    2, RSS = 9.26239e-05, Par. =    9.98602
It.    3, RSS = 9.26083e-05, Par. =     9.9856
It.    4, RSS = 9.26083e-05, Par. =     9.9856
  lambda 
9.985601 
[1] "Starting negative binomial fit now..."
It.    0, RSS =        nan, Par. =          4          8
It.    1, RSS = 2.122e-314, Par. =          4          8
Error in numericDeriv(form[[3L]], names(ind), env) : 
  Missing value or an infinity produced when evaluating the model
In addition: Warning messages:
1: In pnbinom(q, size, mu) : NaNs produced
2: In pnbinom(q, size, mu) : NaNs produced
3: In pnbinom(q, size, mu) : NaNs produced
4: In pnbinom(q, size, mu) : NaNs produced
5: In pnbinom(q, size, mu) : NaNs produced
6: In pnbinom(q, size, mu) : NaNs produced

My question: I deliberately constructed the two examples to be as similar as possible, so why does one succeed while the other fails?


Solution

  • That is because you are being cheated by the default argument order in R. From the help page of pnbinom() we see that pnbinom() has the following syntax:

    pnbinom(q, size, prob, mu, lower.tail = TRUE, log.p = FALSE)

    You provide three arguments to pnbinom() in your call

    nls(cdf ~ pnbinom(q, size, mu), data=nbdata, start=nbstart, trace=TRUE)
    

    but even though your parameter is called mu it is the third argument and consequently corresponds to prob in pnbinom(). Since you are using the alternative formulation you need to name the argument to make sure it is interpreted as mu. The following line works as you intended

    > nbfit <- nls(cdf ~ pnbinom(q, size, mu=mu), data=nbdata, start=nbstart, trace=TRUE)
    0.2185854 :  4 8
    0.004568844 :  4.069641 9.972202
    0.0001207377 :  4.921435 9.961606
    3.952388e-05 :  5.068563 9.966108
    3.948957e-05 :  5.071698 9.966222
    3.948957e-05 :  5.071696 9.966224
    

    You could run into problems if nls() tries to force, say, the size to be negative. That could be made more stable by exponentiating your input parameters

    > nbfit <- nls(cdf ~ pnbinom(q, exp(size), 
                   mu=exp(mu)), data=nbdata, start=nbstart2, trace=TRUE)
    0.2971457 :  3.688879 2.079442
    0.2622977 :  0.4969337 2.1664490
    0.00517649 :  1.408688 2.316948
    6.196776e-05 :  1.610170 2.298254
    3.948972e-05 :  1.623637 2.299200
    3.948957e-05 :  1.623675 2.299202  
    

    where nbstart2 is identical to nbstart except the starting parameters are logged.