Search code examples
rdistributionfitdistrplus

Error code 100 using fitdist in the fitdistrplus package for pois and nbinom


I am trying to fit a distribution to my data (ssdist; ssi is the values), however I keep getting thrown the error code below, and I have no idea why. Notably, the error is only thrown for pois and nbinom which are the distributions that I think might fit the data best.

Error in fitdistrplus::fitdist(ss$ssi[ss$name == s], distr = "nbinom") : the function mle failed to estimate the parameters, with the error code 100

Here is my code:

ssdist <- read_excel(here("Data/Raw/ssdist.xlsx")) %>%
  verify(has_all_names("name", "ssi")) %>% 
  assert(name) %>% 
  assert(is.numeric, ssi)

library(fitdistrplus)

par0 <- par(mfrow = c(3,2))
for(s in unique(ssdist$name)) {
  fitdistrplus::descdist(ss$ssi[ss$name == s], discrete = TRUE)
  title(sub = s)
}
par(par0)

This is the code block giving me issues

    par0 <- par(mfrow = c(2,2))
for(s in unique(ssdist$name)) {
  f <- fitdistrplus::fitdist(ss$ssi[ss$name == s], distr = "nbinom")
  fitdistrplus::cdfcomp(f, main = "Cumulative distribution: Data vs. theoretical")
  title(sub = s)
  mtext(side = 3, line = 0,
        text = paste0("p-value: ", round(fitdistrplus::gofstat(f)$chisqpvalue, 4)))
}
par(par0)

Here is a link to my data (I was unsure how to post it here): https://docs.google.com/spreadsheets/d/1B-9sygnfDd8rUjyN3ZO6DwpvMO95FLMi02RlKpDC7Ig/edit?usp=sharing


Solution

  • Too long for a comment.

    Lot's of problems here. The error you're getting is just the tip of the iceberg, so to speak.

    The default method in fitdist(...) is maximum likelihood estimation (mle). fitdist(...) uses the optim(...) function to do this (actually to minimize the negative log likelihood, which amounts to the same thing). This error is coming from optim(...) and means the minimization failed. It's pretty common when either the choice of distribution is poor, or the initial values for the distribution parameters are poorly chosen. In your case it's a bit of both.

    Your data has non-integer values. The negative binomial distribution has support on the positive integers (incl 0), so by definition the probability density for non-integer x is 0. If you read the documentation on dnbinom(...) you'll see that.

    A second problem is that you data has very long tails:

    library(data.table)
    library(ggplot2)
    setDT(ss)
    ggplot(ss, aes(x=ssi)) + 
      geom_histogram(aes(fill=name), color='grey80') + 
      scale_y_continuous(breaks=2*(0:5))+
      scale_fill_discrete(guide='none')+
      facet_wrap(~name)
    

    enter image description here

    so it's doubtful that any distribution in the exponential family would provide an adequate fit. One possibility is log normal, but this distribution has support on the positive reals, and your data has zeros.

    Another possibility is the Weibull distribution:

    get.params <- function(x) {
      start  <- list(shape=1, scale=1)
      f      <- fitdist(x, distr = dweibull, start=start, method='mge')
      return(c(as.list(f$estimate)))
    }
    params <- ss[, get.params(ssi), by=.(name)]
    

    Which seems to do a decent job dealing with the long tails:

    ss[params, c('shape', 'scale'):=.(i.shape, i.scale), on=.(name)]
    setorder(ss, name, ssi)
    ss[, sample:=quantile(ssi, probs = ppoints(.N)), by=.(name)]
    ss[, theoretical:=qweibull(ppoints(.N), shape, scale), by=.(name)]
    ss[, dens:=dweibull(ssi, shape, scale), by=.(name)]
    ss[, smpl.CDF:=ecdf(ssi)(ssi), by=.(name)]
    ss[, theor.CDF:=pweibull(ssi, shape, scale), by=.(name)]
    ggplot(ss, aes(x=ssi))+
      geom_line(aes(y=theor.CDF, color=name))+
      geom_point(aes(y=smpl.CDF), shape=21)+
      scale_color_discrete(guide='none')+
      labs(title='CDF Plots')+
      facet_wrap(~name)
    

    enter image description here

    But even here you really should look at the q-q plots:

    ggplot(ss, aes(x=theoretical, y=sample))+
      geom_point(aes(color=name))+
      geom_abline(color='blue', linetype='dashed')+
      scale_color_discrete(guide='none')+
      labs(title='QQ Plots')+
      facet_wrap(~name, scales = 'free_x')
    

    enter image description here

    which really highlight the extreme tails. IMO the chi-sq gof metrics are pretty useless in this situation.