I am trying to fit a distribution to my data (ssdist
; ssi
is the values), however I keep getting thrown the error code below, and I have no idea why. Notably, the error is only thrown for pois
and nbinom
which are the distributions that I think might fit the data best.
Error in fitdistrplus::fitdist(ss$ssi[ss$name == s], distr = "nbinom") : the function mle failed to estimate the parameters, with the error code 100
Here is my code:
ssdist <- read_excel(here("Data/Raw/ssdist.xlsx")) %>%
verify(has_all_names("name", "ssi")) %>%
assert(name) %>%
assert(is.numeric, ssi)
library(fitdistrplus)
par0 <- par(mfrow = c(3,2))
for(s in unique(ssdist$name)) {
fitdistrplus::descdist(ss$ssi[ss$name == s], discrete = TRUE)
title(sub = s)
}
par(par0)
This is the code block giving me issues
par0 <- par(mfrow = c(2,2))
for(s in unique(ssdist$name)) {
f <- fitdistrplus::fitdist(ss$ssi[ss$name == s], distr = "nbinom")
fitdistrplus::cdfcomp(f, main = "Cumulative distribution: Data vs. theoretical")
title(sub = s)
mtext(side = 3, line = 0,
text = paste0("p-value: ", round(fitdistrplus::gofstat(f)$chisqpvalue, 4)))
}
par(par0)
Here is a link to my data (I was unsure how to post it here): https://docs.google.com/spreadsheets/d/1B-9sygnfDd8rUjyN3ZO6DwpvMO95FLMi02RlKpDC7Ig/edit?usp=sharing
Too long for a comment.
Lot's of problems here. The error you're getting is just the tip of the iceberg, so to speak.
The default method in fitdist(...)
is maximum likelihood estimation (mle). fitdist(...)
uses the optim(...)
function to do this (actually to minimize the negative log likelihood, which amounts to the same thing). This error is coming from optim(...)
and means the minimization failed. It's pretty common when either the choice of distribution is poor, or the initial values for the distribution parameters are poorly chosen. In your case it's a bit of both.
Your data has non-integer values. The negative binomial distribution has support on the positive integers (incl 0), so by definition the probability density for non-integer x is 0. If you read the documentation on dnbinom(...)
you'll see that.
A second problem is that you data has very long tails:
library(data.table)
library(ggplot2)
setDT(ss)
ggplot(ss, aes(x=ssi)) +
geom_histogram(aes(fill=name), color='grey80') +
scale_y_continuous(breaks=2*(0:5))+
scale_fill_discrete(guide='none')+
facet_wrap(~name)
so it's doubtful that any distribution in the exponential family would provide an adequate fit. One possibility is log normal, but this distribution has support on the positive reals, and your data has zeros.
Another possibility is the Weibull distribution:
get.params <- function(x) {
start <- list(shape=1, scale=1)
f <- fitdist(x, distr = dweibull, start=start, method='mge')
return(c(as.list(f$estimate)))
}
params <- ss[, get.params(ssi), by=.(name)]
Which seems to do a decent job dealing with the long tails:
ss[params, c('shape', 'scale'):=.(i.shape, i.scale), on=.(name)]
setorder(ss, name, ssi)
ss[, sample:=quantile(ssi, probs = ppoints(.N)), by=.(name)]
ss[, theoretical:=qweibull(ppoints(.N), shape, scale), by=.(name)]
ss[, dens:=dweibull(ssi, shape, scale), by=.(name)]
ss[, smpl.CDF:=ecdf(ssi)(ssi), by=.(name)]
ss[, theor.CDF:=pweibull(ssi, shape, scale), by=.(name)]
ggplot(ss, aes(x=ssi))+
geom_line(aes(y=theor.CDF, color=name))+
geom_point(aes(y=smpl.CDF), shape=21)+
scale_color_discrete(guide='none')+
labs(title='CDF Plots')+
facet_wrap(~name)
But even here you really should look at the q-q plots:
ggplot(ss, aes(x=theoretical, y=sample))+
geom_point(aes(color=name))+
geom_abline(color='blue', linetype='dashed')+
scale_color_discrete(guide='none')+
labs(title='QQ Plots')+
facet_wrap(~name, scales = 'free_x')
which really highlight the extreme tails. IMO the chi-sq gof metrics are pretty useless in this situation.