
How to work out theta value of my data for use in negative binomial GLM?


I am trying to fit a GLM to a count dataset and have found that my data are overdispersed, so a Poisson GLM is not suitable. I am aware that I should use a negative binomial GLM instead, which requires a theta value. However, when I try to run a summary of my model I get the series of errors below, and it can't find the theta value. Any help on this would be appreciated. Below are a summary of my dataset, the code used to produce a summary of the model, and the errors.

Dataset summary:

[screenshot of the dataset summary; reproduced as text below]

The variables used in the GLM are Total (count data) and Treatment (a letter representing different treatments, e.g. C, M, F).

Code used to produce theta:

    summary(m1 <- glm.nb(Total ~ Treatment, data = twohour))

Output of this code with errors at the bottom:

[screenshot of the model output and errors; reproduced as text below]

Any help with producing a theta value would be greatly appreciated. Thanks in advance.

As requested, the summary and model outputs as text:

summary:

> summary(twohour)


      Treatment            Length                ID            Block1          Block2         Fertility         Notes               Total      
     Length:252         Length:252         Min.   : 1.00   Min.   :  0.0   Min.   :  0.00   Min.   :0.0000   Length:252         Min.   :  0.0  
     Class :character   Class :character   1st Qu.:10.00   1st Qu.:125.8   1st Qu.: 39.50   1st Qu.:1.0000   Class :character   1st Qu.:172.2  
     Mode  :character   Mode  :character   Median :19.50   Median :154.0   Median :104.50   Median :1.0000   Mode  :character   Median :263.0  
                                           Mean   :19.89   Mean   :143.5   Mean   : 94.66   Mean   :0.9683                      Mean   :238.1  
                                           3rd Qu.:30.00   3rd Qu.:179.2   3rd Qu.:146.00   3rd Qu.:1.0000                      3rd Qu.:309.5  
                                           Max.   :40.00   Max.   :227.0   Max.   :228.00   Max.   :1.0000                      Max.   :434.0  

model output:

    Call:
    glm.nb(formula = Total ~ Treatment, data = twohour, init.theta = 2055605.705, 
        link = log)
    
    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -23.001   -4.624    1.650    4.567   12.571  
    
    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
    (Intercept)   5.577987   0.009846 566.534  < 2e-16 ***
    TreatmentC   -0.102625   0.014394  -7.130 1.01e-12 ***
    TreatmentF   -0.154580   0.014396 -10.737  < 2e-16 ***
    TreatmentF30 -0.298972   0.019920 -15.008  < 2e-16 ***
    TreatmentM   -0.158733   0.014613 -10.862  < 2e-16 ***
    TreatmentM30 -0.044795   0.013992  -3.201  0.00137 ** 
    TreatmentMxF -0.105191   0.014211  -7.402 1.34e-13 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for Negative Binomial(2055606) family taken to be 1)
    
        Null deviance: 15127  on 251  degrees of freedom
    Residual deviance: 14799  on 245  degrees of freedom
    AIC: 16542
    
    Number of Fisher Scoring iterations: 1

    Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L,  : 
      invalid 'nsmall' argument
    In addition: Warning messages: 
    1: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace >  : 
      iteration limit reached 
    2: In sqrt(1/i) : NaNs produced 
    3: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace >  : 
      iteration limit reached 
    4: In sqrt(1/i) : NaNs produced

Solution

  • tl;dr I suspect this is driven by outliers, especially (??) some zero values that are inconsistent with the rest of the data set. If the zero values aren't errors/weird cases, you might consider a zero-inflated model ...???

    We probably need your data to know for sure what's going on. Here's what I can gather so far:

    • some aspects of your results look like underdispersion (ridiculously large theta estimate, "iteration limit reached" warnings) ...
    • ... yet I agree that your data seem to be overdispersed (large ratio of residual deviance to residual df; range from 0 to 434 with a mean of 238)
    • ... the extreme range of the deviance residuals (-23 to +12) suggests outliers (the deviance residuals are essentially on the log scale ...)
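    One quick way to quantify that overdispersion is to fit the Poisson model first and compare its residual deviance to its residual degrees of freedom (a sketch; `twohour` is the asker's data frame, which I don't have, so the expected ratio is read off the quoted output rather than computed here):

```r
## Poisson fit to the original data (assumes the asker's `twohour` data frame)
m0 <- glm(Total ~ Treatment, family = poisson, data = twohour)
## dispersion ratio: roughly 1 for genuinely Poisson data, >> 1 = overdispersed
deviance(m0) / df.residual(m0)
## from the quoted NB output, 14799 on 245 df gives a ratio of about 60,
## i.e. badly overdispersed
```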

    I can get most of the way by constructing a data set that's mostly Poisson, but with a few extreme outliers:

    n <- 252     ## total number of obs
    ng <- 7      ## number of groups/treatments
    mu <- exp(6)    ## mean response
       ## NOTE: this doesn't match your data, I did it accidentally,
       ##  but it does reproduce the errors.
    set.seed(101)
    dd <- data.frame(
        ## mostly Poisson, but with 2 values at the min and max values
        y = c(rpois(n-4, lambda=mu), rep(c(0,434), each=2)),
        f = factor(rep(1:ng, length.out = n))
    )
    summary(dd)
    library(MASS)
    m2 <- glm.nb(y~f, data = dd)
    

    (the zero values look like the largest problem. I can reproduce the problem with 2 (but not 1) zero outliers, with the rest of the data Poisson with a large mean ...)

    Warning messages:
    1: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace >  :
      iteration limit reached
    2: In sqrt(1/i) : NaNs produced
    3: In theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace = control$trace >  :
      iteration limit reached
    4: In sqrt(1/i) : NaNs produced
    

    Results:

    Call:
    glm.nb(formula = y ~ f, data = dd, init.theta = 12474197.56, 
        link = log)
    
    Deviance Residuals: 
         Min        1Q    Median        3Q       Max  
    -28.1363   -0.4887    0.0444    0.7153    3.4771  
    
    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  6.0005206  0.0082958 723.319  < 2e-16 ***
    f2           0.0006192  0.0117302   0.053  0.95790    
    f3           0.0015129  0.0117276   0.129  0.89736    
    f4          -0.0328793  0.0118297  -2.779  0.00545 ** 
    f5          -0.0195274  0.0117898  -1.656  0.09766 .  
    f6           0.0068583  0.0117120   0.586  0.55816    
    f7          -0.0087784  0.0117579  -0.747  0.45531    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for Negative Binomial(12474198) family taken to be 1)
    
        Null deviance: 1837.2  on 251  degrees of freedom
    Residual deviance: 1820.0  on 245  degrees of freedom
    AIC: 3795.6
    
    Number of Fisher Scoring iterations: 1
    

    Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width, 3L, : invalid 'nsmall' argument

    A little digging shows that this particular error arises because the standard error of the theta estimate can't be computed (it's NaN) ...
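    You can trigger the same failure directly by calling `MASS::theta.ml()` by hand on the Poisson fitted values (a sketch, reusing the simulated `dd` from above):

```r
library(MASS)
## Poisson fit to the simulated data; its fitted values play the role of mu
m0 <- glm(y ~ f, family = poisson, data = dd)
## ML estimate of theta: hits the iteration limit, and the standard error
## attribute comes out NaN, which is what breaks summary()'s number formatting
theta.ml(dd$y, fitted(m0), limit = 10)
```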

    Looking at the diagnostics (plot(m2)) shows the outliers clearly:

    [diagnostic plot for the NB fit: observations 249 and 250 are big outliers]

    The following works fine (more or less: it gives ridiculous theta estimates because the data are not overdispersed once zero-inflation is accounted for).

    library(pscl)
    zeroinfl(y~f, dist="negbin",data = dd)
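    As a sanity check (my own addition, still on the simulated `dd`): dropping the zero observations and refitting is one way to confirm they are the culprit. The remaining data are essentially Poisson, so the theta estimate should be driven to a very large value rather than the fit breaking:

```r
## refit without the zero outliers (hypothetical check on the simulated data)
dd2 <- subset(dd, y > 0)
m3 <- glm.nb(y ~ f, data = dd2)
## theta is stored on the fitted object; expect it to be huge here, since
## without the zeros the data are close to Poisson (theta -> infinity)
m3$theta
```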