r, regression, linear-regression, glm

How do I interpret `NA` coefficients from a GLM fit with the quasipoisson family?


I'm fitting a model in R using the quasipoisson family like this:

model <- glm(y ~ 0 + log_a + log_b + log_c + log_d + log_gm_a + 
   log_gm_b + log_gm_c + log_gm_d, family = quasipoisson(link = 'log'))

glm finds values for the first five coefficients. It says the others are NA. Interestingly, if I reorder the variables in the formula, glm always finds coefficients for the five variables that appear first in the formula.

There is sufficient data (the number of rows is many times the number of parameters).

How should I interpret those NA coefficients?

The author of the model I'm implementing insists that the NAs imply that the found coefficients are 0, but the NA-coefficient variables are still acting as controls over the model. I suspect something else is going on.


Solution

  • My guess is that the author (who says "the NAs imply that the found coefficients are 0, but the NA-coefficient variables are still acting as controls over the model") is wrong (although it's hard to be 100% sure without having the full context).

    The problem is almost certainly that some of your predictors are multicollinear. Which variables get dropped (i.e. are returned with NA coefficients) changes when you reorder the formula because R partly uses the order of the terms to decide which redundant columns to drop. As far as the fitted model goes, it doesn't matter: all of the top-level results (predictions, goodness of fit, etc.) are identical.
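The order dependence described above can be reproduced with a small simulation. This is only a sketch on made-up data: the variables `x1`, `x2`, `x3` and the coefficients used to generate `y` are assumptions for illustration, not the OP's data. `x3` is built as an exact linear combination of the other two, so one column of the model matrix is redundant.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2   # exact linear combination of x1 and x2: a redundant column
y  <- rpois(n, lambda = exp(0.5 + 0.3 * x1 - 0.2 * x2))

# Same predictors, different order in the formula
fit_a <- glm(y ~ x1 + x2 + x3, family = quasipoisson(link = "log"))
fit_b <- glm(y ~ x3 + x2 + x1, family = quasipoisson(link = "log"))

coef(fit_a)   # x3 (listed last) is NA
coef(fit_b)   # x1 (listed last) is NA

# The model matrix has more columns than its rank
ncol(model.matrix(fit_a))   # 4 columns, including the intercept
fit_a$rank                  # 3

# The fitted model itself does not depend on which column was dropped
all.equal(fitted(fit_a), fitted(fit_b))
all.equal(deviance(fit_a), deviance(fit_b))
```

Comparing `ncol(model.matrix(fit))` with `fit$rank` is a quick way to confirm that NA coefficients come from rank deficiency rather than from a lack of data.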

    In comments the OP says:

    The relationship between log_a and log_gm_a is that this is a multiplicative fixed-effects model. So log_a is the log of predictor a. log_gm_a is the log of the geometric mean of a. So each of the log_gm terms is constant across all observations.

    This is the key information needed to diagnose the problem. Because the intercept is excluded from this model (the formula contains 0 +), having one constant column in the model matrix is OK, but multiple constant columns are trouble: all but the first (in whatever order is specified by the formula) will be discarded. To go slightly deeper: the model requested is

    Y = b1*C1 + b2*C2 + b3*C3 + [additional terms]
    

    where C1, C2, C3 are constants. At the point in "data space" where the additional terms are 0 (i.e. for cases where log_a = log_b = log_c = ... = 0), we're left with predicting a constant value from three separate constant terms. Suppose that the intercept in a regular model (~ 1 + log_a + log_b + log_c) would have been m. Then any combination of (b1, b2, b3) that makes b1*C1 + b2*C2 + b3*C3 equal to m (and there are infinitely many) will fit the data equally well.
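The argument about constant columns can be checked numerically. The sketch below uses simulated data with a single varying predictor; the constants `C1`, `C2`, `C3` and the columns `gm1`, `gm2`, `gm3` are hypothetical stand-ins for the log geometric means, not values from the OP's model.

```r
set.seed(2)
n     <- 300
log_a <- rnorm(n)
y     <- rpois(n, lambda = exp(1 + 0.4 * log_a))

# Hypothetical constants standing in for the log geometric means
C1 <- 0.7; C2 <- 1.3; C3 <- 2.1
gm1 <- rep(C1, n); gm2 <- rep(C2, n); gm3 <- rep(C3, n)

# Regular model with an intercept m
fit_int <- glm(y ~ 1 + log_a, family = quasipoisson(link = "log"))
m <- unname(coef(fit_int)["(Intercept)"])

# No-intercept model with three constant columns
fit_0 <- glm(y ~ 0 + log_a + gm1 + gm2 + gm3,
             family = quasipoisson(link = "log"))
coef(fit_0)   # gm1 is estimated; gm2 and gm3 are NA

# The surviving constant column absorbs the intercept: b1 * C1 matches m
all.equal(unname(coef(fit_0)["gm1"]) * C1, m)

# Two of the infinitely many (b1, b2, b3) with b1*C1 + b2*C2 + b3*C3 = m
consts  <- c(C1, C2, C3)
split_1 <- c(m / C1, 0, 0)
split_2 <- c(0, m / (2 * C2), m / (2 * C3))
c(sum(split_1 * consts), sum(split_2 * consts), m)   # same value three times
```

Because both splits produce the same constant contribution to the linear predictor, the data cannot distinguish between them, which is why `glm` keeps one constant column and reports NA for the rest.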

    I still don't know much about the context, but it might be worth considering adding the constant terms as offsets in the model. Alternatively, you could scale the predictors by their geometric means, i.e. subtract the log geometric means from the log predictors.
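A minimal sketch of both suggestions, on simulated data with two predictors. Everything here is an assumption for illustration: the data, the variable names, and in particular the choice to keep an intercept and to enter the constants with a fixed coefficient of 1 in the offset. Whether either option matches the intended multiplicative fixed-effects model depends on context that is not given in the question.

```r
set.seed(3)
n <- 300
a <- rlnorm(n)
b <- rlnorm(n)
log_a <- log(a)
log_b <- log(b)

# Log of the geometric mean equals the mean of the logs; constant per column
log_gm_a <- rep(mean(log_a), n)
log_gm_b <- rep(mean(log_b), n)

y   <- rpois(n, lambda = exp(0.5 + 0.6 * log_a - 0.3 * log_b))
fam <- quasipoisson(link = "log")

# Reference: raw log predictors with an intercept
fit_raw <- glm(y ~ log_a + log_b, family = fam)

# Option 1: subtract the log geometric means (scale a and b by their geometric means)
log_a_c <- log_a - log_gm_a
log_b_c <- log_b - log_gm_b
fit_c   <- glm(y ~ log_a_c + log_b_c, family = fam)

# Option 2: enter the constants as an offset (coefficient fixed at 1)
fit_off <- glm(y ~ log_a + log_b + offset(log_gm_a + log_gm_b), family = fam)

# Neither fit has NA coefficients
anyNA(coef(fit_c))
anyNA(coef(fit_off))

# Slopes agree across all three fits; only the intercept shifts
cbind(raw     = coef(fit_raw)[-1],
      centred = coef(fit_c)[-1],
      offset  = coef(fit_off)[-1])
```

In this setup the constants only move the intercept, so the slope estimates are unchanged and no column of the model matrix is redundant.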


    In other cases, multicollinearity arises from unidentifiable interaction terms; nested variables; attempts to include all the levels of multiple categorical variables; or including the proportions of all levels of some compositional variable (e.g. proportions of habitat types, where the proportions add up to 1) in the model, e.g.