r linear-regression

R glm summary lists every value of independent variable


I'm running a glm in R on a data frame with 2 variables.

str(INV)
'data.frame':   5614 obs. of  2 variables:
 $ MSACode: Factor w/ 70 levels "40","80","440",..: 37 64 58 56 66 14 38 37 66 14 ...
 $ NotPaid: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

The code I used to run it:

GlmModel <- glm(NotPaid ~ MSACode,family=binomial(link="logit"),data=training)
print(summary(GlmModel))

The summary shows a separate coefficient row for each individual MSACode value rather than a single row for the variable.

> print(summary(GlmModel))

Call:
glm(formula = NotPaid ~ MSACode, family = binomial(link = "logit"), 
    data = training)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9728  -0.8352  -0.6501   0.9346   2.8245  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.657e+01  1.697e+03  -0.010    0.992
MSACode80    1.462e+01  1.697e+03   0.009    0.993
MSACode440  -7.494e-07  1.924e+03   0.000    1.000
MSACode520   1.547e+01  1.697e+03   0.009    0.993
MSACode640   1.587e+01  1.697e+03   0.009    0.993
MSACode720   1.477e+01  1.697e+03   0.009    0.993
MSACode870   1.657e+01  1.697e+03   0.010    0.992
MSACode1080  1.455e+01  1.697e+03   0.009    0.993

I don't understand these results - why is it showing each MSACode value separately? Thanks.


Solution

  • I'm sure this is a duplicate, but can't find it.

    The problem is that MSACode is a factor (possibly because some value in that column of the input file couldn't be interpreted as numeric), so R assumes you want to treat it as a categorical rather than a continuous predictor. It therefore fits n-1 parameters (where n is the number of levels, one level being absorbed into the intercept as the baseline) rather than a single slope to describe its effect. With the 70 levels shown in your str() output, that would be 69 MSACode coefficients. You can convert it back to numeric by:

    INV <- transform(INV, 
        MSACode = as.numeric(as.character(MSACode)))
    

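    and then re-run your model. The call in the question fits on training rather than INV, so if those are separate data frames, apply the same conversion to training as well. (This post explains why we need as.numeric(as.character(.)) rather than as.numeric(), and explains that as.numeric(levels(f))[f] is more efficient, although I rarely bother worrying about that level of efficiency.)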
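
As a rough illustration of the two points above, here is a minimal sketch using a toy factor. The vector f and the other variable names are made up for this example (only the label values are borrowed from the MSACode levels in the question), so treat it as a demonstration of the behaviour rather than a reproduction of the original data.

```r
# Toy factor whose labels look like numbers (hypothetical data, not the asker's)
f <- factor(c("440", "40", "80", "440"))

# as.numeric() on a factor returns the internal integer level codes,
# not the numbers printed as labels
codes <- as.numeric(f)

# Going through character first parses the labels themselves
values <- as.numeric(as.character(f))   # 440 40 80 440

# Equivalent result, but each distinct level is converted only once
values_fast <- as.numeric(levels(f))[f]

# A factor with n levels expands to n - 1 dummy columns plus the intercept,
# whereas a numeric predictor contributes a single slope column
n_cols_factor  <- ncol(model.matrix(~ f))        # 3 = intercept + 2 dummies
n_cols_numeric <- ncol(model.matrix(~ values))   # 2 = intercept + 1 slope
```

Comparing codes with values shows why a bare as.numeric() would silently replace the MSA codes with level indices, and the two column counts mirror the difference between the long coefficient table in the question and the single MSACode row you get after conversion.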