Search code examples
rregressionlinear-regression

What is causing this error? Coefficients not defined because of singularities


I'm trying to find a model for my data but I get the message "Coefficients: (3 not defined because of singularities)" These occur for winter, large and high_flow

I found this: https://stats.stackexchange.com/questions/13465/how-to-deal-with-an-error-such-as-coefficients-14-not-defined-because-of-singu

which said it may be incorrect dummy variables, but I've checked that none of my columns are duplicates.

when I use the function alias() I get:

Model :
S ~ A + B + C + D + E + F + G + spring + summer + autumn + winter + small + medium + large + low_flow + med_flow + high_flow

Complete :
          (Intercept) A  B  C  D  E  F  G  spring summer autumn small medium
winter     1           0  0  0  0  0  0  0 -1     -1     -1      0     0    
large      1           0  0  0  0  0  0  0  0      0      0     -1    -1    
high_flow  1           0  0  0  0  0  0  0  0      0      0      0     0    
          low_flow med_flow
winter     0        0      
large      0        0      
high_flow -1       -1      

columns A-H of my data contain numeric values the remaining columns take 0 or 1, and I have checked there are no conflicting values (i.e. if spring = 1 for a case, autumn=summer=winter=0)

model_1 <- lm(S ~ A+B+C+D+E+F+G+spring+summer+autumn+winter+small+medium+large+low_flow+med_flow+high_flow, data = trainOne)
summary(model_1)

Can someone explain the error please?

EDIT: example of my data before I changed it to binary

season  size   flow  A  B   C   D   E   F   G  S
spring small  medium 52 72 134  48 114 114 142 11
autumn small  medium 43 21  98 165 108  23  60 31
spring medium medium 41 45 161  86 177 145  32 12
autumn large  medium 40 86 132  80  82 138 186 16
winter medium  high  49 32 147 189 125  43 144 67
summer large   high  43  9 158  64  14 146  15 71

Solution

  • @JuliusVainora has already given you a good explanation of how the error occurs, which I will not repeat. However, Julius' answer is only one method and might not be satisfying if you don't understand that there really is a value for cases where winter = 1, large=1 and high_flow=1. It can readily be seen in the display as the value for "(Intercept)". You might be able to make the result more interpretable by adding +0 to your formula. (Or it might not, depending on the data situation.)

    However, I think that you really should re-examine how your coding of categorical variables is done. You are using a method of one dummy variable per level that you are copying from some other system, perhaps SAS or SPSS? That's going to predictably cause problems for you in the future, as well as being a painful method to code and maintain. R's data.frame function already automagically creates factor's that encode multiple levels in a single variable. (Read ?factor.) So your formula would become:

     S ~ A + B + C + D + E + F + G + season + size + flow
    

    I think SAS and maybe SPSS uses a system where the global mean is used as the reference level. (Could be wrong about that. Haven't used them since 1997.) In that situation there would be coefficients for the contrasts of all levels, but you would need to interpret those values a bit differently. You then need to calculate differences in coefficients to arrive at the values of any contrast. By using "treatment contrasts, the coefficients themselves are immediately interpretable.