Search code examples
rinteractiondummy-variable

NA values when regressing with dummy variable interaction term


I'm trying to estimate factors that determine the difference in happiness level between people living in New York and Chicago.

Data looks like below.

  Happiness     City Gender Employment   Worktype      Holiday
1        60 New York      0        0     Unemployed   Unemployed
2        80  Chicago      1        1     Whitecolor 1 day a week
3        39  Chicago      0        0     Unemployed   Unemployed
4        40 New York      1        0     Unemployed   Unemployed
5        69  Chicago      1        1     Bluecolor  2 day a week
6        90  Chicago      1        1     Bluecolor  2 day a week
7       100 New York      0        1     Whitecolor 2 day a week
8        30 New York      1        1     Whitecolor 1 day a week

Happiness level is dependent variable, and 'city' is where the person lives. 'Gender' is coded 0 = man 1 = woman. 'Employment' is 0 = Unemployed and 1 = Employed. 'Worktype' is three level factor: 'Unemployed', 'Whitecolor', 'Bluecolor'. 'Holiday' is how many days a person rest in a week. Here 'City', 'Gender', 'Worktype' and 'Holiday' variables are all factors. 'Happiness' and 'Employment' variable types are numerical.

The Model I want to estimate is

lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))

I left 'Employment' value as numerical value so if 'Employment' is equal to 0(Unemployed), 0:(Worktype + Holiday) = 0, so the model is automatically reduced to

lm(Happiness ~ City + Gender)

for unemployed people.

However, regression result returns NA values.

Coefficients: (2 not defined because of singularities)
                               Estimate Std. Error t value Pr(>|t|)
(Intercept)                       56.75      23.56   2.408    0.138
CityNew York                     -14.50      27.21  -0.533    0.647
Gender1                           -2.25      35.99  -0.063    0.956
Employment:WorktypeBluecolor      25.00      43.02   0.581    0.620
Employment:WorktypeUnemployed        NA         NA      NA       NA
Employment:WorktypeWhitecolor     57.75      35.99   1.604    0.250
Employment:Holiday1 day a week   -50.00      54.42  -0.919    0.455
Employment:Holiday2 day a week       NA         NA      NA       NA

this seems to be due to 'Unemployment' value in 'Worktype' and 'Holiday' variable. However, I am not sure why R is not treating Employment:WorktypeUnemployed which is obviously 0:Worktype = 0 as zero and not removing it from the model. Is this because R is setting Employment:HolidayUnemployed as a baseline and both are perfectly multicollinear? (I had to put 'Unemployed' value for 'Worktype' and 'Holiday' because I wanted to see the effect of 'Worktype' and 'Holiday' compared to 'Unemployed' people. If I remove 'Unemployed' value NA disappears, but baseline will be 'Whitecolor' and '1day a week' so I cannot see the effect compared to 'unemployed'.)

If so, Why am I getting NA for coefficients for 'Employement:Holiday2 day a week'? It seems that it has nothing to do with 'Unemployed' value.

Can I rely on this result while just removing NA coefficients?

below are reproducible code.

Happiness <- c(60, 80, 39, 40, 69, 90, 100, 30)

City <- as.factor(c("New York", "Chicago", "Chicago", "New York", "Chicago",         
                  "Chicago", "New York", "New York"))
Gender <- as.factor(c(0, 1, 0, 1, 1, 1, 0, 1)) # 0 = man, 1 = woman.
Employment <- c(0,1, 0, 0, 1 ,1 , 1 , 1) # 0 = unemployed, 1 = employed.
Worktype <- as.factor(c("Unemployed", "Whitecolor", "Unemployed",     
          "Unemployed", "Bluecolor", "Bluecolor", "Whitecolor","Whitecolor"))
Holiday <- as.factor(c(0, 1, 0, 0, 2, 2, 2, 1))
levels(Holiday) <- c("Unemployed", "1 day a week", "2 day a week")

data <- data.frame(Happiness, City, Gender, Employment, Worktype, Holiday)

head(data,8)
str(data)

reg <- lm(Happiness ~ City + Gender + Employment:(Worktype + Holiday))
summary(reg)

Solution

  • You shouldn't worry about the NA values for Employment:WorktypeUnemployed. R tries automatically to compute all the interactions, but that particular coefficient remains undetermined because, clearly, it is never the case that Employment=1 and Worktype="Unemployed". It does not have any effect on the computations of the other coefficients: you can verify by manually coding the dummy variables:

    > library(lme4) # for the convenient "dummy" function 
    > data <- data.frame(data, 
    +   dummy(Worktype, c("Bluecolor","Whitecolor")), 
    +   h1=dummy(Holiday)[,1], 
    +   h2=dummy(Holiday)[,2])
    >   
    > reg <- lm(Happiness ~ City + Gender + Employment:Bluecolor + Employment:Whitecolor  + Employment:h1 + Employment:h2 , data)
    > summary(reg)
    
    Call:
    lm(formula = Happiness ~ City + Gender + Employment:Bluecolor + 
        Employment:Whitecolor + Employment:h1 + Employment:h2, data = data)
    
    Residuals:
             1          2          3          4          5          6          7          8 
     1.775e+01  1.775e+01 -1.775e+01  8.882e-16 -1.050e+01  1.050e+01  4.441e-15 -1.775e+01 
    
    Coefficients: (1 not defined because of singularities)
                          Estimate Std. Error t value Pr(>|t|)
    (Intercept)              56.75      23.56   2.408    0.138
    CityNew York            -14.50      27.21  -0.533    0.647
    Gender1                  -2.25      35.99  -0.063    0.956
    Employment:Bluecolor     25.00      43.02   0.581    0.620
    Employment:Whitecolor    57.75      35.99   1.604    0.250
    Employment:h1           -50.00      54.42  -0.919    0.455
    Employment:h2               NA         NA      NA       NA
    
    Residual standard error: 27.21 on 2 degrees of freedom
    Multiple R-squared:  0.6798,    Adjusted R-squared:  -0.1208 
    F-statistic: 0.8491 on 5 and 2 DF,  p-value: 0.619
    

    The estimated coefficients are identical even though Employment:WorktypeUnemployed is not present anymore.

    However, the NA values are still present for Employment:h2 (equivalent to Employment:Holiday2 day a week). This seems due to the fact that in this reduced dataset you end up with a singular model matrix (i.e. one column is a linear combination of other columns)

    > solve(crossprod(model.matrix(reg)))
    Error in solve.default(crossprod(model.matrix(reg))) : 
      system is computationally singular: reciprocal condition number = 1.79897e-18
    

    So this issue may not be present with a larger dataset. Eventually, you could try to drop any redundancy in the model (e.g., are there any employed with 0 days per week of holiday? if not then 1 day should be the baseline, and you would add extra columns to code for days of holiday >1). You can use the alias() function to check which term is giving the issue.