Singularity in interacting categorical variables in R


I'm trying to estimate a model which has many interacting categorical variables. However, I get singularity errors when I do OLS. I'm trying to figure out why. I think I'm doing something wrong with setting variables in R.

The model is as follows.

Income ~ Gender + Age + Employed:Jobtype + Employed:Workdays + Employed:Position

Here, the dependent variable is Income, and the interacting categorical variables are Employed, Jobtype, Workdays, and Position.

  • Employed is coded 0 = Unemployed, 1 = Employed.
  • Jobtype is coded 0 = Unemployed, 1 = Service, 2 = Salesman.
  • Workdays is coded 0 = Unemployed, 1 = 5 days a week, 2 = 6 days a week, 3 = 7 days a week.
  • Position is coded 0 = Unemployed, 1 = Temporary, 2 = Permanent.

As you can see, the baseline for every categorical variable is 0 = Unemployed. I want the baseline to be 'Unemployed' because I want to see the effect of each interacted variable relative to unemployed people.

I removed the main effect of Employed because I only want to see the interaction effects.

However, when I run the regression, I get many singularities (only in the interaction terms).

The three main questions I have are:

First, for an interacting dummy variable, is there a difference between using a factor variable and using a numeric variable coded 0 and 1?

I searched and learned that, for ordinary estimation, you can simply declare a variable as a factor and R will create the dummy variables automatically, so it is equivalent to coding the numbers manually. In this case, however, the results differ between treating Employed as a factor and treating it as a numeric variable with values 0 and 1. (If I set it as a factor, the number of singular terms increases.)

Second, is it OK to interact a numerically coded dummy variable with a factor variable?

I set Employed as a numeric dummy variable, and it interacts with the factor variables Jobtype, Workdays, and Position. Can this cause problems?

Finally, what are the possible reasons that I'm getting singularity problems?

I'm guessing that setting every variable's baseline to 0 = Unemployed is causing the problem, but I'm not sure. Also, even though 0 = Unemployed is the baseline, the regression output shows interaction terms for the baseline levels. I thought baseline levels were not supposed to appear in the regression table (because they are already absorbed into the intercept). Why is this so?

Below is reproducible code.

Income <- c(100, 150, 20, 30, 40, 60, 70, 50)
Gender <- as.factor(c(0, 1, 1, 1, 0, 1, 0, 1)) # 0 = Man, 1 = Woman
Age <- c(54, 35, 24, 43, 23, 50, 66, 54)
Employed <- c(1, 0, 0, 1, 0, 0, 1, 1) # 0 = Unemployed, 1 = Employed
Jobtype <- as.factor(c(1, 0, 0, 2, 0, 0, 2, 1)) # 0 = Unemployed, 1 = Service, 2 = Salesman
Workdays <- as.factor(c(1, 0, 0, 2, 0, 0, 3, 2)) # 0 = Unemployed, 1 = 5 days a week, 2 = 6 days a week, 3 = 7 days a week
Position <- as.factor(c(1, 0, 0, 2, 0, 0, 1, 1)) # 0 = Unemployed, 1 = Temporary, 2 = Permanent

data <- data.frame(Income, Gender, Age, Employed, Jobtype, Workdays, Position)

reg <- lm(Income ~ Gender + Age + Employed:(Jobtype + Workdays + Position), data = data)
summary(reg) # regression with numerically coded Employed variable.

data$Employed <- as.factor(data$Employed)
reg2 <- lm(Income ~ Gender + Age + Employed:(Jobtype + Workdays + Position), data = data)
summary(reg2) # regression with Employed variable as factor variable.

Solution

  • From your description, it looks like the information in Employed is also contained in Jobtype, Workdays, and Position. You can see this easily by recoding Jobtype to 0 if unemployed and 1 otherwise; the recoded values are identical to the values in Employed:

    ifelse(Jobtype == 0, 0, 1) == Employed
    # [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
    

    Therefore, even without an interaction, there would be singularity issues if you include Employed alongside the other variables. I would suggest simply omitting Employed from the model:

    m1 <- lm(Income ~ Gender + Age + Jobtype + Workdays + Position, data = data)
    summary(m1)
    

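    R still fails to compute coefficients for some factor levels: people who work seven days a week, and those who hold temporary or permanent positions. Part of that is simply because there are not enough observations, and since I assume you have more than 8 observations, that part should start disappearing once you run the model on the entire data set. Note, however, that Jobtype, Workdays, and Position each carry the same "Unemployed" level, so their dummy columns remain linearly dependent on one another; expect some NA coefficients to persist unless you restructure these variables.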
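
To separate what is due to the small sample from what is built into the coding, one can inspect the design matrix directly. The following is a minimal sketch on the question's toy data with unlabeled factors (so the columns are named Jobtype1, Workdays2, and so on); `model.matrix()` and `qr()` are base R, while the variable names `n_cols`, `x_rank`, and `same_indicator` are only illustrative.

```r
# Toy data from the question (8 observations, unlabeled factors)
Gender   <- factor(c(0, 1, 1, 1, 0, 1, 0, 1))
Age      <- c(54, 35, 24, 43, 23, 50, 66, 54)
Jobtype  <- factor(c(1, 0, 0, 2, 0, 0, 2, 1))
Workdays <- factor(c(1, 0, 0, 2, 0, 0, 3, 2))
Position <- factor(c(1, 0, 0, 2, 0, 0, 1, 1))

# Design matrix that lm() would build for the model without Employed
X <- model.matrix(~ Gender + Age + Jobtype + Workdays + Position)

n_cols <- ncol(X)     # coefficients lm() tries to estimate
x_rank <- qr(X)$rank  # coefficients it can actually identify
n_cols - x_rank       # how many coefficients end up NA

# Summing the non-baseline dummies of each factor yields an
# "is employed" indicator; compare the three indicators row by row.
emp_job  <- rowSums(X[, c("Jobtype1", "Jobtype2")])
emp_days <- rowSums(X[, c("Workdays1", "Workdays2", "Workdays3")])
emp_pos  <- rowSums(X[, c("Position1", "Position2")])
same_indicator <- all(emp_job == emp_days, emp_days == emp_pos)
same_indicator
```

If the last check is TRUE, it is so by construction: every factor encodes "Unemployed" as its level 0, so the same dependence would be expected in a larger data set with this coding.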

    Another issue is your use of factor variables. While not wrong, you fall short of actually using the capabilities of R here. I'd suggest that you actually label the data to make your life interpreting the results easier:

    Gender <- factor(c(0, 1, 1, 1, 0, 1, 0, 1), levels = 0:1,
                     labels = c("Man", "Woman"))
    Jobtype <- factor(c(1, 0, 0, 2, 0, 0, 2, 1), levels = 0:2,
                      labels = c("Unemployed", "Service", "Salesman"))
    Workdays <- factor(c(1, 0, 0, 2, 0, 0, 3, 2), levels = 0:3,
                       labels = c("Unemployed", "5 days", "6 days", "7 days"))
    Position <- factor(c(1, 0, 0, 2, 0, 0, 1, 1), levels = 0:2,
                       labels = c("Unemployed", "Temporary", "Permanent"))
    

    Now, if you bind these correctly labeled variables together and run the model again, the output becomes a little nicer.

    data2 <- data.frame(Income, Gender, Age, Employed, Jobtype, Workdays, Position)
    m2 <- lm(Income ~ Gender + Age + Jobtype + Workdays + Position, data = data2)
    summary(m2)
    
    # Coefficients: (3 not defined because of singularities)
    #                   Estimate Std. Error t value Pr(>|t|)
    # (Intercept)         14.795    146.938   0.101    0.936
    # GenderWoman         22.055    125.261   0.176    0.889
    # Age                  1.096      4.983   0.220    0.862
    # JobtypeService      -9.178    325.920  -0.028    0.982
    # JobtypeSalesman    -17.123    250.637  -0.068    0.957
    # Workdays5 days      35.205    216.710   0.162    0.897
    # Workdays6 days     -36.849    246.912  -0.149    0.906
    # Workdays7 days          NA         NA      NA       NA
    # PositionTemporary       NA         NA      NA       NA
    # PositionPermanent       NA         NA      NA       NA
    

    Now to your specific questions:

    1. Yes, it makes a big difference whether you enter a variable into a regression model as numeric or as a factor. R has no way of telling whether a numeric variable is really measured on a numeric scale, so it always treats a numeric variable as such. A 0/1 dummy entered as a main effect happens to give the same fit either way, but once it appears in an interaction without its main effect, as in your formula, R builds different columns for the numeric and the factor version. Therefore, always make sure that nominally or ordinally scaled variables are entered as factors.

    2. As far as R is concerned, a numerically coded dummy variable should be a factor (or logical) variable whose labels are set to 0 and 1. It is OK to do that; it is just not very helpful when interpreting the results.

    3. The singularity issues come mainly from the fact that Employed and the other three variables contain the same information. If you can trivially recreate one variable from another, that is usually a bad sign. The same holds among Jobtype, Workdays, and Position themselves, since each one encodes "Unemployed" as a level. Another source of your missing coefficients is the small number of cases.
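
Regarding questions 1 and 2, the difference between the numeric and the factor version of Employed can be made visible by comparing the design-matrix columns R generates for the interaction. This is a minimal sketch on the question's toy data, restricted to Jobtype for brevity; the names `X_num`, `cols_numeric`, `cols_factor`, and `baseline_all_zero` are only illustrative.

```r
# Toy data from the question
Employed <- c(1, 0, 0, 1, 0, 0, 1, 1)          # numeric 0/1 dummy
Jobtype  <- factor(c(1, 0, 0, 2, 0, 0, 2, 1))  # 0 = Unemployed

# Interaction without main effects, Employed kept numeric
X_num        <- model.matrix(~ Employed:Jobtype)
cols_numeric <- colnames(X_num)

# Same interaction, Employed converted to a factor
cols_factor <- colnames(model.matrix(~ factor(Employed):Jobtype))

print(cols_numeric)
print(cols_factor)

# The baseline interaction column only contains zeros, because nobody
# is both employed (Employed = 1) and in Jobtype level 0.
baseline_all_zero <- all(X_num[, "Employed:Jobtype0"] == 0)
baseline_all_zero
```

Because Employed has no main effect in the formula, R expands Jobtype into indicators for all of its levels, which is why a term for the baseline level Jobtype0 shows up in the output; that column is zero in every row, so its coefficient cannot be estimated. With `factor(Employed)`, R creates a column per level combination, which would explain why the number of singularities grows when Employed is a factor.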