I understand that I need one dummy variable less than the total number of dummy variables. However, I am stuck, because I keep receiving the error: "1 not defined because of singularities", when running lm in R. I found a similar question here: What is causing this error? Coefficients not defined because of singularities but it is slightly different than my problem.
I have two treatments (1) "benefit" and (2) "history", with two Levels each (1) "low" and "high" and (2) "short" and "long", i.e. 4 combinations. Additionally, I have a Control Group, which was exposed to neither. Therefore, I coded 4 dummy variables (which is one less than the total number of Groups n=5). Followingly, the dummy coded data looks like this:
low benefit high benefit short history long history
Control group 0 o 0 0
low benefit, short history 1 0 1 0
low benefit, long history 1 0 0 1
high benefit, short history 0 1 1 0
high benefit, long history 0 1 0 1
When I run my lm I get this:
Model:
summary(lm(X ~ short history + high benefit + long history + low benefit + Control variables, data = df))
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.505398100 0.963932438 5.71139 4.8663e-08 ***
Dummy short history 0.939025772 0.379091565 2.47704 0.0142196 *
Dummy high benefit -0.759944023 0.288192645 -2.63693 0.0091367 **
Dummy long history 0.759352915 0.389085599 1.95163 0.0526152 .
Dummy low benefit NA NA NA NA
Control Varibales xxx xxx xxx xxx
This error occurs always for the dummy varibale at the 4th Position. The Control variables are all calculated without problem.
I already tried to only include two variables with two levels, meaning for "history" I coded, 1 for "long" and 0 for "short", and for "benefit", 1 for "high" and 0 for "low". This way, the lm worked, but the problem is, that the Control Group and the combination "short history, low benefit" are coded identically, i.e. 0 and 0 for both variables.
I am sorry, if this is a basic mistake but I have not been able to figure it out. If you need more information, please say so. Thanks in advance.
As I put in the comments you only have two variables, if you make them factors and check the contrasts r
will do the right thing. Please also see http://www.sthda.com/english/articles/40-regression-analysis/163-regression-with-categorical-variables-dummy-coding-essentials-in-r/
Make up data representative of yours.
set.seed(2020)
df <- data.frame(
X = runif(n = 120, min = 5, max = 15),
benefit = rep(c("control", "low", "high"), 40),
history = c(rep("control", 40), rep("long", 40), rep("short", 40))
)
Make benefit
and history
factors, check that control is the base contrast for each.
df$benefit <- factor(df$benefit)
df$history <- factor(df$history)
contrasts(df$benefit)
#> high low
#> control 0 0
#> high 1 0
#> low 0 1
contrasts(df$history)
#> long short
#> control 0 0
#> long 1 0
#> short 0 1
Run the regression and get the summary. 4 coefficient all conpared to control/control.
lm(X ~ benefit + history, df)
#>
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#>
#> Coefficients:
#> (Intercept) benefithigh benefitlow historylong historyshort
#> 9.94474 -0.08721 0.11245 0.37021 -0.35675
summary(lm(X ~ benefit + history, df))
#>
#> Call:
#> lm(formula = X ~ benefit + history, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.4059 -2.3706 -0.0007 2.4986 4.7669
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 9.94474 0.56786 17.513 <2e-16 ***
#> benefithigh -0.08721 0.62842 -0.139 0.890
#> benefitlow 0.11245 0.62842 0.179 0.858
#> historylong 0.37021 0.62842 0.589 0.557
#> historyshort -0.35675 0.62842 -0.568 0.571
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.809 on 115 degrees of freedom
#> Multiple R-squared: 0.01253, Adjusted R-squared: -0.02182
#> F-statistic: 0.3648 on 4 and 115 DF, p-value: 0.8333