
"contrasts can be applied only to factors with 2 or more levels" error when 2 or more levels do exist (R)


I am working with the data from https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data. The columns of interest are two factors, Street and Alley, each with 2 or more levels, plus one target value, SalePrice.

  Street      Alley        SalePrice     
 Grvl:   6   Grvl:  50   Min.   : 34900  
 Pave:1454   Pave:  41   1st Qu.:129975  
             NA's:1369   Median :163000  
                         Mean   :180921  
                         3rd Qu.:214000  
                         Max.   :755000 

Running a linear regression on each factor separately works fine.

> summary(lm(SalePrice ~ Street, data=train))

Call:
lm(formula = SalePrice ~ Street, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-146231  -51131  -18131   32869  573869 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   130190      32416   4.016 6.21e-05 ***
StreetPave     50940      32483   1.568    0.117    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 79400 on 1458 degrees of freedom
Multiple R-squared:  0.001684,  Adjusted R-squared:  0.0009992 
F-statistic: 2.459 on 1 and 1458 DF,  p-value: 0.117

> summary(lm(SalePrice ~ Alley, data=train))

Call:
lm(formula = SalePrice ~ Alley, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-128001  -17001    1781   16999  133781 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   122219       5153  23.718  < 2e-16 ***
AlleyPave      45782       7677   5.963  4.9e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 36440 on 89 degrees of freedom
  (1369 observations deleted due to missingness)
Multiple R-squared:  0.2855,    Adjusted R-squared:  0.2775 
F-statistic: 35.56 on 1 and 89 DF,  p-value: 4.9e-08

However, fitting both factors together in one model raises an error, which doesn't make sense to me.

> summary(lm(SalePrice ~ Street+Alley, data=train))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

Can someone help with this?


Solution

  • I got a hint from this line in the question: (1369 observations deleted due to missingness)

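    By default, lm drops every row that has a missing value in any variable used in the model. When Street and Alley are fitted together, the 1369 rows where Alley is NA are dropped, and all 91 remaining rows have Street = Pave. Street is then left with a single observed level, which triggers the error.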

    > train[!is.na(Alley), Street]
     [1] Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave
    [16] Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave
    [31] Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave
    [46] Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave
    [61] Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave
    [76] Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave Pave
    [91] Pave
    Levels: Grvl Pave
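
The same check can be reproduced on a small example. The sketch below uses a made-up data frame named toy (not the Kaggle data) with the same pattern: every row where Alley is present has Street = Pave. It counts the levels each factor has left after incomplete rows are dropped, reproduces the error, and then shows one possible workaround, which is to recode NA in Alley as its own level. The label "None" and the column name Alley2 are assumptions for illustration. Whether that recoding is appropriate depends on what NA means in the data; it is reasonable only if NA stands for "no alley access" rather than an unknown value.

```r
# Toy data mimicking the structure in the question (values are invented).
toy <- data.frame(
  Street    = factor(c("Grvl", "Pave", "Pave", "Pave", "Pave", "Pave")),
  Alley     = factor(c(NA, NA, "Grvl", "Pave", "Grvl", "Pave")),
  SalePrice = c(100, 150, 120, 170, 125, 180)
)

# Rows lm() would keep: complete cases across all variables in the model.
used <- toy[complete.cases(toy[, c("SalePrice", "Street", "Alley")]), ]

# Number of levels actually observed in each factor after dropping NA rows.
levels_left <- sapply(used[, c("Street", "Alley")],
                      function(x) nlevels(droplevels(x)))
print(levels_left)

# Fitting both factors together fails because Street has one level left.
err_msg <- tryCatch(
  {
    lm(SalePrice ~ Street + Alley, data = toy)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
print(err_msg)

# Possible workaround (assumption: NA means "no alley", not "unknown"):
# recode NA as an explicit level so no rows are dropped.
alley_chr <- as.character(toy$Alley)
alley_chr[is.na(alley_chr)] <- "None"   # "None" is a hypothetical label
toy$Alley2 <- factor(alley_chr)

fit <- lm(SalePrice ~ Street + Alley2, data = toy)
print(nobs(fit))
```

Any factor that reports a single level in levels_left is the one causing the error. If recoding NA is not justified, the alternatives are to drop that factor from the model or to impute the missing values first.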