Linear model with categorical variables in R


I am trying to fit a linear model with some categorical variables:

model <- lm(price ~ carat+cut+color+clarity)
summary(model)

The output is:

Call:
lm(formula = price ~ carat + cut + color + clarity)

Residuals:
     Min       1Q   Median       3Q      Max 
-11495.7   -688.5   -204.1    458.2   9305.3 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3696.818     47.948 -77.100  < 2e-16 ***
carat        8843.877     40.885 216.311  < 2e-16 ***
cut.L         755.474     68.378  11.049  < 2e-16 ***
cut.Q        -349.587     60.432  -5.785 7.74e-09 ***
cut.C         200.008     52.260   3.827 0.000131 ***
cut^4          12.748     42.642   0.299 0.764994    
color.L      1905.109     61.050  31.206  < 2e-16 ***
color.Q      -675.265     56.056 -12.046  < 2e-16 ***
color.C       197.903     51.932   3.811 0.000140 ***
color^4        71.054     46.940   1.514 0.130165    
color^5         2.867     44.586   0.064 0.948729    
color^6        50.531     40.771   1.239 0.215268    
clarity.L    4045.728    108.363  37.335  < 2e-16 ***
clarity.Q   -1545.178    102.668 -15.050  < 2e-16 ***
clarity.C     999.911     88.301  11.324  < 2e-16 ***
clarity^4    -665.130     66.212 -10.045  < 2e-16 ***
clarity^5     920.987     55.012  16.742  < 2e-16 ***
clarity^6    -712.168     52.346 -13.605  < 2e-16 ***
clarity^7    1008.604     45.842  22.002  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1167 on 4639 degrees of freedom
Multiple R-squared:  0.9162,    Adjusted R-squared:  0.9159 
F-statistic:  2817 on 18 and 4639 DF,  p-value: < 2.2e-16

But I don't understand why the coefficients are labelled ".L", ".Q", ".C", "^4", and so on. Something seems to be wrong, but I don't know what. I have already tried applying the function factor to each variable.


Solution

  • You are seeing how “ordered” (ordinal) factor variables are handled by regression functions. The default contrasts for an ordered factor are orthogonal polynomial contrasts up to degree n-1, where n is the number of levels of that factor. That result is not going to be easy to interpret, especially if there is no natural order. Even if there is one, as there may well be in this case, you might not want the default ordering (which is alphabetical by factor level), and you probably don't want more than a few degrees in the polynomial contrasts.
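To see where the `.L`, `.Q`, `.C` labels come from, here is a minimal sketch in base R. The `sizes` vector and its level names are made up for illustration and are not part of the question's data, and the sketch assumes R's default `contrasts` option has not been changed.

```r
# Hypothetical toy data; not part of the diamonds dataset.
sizes <- c("small", "medium", "large", "medium", "small", "large")
lv <- c("small", "medium", "large")

f_unordered <- factor(sizes, levels = lv)
f_ordered   <- factor(sizes, levels = lv, ordered = TRUE)

# Default contrast functions for unordered and ordered factors, in that order.
getOption("contrasts")

# Unordered: treatment (dummy) contrasts, one column per non-reference level.
contrasts(f_unordered)

# Ordered: orthogonal polynomial contrasts with columns .L and .Q
# (linear and quadratic); a factor with n levels gets n - 1 columns.
contrasts(f_ordered)

# The same polynomial contrast matrix can be built directly.
contr.poly(3)
```

With more levels the extra columns are named `.C`, `^4`, `^5`, and so on, which is the pattern in the summary output in the question.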

    In the case of ggplot2's diamonds dataset, the factor levels are set up correctly, but most newcomers who stumble across ordered factors end up with levels ordered like "Excellent" < "Fair" < "Good" < "Poor". (Fail)

    > levels(diamonds$cut)
    [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"    
    > levels(diamonds$clarity)
    [1] "I1"   "SI2"  "SI1"  "VS2"  "VS1"  "VVS2" "VVS1" "IF"  
    > levels(diamonds$color)
    [1] "D" "E" "F" "G" "H" "I" "J"
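The alphabetical-ordering trap mentioned above can be sketched as follows. The `ratings` vector and its level names are hypothetical; the point is only that passing `levels` explicitly is one way to get the intended ranking.

```r
# Hypothetical ratings; the level names are made up for illustration.
ratings <- c("Good", "Poor", "Excellent", "Fair", "Good")

# Without `levels`, the levels are sorted alphabetically,
# which is rarely the real ranking.
bad <- factor(ratings, ordered = TRUE)
levels(bad)

# Passing `levels` from worst to best gives the intended order.
good <- factor(ratings,
               levels = c("Poor", "Fair", "Good", "Excellent"),
               ordered = TRUE)
levels(good)

# Comparisons now follow the stated ranking ("Good" vs "Poor").
good[1] > good[2]
```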
    

    One method for using ordered factors that have been set up correctly is simply to wrap them in as.numeric, which gives you a linear test of trend.

    > contrasts(diamonds$cut) <- contr.treatment(5) # Use treatment (dummy) contrasts instead of polynomial ones
    > model <- lm(price ~ carat+cut+as.numeric(color)+as.numeric(clarity), diamonds)
    > summary(model)
    
    Call:
    lm(formula = price ~ carat + cut + as.numeric(color) + as.numeric(clarity), 
        data = diamonds)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -19130.3   -696.1   -176.8    556.9   9599.8 
    
    Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
    (Intercept)         -5189.460     36.577 -141.88   <2e-16 ***
    carat                8791.452     12.659  694.46   <2e-16 ***
    cut2                  909.433     35.346   25.73   <2e-16 ***
    cut3                 1129.518     32.772   34.47   <2e-16 ***
    cut4                 1156.989     32.427   35.68   <2e-16 ***
    cut5                 1264.128     32.160   39.31   <2e-16 ***
    as.numeric(color)    -318.518      3.282  -97.05   <2e-16 ***
    as.numeric(clarity)   522.198      3.521  148.31   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 1227 on 53932 degrees of freedom
    Multiple R-squared:  0.9054,    Adjusted R-squared:  0.9054 
    F-statistic: 7.371e+04 on 7 and 53932 DF,  p-value: < 2.2e-16
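As a small illustration of what `as.numeric` does to an ordered factor, and of one possible alternative, here is a sketch on made-up data. The `grade` and `y` vectors are hypothetical, and dropping the ordering with `factor(..., ordered = FALSE)` is a suggestion of mine rather than something the answer above prescribes.

```r
# Hypothetical toy data, made up for illustration.
grade <- factor(c("low", "mid", "high", "mid", "low", "high"),
                levels = c("low", "mid", "high"), ordered = TRUE)
y <- c(1.0, 2.1, 2.9, 1.9, 1.2, 3.1)

# as.numeric() returns the integer level codes
# (1 = "low", 2 = "mid", 3 = "high").
as.numeric(grade)

# One slope for the whole factor: a linear trend across the levels.
fit_trend <- lm(y ~ as.numeric(grade))
names(coef(fit_trend))

# Alternative: drop the ordering so lm() uses treatment contrasts,
# giving one dummy coefficient per non-reference level.
grade_u <- factor(grade, ordered = FALSE)
fit_dummy <- lm(y ~ grade_u)
names(coef(fit_dummy))
```

The trend model treats neighbouring levels as equally spaced, which is an assumption worth checking. The dummy-coded model makes no such assumption, at the cost of one coefficient per extra level.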