
Why is I() ("AsIs") necessary when making a polynomial linear model in R?


I'm trying to understand the role of the I() base function in R when fitting a polynomial linear model, either directly or with the poly() function. When I fit the model using

  1. q + q^2
  2. q + I(q^2)
  3. poly(q, 2)

I get different results.

Here is an example:

set.seed(20)
q <- seq(from=0, to=20, by=0.1)
y <- 500 + .1 * (q-5)^2
noise <- rnorm(length(q), mean=10, sd=80)
noisy.y <- y + noise
model3 <- lm(noisy.y ~ poly(q,2))
model1 <- lm(noisy.y ~ q + I(q^2))
model2 <- lm(noisy.y ~ q + q^2)
I(q^2)==I(q)^2
I(q^2)==q^2

summary(model1)
summary(model2)
summary(model3)

Here is the output:

> summary(model1)

Call:
lm(formula = noisy.y ~ q + I(q^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-211.592  -50.609    4.742   61.983  165.792 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 489.3723    16.5982  29.483   <2e-16 ***
q             5.0560     3.8344   1.319    0.189    
I(q^2)       -0.1530     0.1856  -0.824    0.411    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 79.22 on 198 degrees of freedom
Multiple R-squared:  0.02451,   Adjusted R-squared:  0.01466 
F-statistic: 2.488 on 2 and 198 DF,  p-value: 0.08568

> summary(model2)

Call:
lm(formula = noisy.y ~ q + q^2)

Residuals:
    Min      1Q  Median      3Q     Max 
-219.96  -54.42    3.30   61.06  170.79 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 499.5209    11.1252  44.900   <2e-16 ***
q             1.9961     0.9623   2.074   0.0393 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 79.16 on 199 degrees of freedom
Multiple R-squared:  0.02117,   Adjusted R-squared:  0.01625 
F-statistic: 4.303 on 1 and 199 DF,  p-value: 0.03933

> summary(model3)

Call:
lm(formula = noisy.y ~ poly(q, 2))

Residuals:
     Min       1Q   Median       3Q      Max 
-211.592  -50.609    4.742   61.983  165.792 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  519.482      5.588  92.966   <2e-16 ***
poly(q, 2)1  164.202     79.222   2.073   0.0395 *  
poly(q, 2)2  -65.314     79.222  -0.824   0.4107    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 79.22 on 198 degrees of freedom
Multiple R-squared:  0.02451,   Adjusted R-squared:  0.01466 
F-statistic: 2.488 on 2 and 198 DF,  p-value: 0.08568

Why is I() necessary when fitting a polynomial model in R?

Also, is it normal that the poly function doesn't give the same result as q + I(q^2)?


Solution

  • The formula syntax in R is described in the ?formula help page. Inside a formula, the ^ symbol does not have its usual arithmetic meaning of exponentiation. Instead, it crosses the terms at its base, producing all interactions up to the given degree. For example

    y ~ (a+b)^2
    

    is the same as

    y ~ a + b + a:b
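
    You can verify this expansion by comparing the design matrices the two formulas produce. A minimal sketch with two hypothetical factors a and b (names chosen here for illustration):

```r
# Two toy factors; (a + b)^2 should expand to a + b + a:b.
d  <- data.frame(a = gl(2, 3), b = gl(3, 2))
m1 <- model.matrix(~ (a + b)^2, d)
m2 <- model.matrix(~ a + b + a:b, d)

# Both formulas yield the same columns: intercept, main effects, interaction.
identical(colnames(m1), colnames(m2))  # TRUE
```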
    

    But if you do

    y ~ a + b^2
    y ~ a + b    # same as above, no way to "interact" b with itself.
    

    That caret just includes the b term, because b cannot interact with itself. So ^ and * inside formulas have nothing to do with arithmetic multiplication, just as + doesn't really mean addition of variables in the usual sense.
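
    A quick way to see this is to compare design matrices with model.matrix(). A sketch using a toy numeric vector q:

```r
q <- seq(0, 20, by = 0.1)

# In formula syntax, q^2 means "q crossed with itself", which simplifies
# back to q, so the design matrix is the same as for ~ q alone.
m_plain <- model.matrix(~ q + q^2)
colnames(m_plain)  # "(Intercept)" "q"

# Wrapping in I() keeps ^ as arithmetic, adding a genuine squared column.
m_asis <- model.matrix(~ q + I(q^2))
colnames(m_asis)   # "(Intercept)" "q" "I(q^2)"
```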

    If you want the usual arithmetic meaning of ^2, you need to wrap the term in I(), the "as-is" function. Otherwise the model isn't fitting a squared term at all.

    And the poly() function by default returns orthogonal polynomials, as described on its help page. This helps reduce collinearity among the covariates. But if you don't want the orthogonal versions and just want the "raw" polynomial terms, pass raw=TRUE to your poly() call. For example

    lm(noisy.y ~ poly(q,2, raw=TRUE))
    

    will return the same estimates as model1.
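
    As a check, a sketch reproducing the question's simulated data: the raw-polynomial fit agrees with model1 coefficient for coefficient, while the orthogonal fit (model3) has different coefficients but spans the same column space, so its fitted values are identical:

```r
set.seed(20)
q <- seq(from = 0, to = 20, by = 0.1)
noisy.y <- 500 + 0.1 * (q - 5)^2 + rnorm(length(q), mean = 10, sd = 80)

model1  <- lm(noisy.y ~ q + I(q^2))
model3  <- lm(noisy.y ~ poly(q, 2))              # orthogonal basis
model1b <- lm(noisy.y ~ poly(q, 2, raw = TRUE))  # raw basis

# Same coefficients as model1, just different term labels.
all.equal(unname(coef(model1)), unname(coef(model1b)))  # TRUE

# The orthogonal parameterization fits the same model, so the fitted
# values (and hence residuals and R-squared) are identical.
all.equal(fitted(model1), fitted(model3))  # TRUE
```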