r regression linear-regression polynomials quadratic

Polynomial regression: Why do you need to define polynomial terms outside of lm() to avoid singularities?

If you try to run a polynomial regression where x^2 is defined in the lm() function, the polynomial term is dropped due to singularities. However, if we define the polynomial term outside the lm(), the model is fit correctly.

It seems like it should work the same both ways. Why do we need to define the polynomial term outside the lm() function?

x <- round(rnorm(100, mean = 0, sd = 10))
y <- round(x*2.5 + rnorm(100))

# Trying to define x^2 in the model, x^2 is dropped
model_wrong <- lm(y ~ x + x^2)

# Define x^2 as its own object
x2 <- x^2
model_right <- lm(y ~ x + x2)

Solution

lm doesn't know where the term starts and stops within the formula unless you tell it, usually by wrapping it in a function. For arbitrary calculations, you can wrap them in I(...), which tells the function to use it as-is:

set.seed(47)
x <- round(rnorm(100, mean = 0, sd = 10))
y <- round(x*2.5 + rnorm(100))

lm(y ~ x + I(x^2))
#> 
#> Call:
#> lm(formula = y ~ x + I(x^2))
#> 
#> Coefficients:
#> (Intercept)            x       I(x^2)  
#>   2.563e-01    2.488e+00   -3.660e-06

Really, you can wrap x^2 in most any function call that will return an evaluated vector that can be used in the model matrix. In some cases cbind can be very handy, though c, identity, or even {...} will work. I is purpose-built, though.

Alternatively, you can use the poly function to make both terms for you, which is very useful for higher-degree polynomials. By default, it generates orthogonal polynomials, which will make the coefficients look different:

lm(y ~ poly(x, 2))
#> 
#> Call:
#> lm(formula = y ~ poly(x, 2))
#> 
#> Coefficients:
#> (Intercept)  poly(x, 2)1  poly(x, 2)2  
#>    1.500000   243.485357    -0.004319

even though they will evaluate the same:

new <- data.frame(x = seq(-1, 1, .5))

predict(lm(y ~ x + I(x^2)), new)
#>          1          2          3          4          5 
#> -2.2317175 -0.9876930  0.2563297  1.5003505  2.7443695

predict(lm(y ~ poly(x, 2)), new)
#>          1          2          3          4          5 
#> -2.2317175 -0.9876930  0.2563297  1.5003505  2.7443695

If you really want your coefficients to be the same, add raw = TRUE:

lm(y ~ poly(x, 2, raw = TRUE))
#> 
#> Call:
#> lm(formula = y ~ poly(x, 2, raw = TRUE))
#> 
#> Coefficients:
#>             (Intercept)  poly(x, 2, raw = TRUE)1  poly(x, 2, raw = TRUE)2  
#>               2.563e-01                2.488e+00               -3.660e-06