If you try to run a polynomial regression where x^2
is defined in the lm()
function, the polynomial term is dropped due to singularities. However, if we define the polynomial term outside the lm()
, the model is fit correctly.
It seems like it should work the same both ways. Why do we need to define the polynomial term outside the lm()
function?
x <- round(rnorm(100, mean = 0, sd = 10))
y <- round(x*2.5 + rnorm(100))
# Trying to define x^2 in the model, x^2 is dropped
model_wrong <- lm(y ~ x + x^2)
# Define x^2 as its own object
x2 <- x^2
model_right <- lm(y ~ x + x2)
lm
doesn't know where the term starts and stops within the formula unless you tell it, usually by wrapping it in a function. For arbitrary calculations, you can wrap them in I(...)
, which tells the function to use it as-is:
set.seed(47)
x <- round(rnorm(100, mean = 0, sd = 10))
y <- round(x*2.5 + rnorm(100))
lm(y ~ x + I(x^2))
#>
#> Call:
#> lm(formula = y ~ x + I(x^2))
#>
#> Coefficients:
#> (Intercept) x I(x^2)
#> 2.563e-01 2.488e+00 -3.660e-06
Really, you can wrap x^2
in most any function call that will return an evaluated vector that can be used in the model matrix. In some cases cbind
can be very handy, though c
, identity
, or even {...}
will work. I
is purpose-built, though.
Alternatively, you can use the poly
function to make both terms for you, which is very useful for higher-degree polynomials. By default, it generates orthogonal polynomials, which will make the coefficients look different:
lm(y ~ poly(x, 2))
#>
#> Call:
#> lm(formula = y ~ poly(x, 2))
#>
#> Coefficients:
#> (Intercept) poly(x, 2)1 poly(x, 2)2
#> 1.500000 243.485357 -0.004319
even though they will evaluate the same:
new <- data.frame(x = seq(-1, 1, .5))
predict(lm(y ~ x + I(x^2)), new)
#> 1 2 3 4 5
#> -2.2317175 -0.9876930 0.2563297 1.5003505 2.7443695
predict(lm(y ~ poly(x, 2)), new)
#> 1 2 3 4 5
#> -2.2317175 -0.9876930 0.2563297 1.5003505 2.7443695
If you really want your coefficients to be the same, add raw = TRUE
:
lm(y ~ poly(x, 2, raw = TRUE))
#>
#> Call:
#> lm(formula = y ~ poly(x, 2, raw = TRUE))
#>
#> Coefficients:
#> (Intercept) poly(x, 2, raw = TRUE)1 poly(x, 2, raw = TRUE)2
#> 2.563e-01 2.488e+00 -3.660e-06