I would like to fit a (very) high order regression to a set of data in R, however the poly()
function has a limit of order 25.
For this application I need an order on the range of 100 to 120.
model <- lm(noisy.y ~ poly(q,50))
# Error in poly(q, 50) : 'degree' must be less than number of unique points
model <- lm(noisy.y ~ poly(q,30))
# Error in poly(q, 30) : 'degree' must be less than number of unique points
model <- lm(noisy.y ~ poly(q,25))
# OK
Polynomials and orthogonal polynomials
poly(x)
has no hard-coded limit for degree
. However, there are two numerical constraints in practice.
Basis functions are constructed on unique location of x
values. A polynomial of degree k
has k + 1
basis and coefficients. poly
generates basis without the intercept term, so degree = k
implies k
basis and k
coefficients. If there are n
unique x
values, it must be satisfied that k <= n
, otherwise there is simply insufficient information to construct a polynomial. Inside poly()
, the following line checks this condition:
if (degree >= length(unique(x)))
stop("'degree' must be less than number of unique points")
Correlation between x ^ k
and x ^ (k+1)
is getting closer and closer to 1 as k
increases. Such approaching speed is of course dependent on x
values. poly
first generates ordinary polynomial basis, then performs QR factorization to find orthogonal span. If numerical rank-deficiency occurs between x ^ k
and x ^ (k+1)
, poly
will also stop and complain:
if (QR$rank < degree)
stop("'degree' must be less than number of unique points")
But the error message is not informative in this case. Furthermore, this does not have to be an error; it can be a warning then poly
can reset degree
to rank
to proceed. Maybe R core can improve on this bit??
Your trial-and-error shows that you can't construct a polynomial of more than 25 degree. You can first check length(unique(q))
. If you have a degree smaller than this but still triggering error, you know for sure it is due to numerical rank-deficiency.
But what I want to say is that a polynomial of more than 3-5 degree is never useful! The critical reason is the Runge's phenomenon. In terms of statistical terminology: a high-order polynomial always badly overfits data!. Don't naively think that because orthogonal polynomials are numerically more stable than raw polynomials, Runge's effect can be eliminated. No, a polynomial of degree k
forms a vector space, so whatever basis you use for representation, they have the same span!
Splines: piecewise cubic polynomials and its use in regression
Polynomial regression is indeed helpful, but we often want piecewise polynomials. The most popular choice is cubic spline. Like that there are different representation for polynomials, there are plenty of representation for splines:
B-spline basis is the most numerically stable, as it has compact support. As a result, the covariance matrix X'X
is banded, thus solving normal equations (X'X) b = (X'y)
are very stable.
In R, we can use bs
function from splines
package (one of R base packages) to get B-spline basis. For bs(x)
, The only numerical constraint on degree of freedom df
is that we can't have more basis than length(unique(x))
.
I am not sure of what your data look like, but perhaps you can try
library(splines)
model <- lm(noisy.y ~ bs(q, df = 10))
Penalized smoothing / regression splines
Regression spline is still likely to overfit your data, if you keep increasing the degree of freedom. The best model seems to be about choosing the best degree of freedom.
A great approach is using penalized smoothing spline or penalized regression spline, so that model estimation and selection of degree of freedom (i.e., "smoothness") are integrated.
The smooth.spline
function in stats
package can do both. Unlike what its name seems to suggest, for most of time it is just fitting a penalized regression spline rather than smoothing spline. Read ?smooth.spline
for more. For your data, you may try
fit <- smooth.spline(q, noisy.y)
(Note, smooth.spline
has no formula interface.)
Additive penalized splines and Generalized Additive Models (GAM)
Once we have more than one covariates, we need additive models to overcome the "curse of dimensionality" while being sensible. Depending on representation of smooth functions, GAM can come in various forms. The most popular, in my opinion, is the mgcv
package, recommended by R.
You can still fit a univariate penalized regression spline with mgcv
:
library(mgcv)
fit <- gam(noisy.y ~ s(q, bs = "cr", k = 10))