Search code examples
rregressionlinear-regressionlmpolynomials

High (or very high) order polynomial regression in R (or alternatives?)


I would like to fit a (very) high order regression to a set of data in R, however the poly() function has a limit of order 25.

For this application I need an order on the range of 100 to 120.

model <- lm(noisy.y ~ poly(q,50))
# Error in poly(q, 50) : 'degree' must be less than number of unique points
model <- lm(noisy.y ~ poly(q,30))
# Error in poly(q, 30) : 'degree' must be less than number of unique points
model <- lm(noisy.y ~ poly(q,25))
# OK

Solution

  • Polynomials and orthogonal polynomials

    poly(x) has no hard-coded limit for degree. However, there are two numerical constraints in practice.

    • Basis functions are constructed on unique location of x values. A polynomial of degree k has k + 1 basis and coefficients. poly generates basis without the intercept term, so degree = k implies k basis and k coefficients. If there are n unique x values, it must be satisfied that k <= n, otherwise there is simply insufficient information to construct a polynomial. Inside poly(), the following line checks this condition:

      if (degree >= length(unique(x))) 
          stop("'degree' must be less than number of unique points")
      
    • Correlation between x ^ k and x ^ (k+1) is getting closer and closer to 1 as k increases. Such approaching speed is of course dependent on x values. poly first generates ordinary polynomial basis, then performs QR factorization to find orthogonal span. If numerical rank-deficiency occurs between x ^ k and x ^ (k+1), poly will also stop and complain:

      if (QR$rank < degree) 
          stop("'degree' must be less than number of unique points")
      

      But the error message is not informative in this case. Furthermore, this does not have to be an error; it can be a warning then poly can reset degree to rank to proceed. Maybe R core can improve on this bit??

    Your trial-and-error shows that you can't construct a polynomial of more than 25 degree. You can first check length(unique(q)). If you have a degree smaller than this but still triggering error, you know for sure it is due to numerical rank-deficiency.

    But what I want to say is that a polynomial of more than 3-5 degree is never useful! The critical reason is the Runge's phenomenon. In terms of statistical terminology: a high-order polynomial always badly overfits data!. Don't naively think that because orthogonal polynomials are numerically more stable than raw polynomials, Runge's effect can be eliminated. No, a polynomial of degree k forms a vector space, so whatever basis you use for representation, they have the same span!


    Splines: piecewise cubic polynomials and its use in regression

    Polynomial regression is indeed helpful, but we often want piecewise polynomials. The most popular choice is cubic spline. Like that there are different representation for polynomials, there are plenty of representation for splines:

    • truncated power basis
    • natural cubic spline basis
    • B-spline basis

    B-spline basis is the most numerically stable, as it has compact support. As a result, the covariance matrix X'X is banded, thus solving normal equations (X'X) b = (X'y) are very stable.

    In R, we can use bs function from splines package (one of R base packages) to get B-spline basis. For bs(x), The only numerical constraint on degree of freedom df is that we can't have more basis than length(unique(x)).

    I am not sure of what your data look like, but perhaps you can try

    library(splines)
    model <- lm(noisy.y ~ bs(q, df = 10))
    

    Penalized smoothing / regression splines

    Regression spline is still likely to overfit your data, if you keep increasing the degree of freedom. The best model seems to be about choosing the best degree of freedom.

    A great approach is using penalized smoothing spline or penalized regression spline, so that model estimation and selection of degree of freedom (i.e., "smoothness") are integrated.

    The smooth.spline function in stats package can do both. Unlike what its name seems to suggest, for most of time it is just fitting a penalized regression spline rather than smoothing spline. Read ?smooth.spline for more. For your data, you may try

    fit <- smooth.spline(q, noisy.y)
    

    (Note, smooth.spline has no formula interface.)


    Additive penalized splines and Generalized Additive Models (GAM)

    Once we have more than one covariates, we need additive models to overcome the "curse of dimensionality" while being sensible. Depending on representation of smooth functions, GAM can come in various forms. The most popular, in my opinion, is the mgcv package, recommended by R.

    You can still fit a univariate penalized regression spline with mgcv:

    library(mgcv)
    fit <- gam(noisy.y ~ s(q, bs = "cr", k = 10))