r, interpolation, predict, loess, extrapolation

Predicting via Lowess in R (OR reconciling Loess & Lowess)


I'm trying to interpolate/locally extrapolate some salary data to fill out a data set.

Here's the data set and a plot of the available data:

    experience   salary
 1:          1 21878.67
 2:          2 23401.33
 3:          3 23705.00
 4:          4 24260.00
 5:          5 25758.60
 6:          6 26763.40
 7:          7 27920.00
 8:          8 28600.00
 9:          9 28820.00
10:         10 32600.00
11:         12 30650.00
12:         14 32600.00
13:         15 32600.00
14:         16 37700.00
15:         17 33380.00
16:         20 36784.33
17:         23 35600.00
18:         25 33590.00
19:         30 32600.00
20:         31 33920.00
21:         35 32600.00

A scatterplot of the data given above in tabular form, titled "Experience vs. Salary", with an x-axis labelled "Experience" varying from 0 to 35 and a y-axis labelled "$" with labels at 25,000, 30,000, and 35,000. The points are essentially in a bilinear shape -- increasing steadily from 0 to about 20 years, and plateauing after that.

Given the clear nonlinearity, I'm hoping to interpolate & extrapolate (I want to fill in salary for 0 through 40 years of experience) via a local linear estimator, so I defaulted to lowess, which gives this:

A plot with the same title, axes, and scatterplot points as above, with a red line superimposed giving the fit from the lowess function, which generally follows the data well.

This looks nice on the plot, but the fitted values for the missing years aren't available: lowess only returns fits at the observed x values, and R's plotting device has filled in the blanks for us by connecting them. I haven't been able to find a predict method for lowess, as it seems R is moving towards loess, which as I understand it is a generalization.
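For reference, the line drawn on the plot can be recovered numerically, since lowess returns a list with components x and y. The following is a minimal sketch, assuming that linear interpolation between the fitted points is acceptable; the names lw_fun and filled are illustrative, and holding the boundary value constant outside the data (rule = 2) is an assumption of the sketch, not something lowess does.

```r
# Data from the table above
dat <- data.frame(
  experience = c(1:10, 12, 14, 15, 16, 17, 20, 23, 25, 30, 31, 35),
  salary = c(21878.67, 23401.33, 23705.00, 24260.00, 25758.60, 26763.40,
             27920.00, 28600.00, 28820.00, 32600.00, 30650.00, 32600.00,
             32600.00, 37700.00, 33380.00, 36784.33, 35600.00, 33590.00,
             32600.00, 33920.00, 32600.00)
)

# lowess returns fitted values only at the observed x values
lw <- lowess(dat$experience, dat$salary)

# Interpolate linearly between the fitted points; rule = 2 repeats the
# nearest fitted value outside the observed range (an assumption)
lw_fun <- approxfun(lw$x, lw$y, rule = 2)

filled <- data.frame(experience = 0:40, salary = lw_fun(0:40))
```

This reproduces what the plotted line shows between observed points, but the flat extension beyond years 1 and 35 is only a crude form of extrapolation.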

However, when I use loess (setting surface="direct" to be able to extrapolate, as mentioned in ?loess), which has a standard predict method, the fit is less satisfactory:
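The loess call described above might look roughly like the following sketch. It assumes the data sits in a data frame named dat; all other loess settings are left at their defaults, and the names fit and pred are illustrative.

```r
# Data from the table above
dat <- data.frame(
  experience = c(1:10, 12, 14, 15, 16, 17, 20, 23, 25, 30, 31, 35),
  salary = c(21878.67, 23401.33, 23705.00, 24260.00, 25758.60, 26763.40,
             27920.00, 28600.00, 28820.00, 32600.00, 30650.00, 32600.00,
             32600.00, 37700.00, 33380.00, 36784.33, 35600.00, 33590.00,
             32600.00, 33920.00, 32600.00)
)

# surface = "direct" fits the local regression exactly at each requested
# point, which allows prediction outside the observed range
fit <- loess(salary ~ experience, data = dat,
             control = loess.control(surface = "direct"))

# Predict for 0 through 40 years of experience
pred <- predict(fit, newdata = data.frame(experience = 0:40))
```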

Another plot with the same baseline data, this time showing a blue line superimposed showing the fit from the loess function; this fit is in a U shape, increasing first before decreasing after around 20 Years

(There are strong theoretical reasons to say that salary should be non-decreasing--there is some noise/possible mis-measurement driving the U shape here)

And I can't seem to find any combination of parameters that gets back the non-decreasing fit given by lowess.
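One thing that may be worth checking is that the two functions have different defaults: lowess uses f = 2/3, a local linear fit, and iter = 3 robustifying iterations, whereas loess uses span = 0.75, degree = 2, and family = "gaussian". The sketch below sets the loess arguments to their nearest lowess equivalents. This is an assumption about what may bring the fits closer together, not a guarantee that they coincide or that the result is non-decreasing; the names fit_lw and pred_lw are illustrative.

```r
# Data from the table above
dat <- data.frame(
  experience = c(1:10, 12, 14, 15, 16, 17, 20, 23, 25, 30, 31, 35),
  salary = c(21878.67, 23401.33, 23705.00, 24260.00, 25758.60, 26763.40,
             27920.00, 28600.00, 28820.00, 32600.00, 30650.00, 32600.00,
             32600.00, 37700.00, 33380.00, 36784.33, 35600.00, 33590.00,
             32600.00, 33920.00, 32600.00)
)

# Mimic the lowess defaults: same span, local linear fit, robust fitting
fit_lw <- loess(salary ~ experience, data = dat,
                span = 2/3,            # lowess default f
                degree = 1,            # local linear, as in lowess
                family = "symmetric",  # robust re-weighting, as in lowess
                control = loess.control(surface = "direct"))

pred_lw <- predict(fit_lw, newdata = data.frame(experience = 0:40))
```

Comparing pred_lw at the observed experience values with the output of lowess(dat$experience, dat$salary) would show how much of the gap these settings close.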

Any suggestions for what to do?


Solution

  • I don't know how to "reconcile" those two functions, but I have used the cobs package (COnstrained B-Splines Nonparametric Regression Quantiles) with some success for similar tasks. The default quantile is the (local) median, i.e. the 0.5 quantile. For this dataset the default smoothing choices (the analogue of a span or kernel width) seem very appropriate.

    library(cobs)
    # Package cobs (1.3-0) attached.  To cite, see citation("cobs")

    # dat holds the experience/salary data from the question
    Rbs <- cobs(x = dat$experience, y = dat$salary, constraint = "increase")
    # qbsks2():
    # Performing general knot selection ...
    #
    # Deleting unnecessary knots ...
    Rbs
    # COBS regression spline (degree = 2) from call:
    #     cobs(x = dat$experience, y = dat$salary, constraint = "increase")
    # {tau=0.5}-quantile;  dimensionality of fit: 5 from {5}
    # x$knots[1:4]:  0.999966,  5.000000, 15.000000, 35.000034
    plot(Rbs, lwd = 2.5)
    

    It does have a predict method, although you will need to use idiosyncratic arguments since it doesn't support the usual newdata= formalism; the new x values are passed via the z argument:

    help(predict.cobs)
    predict(Rbs, z = seq(0, 40, by = 5))
    #       z      fit
    #  [1,]  0 21519.83
    #  [2,]  5 25488.71
    #  [3,] 10 30653.44
    #  [4,] 15 32773.21
    #  [5,] 20 33295.84
    #  [6,] 25 33669.14
    #  [7,] 30 33893.12
    #  [8,] 35 33967.78
    #  [9,] 40 33893.12
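
Note that in the output above the fit at z = 40 (33893.12) is slightly below the fit at z = 35 (33967.78), so the "increase" constraint does not appear to carry over to points extrapolated beyond the data. The sketch below builds the full table for years 0 through 40 and, as one possible workaround, takes a running maximum so that the extrapolated tail cannot dip. The cummax step is an assumption of this sketch rather than a feature of cobs, and the name filled is illustrative; it also assumes the cobs package is installed.

```r
library(cobs)  # assumes the cobs package is installed

# Data from the question
dat <- data.frame(
  experience = c(1:10, 12, 14, 15, 16, 17, 20, 23, 25, 30, 31, 35),
  salary = c(21878.67, 23401.33, 23705.00, 24260.00, 25758.60, 26763.40,
             27920.00, 28600.00, 28820.00, 32600.00, 30650.00, 32600.00,
             32600.00, 37700.00, 33380.00, 36784.33, 35600.00, 33590.00,
             32600.00, 33920.00, 32600.00)
)

# Monotone-increasing median spline, as in the answer above
Rbs <- cobs(x = dat$experience, y = dat$salary, constraint = "increase")

# predict.cobs returns a matrix with columns "z" and "fit"
pred <- predict(Rbs, z = 0:40)

# Running maximum keeps the filled-in salaries non-decreasing,
# including in the extrapolated region (assumption of this sketch)
filled <- data.frame(experience = pred[, "z"],
                     salary = cummax(pred[, "fit"]))
```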