r performance linear-interpolation loess

How to interpolate and extract points above loess smooth in R?

(This post is the second half of the problem from: How to apply loess.smoothing to both plot and then extract points?)

I have plotted loess smoothing to a scatterplot (i.e. between two quantitative variables). I would like to extract only the data points in the scatterplot that are above that smoothing line.

For instance, if this is my scatterplot:

qplot(mpg, cyl, data=mtcars)

I can plot the smoother as:

qplot(hp,wt,data=mtcars) + stat_smooth(method="loess")

Now, I want to extract only the data points that are above the smoother. I have played with code provided in (Method to extract stat_smooth line fit):

model <- loess(wt ~ hp, data=mtcars)
xrange <- range(mtcars$hp)
xseq <- seq(from=xrange[1], to=xrange[2], length=80)
pred <- predict(model, newdata = data.frame(hp = xseq), se=TRUE)
y = pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)

This results in a data frame loess.DF of 80 rows and 5 columns.

I know now that I must apply a function to go through each row of the original mtcars data frame, and interpolate for each x-value (hp) its closest predicted loess y-value (wt). My only idea to accomplish this interpolation is to use a linear interpolation similar to (http://www.ajdesigner.com/phpinterpolation/linear_interpolation_equation.php). After that, I would simply compare the y-value in mtcars with the interpolated predicted loess y-value. If the y-value in mtcars is larger than the predicted loess y-value, then I keep that original data point; else, I remove it.

I started to code this, but realize that I cannot do so in an efficient way. One problem is that my (real) data set (which isn't mtcars) is very large (~40,000 rows): First, to do the linear interpolation, I need to find the two x values in the loess fit that are closest to the original x value in my data set (if there isn't exactly one perfect match), and I do not know how to do that efficiently without searching through increasing loess x-values.

How is an efficient way to approach this, for example, testing it out first on the mtcars data set? Thank you.

Solution

You have this automatically as the residuals list component returned by loess:

> str(model)
List of 17
 $ n        : int 32
 $ fitted   : num [1:32] 2.83 2.83 2.57 2.83 3.74 ...
 $ residuals: Named num [1:32] -0.2133 0.0417 -0.2477 0.3817 -0.2997 ...
  ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
 $ enp      : num 4.94
 $ s        : num 0.655
 $ one.delta: num 26.1
 $ two.delta: num 25.8
 $ trace.hat: num 5.43
 $ divisor  : num 1
...

If you do: model$residuals, positive values are above the line and negative lines are below:

> which(sign(model$residuals) == 1)
      Mazda RX4 Wag      Hornet 4 Drive             Valiant           Merc 240D            Merc 230            Merc 280 
                  2                   4                   6                   8                   9                  10 
          Merc 280C          Merc 450SE  Cadillac Fleetwood Lincoln Continental   Chrysler Imperial            Fiat 128 
                 11                  12                  15                  16                  17                  18 
   Dodge Challenger         AMC Javelin    Pontiac Firebird       Maserati Bora 
                 22                  23                  25                  31

The above result is all points from the original data that are above the LOESS curve.