(This post is the second half of the problem from: How to apply loess.smoothing to both plot and then extract points?)
I have plotted loess smoothing to a scatterplot (i.e. between two quantitative variables). I would like to extract only the data points in the scatterplot that are above that smoothing line.
For instance, if this is my scatterplot:
qplot(mpg, cyl, data=mtcars)
I can plot the smoother as:
qplot(hp,wt,data=mtcars) + stat_smooth(method="loess")
Now, I want to extract only the data points that are above the smoother. I have played with code provided in (Method to extract stat_smooth line fit):
model <- loess(wt ~ hp, data=mtcars)
xrange <- range(mtcars$hp)
xseq <- seq(from=xrange[1], to=xrange[2], length=80)
pred <- predict(model, newdata = data.frame(hp = xseq), se=TRUE)
y = pred$fit
ci <- pred$se.fit * qt(0.95 / 2 + .5, pred$df)
ymin = y - ci
ymax = y + ci
loess.DF <- data.frame(x = xseq, y, ymin, ymax, se = pred$se.fit)
This results in a data frame loess.DF of 80 rows and 5 columns.
I know now that I must apply a function to go through each row of the original mtcars data frame, and interpolate for each x-value (hp) its closest predicted loess y-value (wt). My only idea to accomplish this interpolation is to use a linear interpolation similar to (http://www.ajdesigner.com/phpinterpolation/linear_interpolation_equation.php). After that, I would simply compare the y-value in mtcars with the interpolated predicted loess y-value. If the y-value in mtcars is larger than the predicted loess y-value, then I keep that original data point; else, I remove it.
I started to code this, but realize that I cannot do so in an efficient way. One problem is that my (real) data set (which isn't mtcars) is very large (~40,000 rows): First, to do the linear interpolation, I need to find the two x values in the loess fit that are closest to the original x value in my data set (if there isn't exactly one perfect match), and I do not know how to do that efficiently without searching through increasing loess x-values.
How is an efficient way to approach this, for example, testing it out first on the mtcars data set? Thank you.
You have this automatically as the residuals
list component returned by loess
:
> str(model)
List of 17
$ n : int 32
$ fitted : num [1:32] 2.83 2.83 2.57 2.83 3.74 ...
$ residuals: Named num [1:32] -0.2133 0.0417 -0.2477 0.3817 -0.2997 ...
..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
$ enp : num 4.94
$ s : num 0.655
$ one.delta: num 26.1
$ two.delta: num 25.8
$ trace.hat: num 5.43
$ divisor : num 1
...
If you do: model$residuals
, positive values are above the line and negative lines are below:
> which(sign(model$residuals) == 1)
Mazda RX4 Wag Hornet 4 Drive Valiant Merc 240D Merc 230 Merc 280
2 4 6 8 9 10
Merc 280C Merc 450SE Cadillac Fleetwood Lincoln Continental Chrysler Imperial Fiat 128
11 12 15 16 17 18
Dodge Challenger AMC Javelin Pontiac Firebird Maserati Bora
22 23 25 31
The above result is all points from the original data that are above the LOESS curve.