In using R's lm() function to calculate the line of best fit for my data, I've run into an issue: one or two major outliers in my data set are pulling the line to a position where it doesn't help me understand my data. My goal is to change the criterion lm() uses to fit the line from the sum of squared residuals to the sum of the absolute values of the residuals. Does anyone know how to do this?
I'm going to suggest an alternative approach, robust linear models; these don't use the mean (or sum) of absolute deviations, but rather downweight the effect of outliers. MASS::rlm has essentially the same syntax as lm: here I'm illustrating it in a ggplot context.
You could also use robustbase::lmrob() for a different implementation of the same approach, or (as suggested by G. Grothendieck) quantreg::rq() to fit a straight-line model for the median, which basically corresponds to what you asked for in the first place: minimizing the sum of absolute residuals (least-absolute-deviations regression).
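If it helps to see the plain call syntax outside of ggplot, here's a minimal sketch on a toy data set of my own (the data, variable names, and fit_* object names are just for illustration, and it assumes the robustbase and quantreg packages are installed):

## sketch of the direct fitting calls on a toy data set
library(MASS)
library(robustbase)
library(quantreg)

set.seed(101)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
y[1] <- 10                    ## plant one gross outlier
d <- data.frame(x, y)

fit_ls  <- lm(y ~ x, data = d)             ## ordinary least squares
fit_rlm <- rlm(y ~ x, data = d)            ## M-estimation (Huber psi by default)
fit_rob <- lmrob(y ~ x, data = d)          ## MM-estimation
fit_lad <- rq(y ~ x, tau = 0.5, data = d)  ## median regression = least absolute deviations

## compare intercepts and slopes across the four fits
sapply(list(lm = fit_ls, rlm = fit_rlm, lmrob = fit_rob, rq = fit_lad), coef)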
library(MASS)
set.seed(101)
## generate correlated data (positive slope)
dd <- as.data.frame(MASS::mvrnorm(20, mu = c(0, 0),
                                  Sigma = matrix(c(1, 0.95, 0.95, 1), 2)))
dd <- rbind(dd, c(5,-5)) ## add an outlier
library(ggplot2); theme_set(theme_classic())
ggplot(dd, aes(V1, V2)) +
  geom_point() +
  geom_smooth(method = "lm") +                 ## ordinary least-squares line
  geom_smooth(method = "rlm", colour = "red")  ## robust line via MASS::rlm