r linear-regression user-defined-functions prediction

R function: multiple linear regression prediction estimate and interval (user-defined function)


I am working on a user-defined function in R that computes the point prediction and the 95% prediction interval from a linear regression. My function reproduces the fit and interval returned by predict.lm() for simple linear regression. However, when I apply it to a multiple linear regression, the interval bounds differ slightly, at the third decimal place, and I cannot explain why. I do not have a strong theoretical mathematics background, so I built the function from the formulas explained on this page: https://daviddalpiaz.github.io/appliedstats/multiple-linear-regression.html

Are there errors in my function, or are the slight differences due to rounding or other marginal errors? Below are the function code and the tests I ran:

predict.reg.95 <- function(lm.model,newdata) {
  if (!inherits(lm.model, "lm")){warning("object is not a lm() model")}
  else{
    n<-length(lm.model$residuals)
    beta <- lm.model$coefficients
    sy<-sigma(lm.model)
    s2x<-var(lm.model$model[,2])
    t.alpha.demi<- qt(0.975, df=n-2)
      if (length(beta)-1==length(newdata)) {
        y.pred <- beta[1]+sum(beta[-1]*newdata)
        x0<-c(1,newdata)
        X<-cbind(c(rep(1,n)),lm.model$model[,-1])
        y.pred.interval.upp<-y.pred+t.alpha.demi*sy*sqrt(1+x0%*%solve(t(X)%*%as.matrix(X))%*%x0)
        y.pred.interval.low<-y.pred-t.alpha.demi*sy*sqrt(1+x0%*%solve(t(X)%*%as.matrix(X))%*%x0)
        fit<-c(y.pred)
        upr<-c(y.pred.interval.upp)
        lwr<-c(y.pred.interval.low)
        output<-cbind(lwr,fit,upr)
        print("Below, you will find the predicted estimate (fit) with the given values of the explanatory variables and the associated prediction interval (lwr,upr)")
        print(output)
      } 
      else {
      print("the length of the chosen explanatory variables vector isn't the same length as the number of explanatory variables of your lm() model")
     }
    }
  }
#1st test
df<-data.frame(x1=c(sample.int(100,50, replace=T)),y=c(sample.int(200,50, replace=T)),x2=c(sample.int(20,50, replace=T)))
lm.model<-lm(y~x1+x2,data=df)
x1<-c(12)
x2<-c(32)
newdata<-as.data.frame(cbind(x1,x2))
new<-c(12,32)

predict.reg.95(lm.model,new)
predict(lm.model, newdata, level=0.95,interval="prediction") #slight difference at the third decimal for the prediction interval between functions

#2nd test
data(Seatbelts)
lm.model<-lm(DriversKilled~kms+drivers,data=Seatbelts)
kms<-c(10000)
drivers<-c(2000)
newdata<-as.data.frame(cbind(kms,drivers))
new<-c(10000,2000)

predict.reg.95(lm.model,new)
predict(lm.model, newdata, level=0.95,interval="prediction") #slight difference at the third decimal for the prediction interval between functions

I hope there is a solution, or that the problem is minor and the function can be used as is.

Respectfully,

Cyril S


Solution

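  • Update: I found the problem. It was the t quantile, t.alpha.demi<- qt(0.975, df=n-2). The residual degrees of freedom are n minus the number of estimated coefficients, so n-2 is only correct for simple linear regression (intercept plus one slope). That explains why the results matched predict() for simple linear regression but not for multiple regression.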

    I changed it to t.alpha.demi<- qt(0.975, df=n-length(beta))

    It was a mistake on my end. Regards, Cyril S
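
For reference, here is a minimal sketch of how the corrected calculation could look and how it can be checked against predict(). The function name predict_reg_sketch, the level argument, and the mtcars example are assumptions added for illustration and are not part of the original post. The sketch only covers unweighted models with numeric predictors, and it takes the degrees of freedom from df.residual() instead of hard-coding them.

```r
# Hypothetical helper (not from the original post): point prediction and
# prediction interval for an unweighted lm() with numeric predictors.
predict_reg_sketch <- function(lm.model, newdata, level = 0.95) {
  stopifnot(inherits(lm.model, "lm"))
  beta <- coef(lm.model)
  if (length(newdata) != length(beta) - 1) {
    stop("newdata must hold one value per explanatory variable")
  }
  X  <- model.matrix(lm.model)      # design matrix, intercept column included
  x0 <- c(1, newdata)               # new observation with the intercept term
  # Residual degrees of freedom: n - length(beta), not n - 2
  t.crit <- qt(1 - (1 - level) / 2, df = df.residual(lm.model))
  fit <- sum(beta * x0)
  # Standard error of prediction: s * sqrt(1 + x0' (X'X)^-1 x0)
  se.pred <- sigma(lm.model) *
    sqrt(1 + drop(t(x0) %*% solve(crossprod(X)) %*% x0))
  cbind(fit = fit, lwr = fit - t.crit * se.pred, upr = fit + t.crit * se.pred)
}

# Check against predict() on a built-in data set with two predictors
model <- lm(mpg ~ wt + hp, data = mtcars)
mine  <- predict_reg_sketch(model, c(3, 110))
ref   <- predict(model, data.frame(wt = 3, hp = 110),
                 level = 0.95, interval = "prediction")
print(mine)
print(ref)
print(all.equal(as.numeric(mine), as.numeric(ref)))
```

With the residual degrees of freedom taken from the model, the two results should agree up to floating-point tolerance. Any remaining difference far beyond the third decimal comes from predict() using a QR decomposition where the sketch inverts X'X directly.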