r, dataframe, prediction, lm

Prediction using a linear model and the importance of data.frame


Why do we need to wrap the new values in data.frame() when predicting with a model fitted by lm?

The first chunk of code below is supposed to be wrong, and the second chunk is supposed to be correct.

dim(iris)
model_1<-lm(Sepal.Length~Sepal.Width, data=iris)
summary(model_1)
print(predict(model_1, Sepal.Width=c(1,3,4,5)))

dim(iris)
model_1<-lm(Sepal.Length~Sepal.Width, data=iris)
summary(model_1)
print(predict(model_1,data.frame(Sepal.Width=c(1,3,4,5))))

Solution

  • When you call predict on an lm object, the function that actually runs is predict.lm. When you call it like this:

    predict(model_1, Sepal.Width=c(1,3,4,5))
    

    you are passing c(1,3,4,5) as an argument named Sepal.Width. predict.lm has no parameter with that name, so the value is absorbed by ... and silently ignored.

    Because no new input data was supplied, this is equivalent to running predict.lm(model_1), and you get back the fitted values for the 150 rows of iris:

    table(predict(model_1) == predict(model_1, Sepal.Width=c(1,3,4,5)))
    
    TRUE 
     150
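
    The same point can be checked more directly. The sketch below assumes only base R and the built-in iris data; the variable names (predict_lm_args, ignored, no_newdata, and so on) are made up for illustration. It looks up the formal arguments of the lm method and compares the three ways of getting the in-sample predictions.

    ```r
    model_1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)

    # Look up the method that predict() dispatches to for an lm object.
    predict_lm_args <- names(formals(getS3method("predict", "lm")))

    # Sepal.Width is not a formal argument, but "..." is, so the stray
    # named argument is accepted without a warning and then ignored.
    has_sepal_width <- "Sepal.Width" %in% predict_lm_args
    has_dots <- "..." %in% predict_lm_args

    ignored <- predict(model_1, Sepal.Width = c(1, 3, 4, 5))
    no_newdata <- predict(model_1)

    # Both calls predict on the training rows, which matches fitted()
    # up to floating-point tolerance.
    same_as_no_newdata <- identical(ignored, no_newdata)
    same_as_fitted <- isTRUE(all.equal(no_newdata, fitted(model_1)))

    print(c(has_sepal_width, has_dots, same_as_no_newdata, same_as_fitted))
    ```

    all.equal is used for the comparison with fitted() because the two results may be computed along different code paths and need not agree to the last bit.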
    

    Because you fitted the model with a formula, predict.lm needs a data frame whose column names match the variables in that formula. It uses the data frame to rebuild the design matrix (the matrix of independent, or exogenous, variables), multiplies that matrix by the coefficients, and returns the predicted values.

    Roughly, this is what predict.lm does:

    newdata = data.frame(Sepal.Width=c(1,3,4,5))
    Terms = delete.response(terms(model_1))
    X = model.matrix(Terms,newdata)
    
    X
      (Intercept) Sepal.Width
    1           1           1
    2           1           3
    3           1           4
    4           1           5
    
    X %*% coefficients(model_1)
          [,1]
    1 6.302861
    2 5.856139
    3 5.632778
    4 5.409417
    
    predict(model_1,newdata)
    
           1        2        3        4 
    6.302861 5.856139 5.632778 5.409417
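
    To confirm that the manual route and predict agree, here is a minimal self-contained sketch of the steps above. The names manual and via_predict are illustrative, and passing the data frame through the named newdata argument is a stylistic choice rather than a requirement.

    ```r
    model_1 <- lm(Sepal.Length ~ Sepal.Width, data = iris)

    # The column name must match the predictor used in the formula.
    newdata <- data.frame(Sepal.Width = c(1, 3, 4, 5))

    # Rebuild the design matrix for the new rows from the model's terms,
    # dropping the response because newdata has no Sepal.Length column.
    X <- model.matrix(delete.response(terms(model_1)), newdata)

    # Manual prediction: design matrix times coefficients, as a plain vector.
    manual <- drop(X %*% coef(model_1))

    # The same prediction through predict(), naming newdata explicitly.
    via_predict <- predict(model_1, newdata = newdata)

    print(all.equal(unname(manual), unname(via_predict)))
    ```

    Naming the argument (newdata = ...) makes it harder to repeat the original mistake, since a misspelled or misplaced argument name would otherwise be swallowed without complaint.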