R - Determine goodness of fit of new data with predict function based on existing lm


I am trying to apply an existing model to a new data set, and I am wondering what an elegant way to determine the goodness of fit on that new data would look like. I will try to explain it with an example.

Basically, I run a regression and obtain a model. With the summary function I obtain the usual output such as adjusted R-squared, p-value etc.

    model.lm <- lm(Sepal.Length ~ Petal.Length, data = iris[1:75,])
    summary(model.lm)

Now I want to run the predict function on new data and I am curious to know how the model performs on the new data.

    pred.dat <- predict(model.lm, newdata = iris[76:150,])

How can I get, for instance, an adjusted R-squared for the predicted values on the new data? Is there something similar to the summary function? Ideally, I would like to find out what the best practice is for assessing the goodness of fit of an existing model on new data.

Many thanks


Solution

  • You can translate the formula for R-squared into a function, such as:

    # vals: observed values, preds: predicted values
    # Note: the denominator is centred on mean(preds), not mean(vals)
    r_squared <- function(vals, preds) {
      1 - (sum((vals - preds)^2) / sum((vals - mean(preds))^2))
    }
    # Test
    r_squared(iris[76:150,]$Sepal.Length, pred.dat)
    #[1] 0.5675686
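
    A common convention centres the total sum of squares on the mean of the observed values instead. The two agree in-sample for an lm with an intercept, but they can differ on new data. Below is a minimal sketch of that variant, assuming the same split as in the question. The names r_squared_obs, new.dat, r2_new and r2_train are hypothetical and introduced only for illustration.

    ```r
    # Sketch: R-squared centred on the mean of the observed values.
    # r_squared_obs is a hypothetical helper name, not part of base R.
    r_squared_obs <- function(vals, preds) {
      sse <- sum((vals - preds)^2)        # residual sum of squares
      sst <- sum((vals - mean(vals))^2)   # total sum of squares around mean(vals)
      1 - sse / sst
    }

    # Same split as in the question: fit on rows 1-75, evaluate on rows 76-150
    model.lm <- lm(Sepal.Length ~ Petal.Length, data = iris[1:75, ])
    new.dat  <- iris[76:150, ]
    pred.dat <- predict(model.lm, newdata = new.dat)

    # Out-of-sample value; it can be negative if the model does worse than
    # simply predicting the mean of the new observations
    r2_new <- r_squared_obs(new.dat$Sepal.Length, pred.dat)

    # In-sample sanity check against the value reported by summary()
    r2_train <- r_squared_obs(iris[1:75, ]$Sepal.Length, fitted(model.lm))
    all.equal(r2_train, summary(model.lm)$r.squared)
    ```

    Checking the helper in-sample first, where summary() provides a reference value, is a cheap way to confirm the formula before trusting the number it gives on new rows.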
    

    Building on this function and the standard formula, adjusted R-squared can be defined as:

    # k: number of predictors, n: number of observations being evaluated
    r_squared_a <- function(vals, preds, k) {
      n <- length(preds)
      1 - ((1 - r_squared(vals, preds)) * (n - 1)) / (n - k - 1)
    }
    

    Here k is the number of predictors, so for this one-predictor model:

    r_squared_a(iris[76:150,]$Sepal.Length, pred.dat, 1)
    #[1] 0.5616448
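
    One way to gain some confidence in r_squared_a is to apply it to the training rows, where summary() reports a reference value. The sketch below does that. It also computes a root mean squared error on the new rows, since a plain error measure is often reported alongside R-squared for held-out data. The names train, test, adj_train, rmse and rmse_new are hypothetical and used only for illustration.

    ```r
    # Helpers as defined in the answer
    r_squared <- function(vals, preds) {
      1 - (sum((vals - preds)^2) / sum((vals - mean(preds))^2))
    }
    r_squared_a <- function(vals, preds, k) {
      1 - ((1 - r_squared(vals, preds)) * (length(preds) - 1)) / (length(preds) - k - 1)
    }

    train <- iris[1:75, ]
    test  <- iris[76:150, ]
    model.lm <- lm(Sepal.Length ~ Petal.Length, data = train)

    # In-sample check: with one predictor (k = 1) the helper should reproduce
    # the adjusted R-squared that summary() reports for the training rows
    adj_train <- r_squared_a(train$Sepal.Length, fitted(model.lm), 1)
    all.equal(adj_train, summary(model.lm)$adj.r.squared)

    # Held-out error in the units of Sepal.Length
    # rmse is a hypothetical helper name, not a base R function
    rmse <- function(vals, preds) sqrt(mean((vals - preds)^2))
    rmse_new <- rmse(test$Sepal.Length, predict(model.lm, newdata = test))
    ```

    The in-sample comparison works because, for an lm with an intercept, the fitted values have the same mean as the observed values, so centring on mean(preds) gives the same denominator there.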