r linear-regression data-analysis cross-validation k-fold

Generic Function for K-Fold Cross-Validation In R for Linear Models


Hi guys, I need help troubleshooting the function below. I am using R.

The dataset I am using is called wages, and I load it with library(ISLR) followed by data(wages).

Anyhow, I am trying to develop a function that performs k-fold cross-validation on any general linear model.

The function takes these arguments: function(numberOfFolds, y, x, inputData)

  • y is the dependent variable.
  • x is all the other variables in the dataset.
  • inputData is the dataset (here, wages).
  • numberOfFolds is k, the number of folds.

I have written the code below, but it returns NaN values and I am not sure what is going wrong. Could someone please help?

my.k.fold.1<- function(numberOfFolds, y,x,inputData){
  index<-sample(1:numberOfFolds, nrow(inputData), replace = T)
  inputData$index<-index
  
  mse<-vector('numeric', length = numberOfFolds)
  for (n in 1:numberOfFolds) {
    data.train<-inputData[index!=n,]
    data.test<-inputData[index==n,]
    my.equation<-paste(y,paste(x, collapse = '+'),sep='~')
    formula.1<-formula(my.equation)
    model.test<-lm(formula.1, data = data.train)
    predictions<-predict(model.test, newdata=data.test)
    mse[[n]]<-mean((data.test$y-predictions)^2)
  }
  return(mse)
}

my.k.fold.1(numberOfFolds = 5, y='earn', x=c('race', 'sex', 'ed', 'height', 'age'), inputData = wages)

I would like to keep the arguments the same, so that I can pass the column names as strings in y and x.


Solution

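  • This is because y holds a string, and $ does not evaluate what follows it: data.test$y is equivalent to data.test[["y"]], which looks for a column literally named y. No such column exists, so the result is NULL, the squared differences form an empty vector, and the mean of an empty vector is NaN. Replace it with data.test[[y]], which is equivalent to data.test$earn when y = "earn":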

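    my.k.fold.1 <- function(numberOfFolds, y, x, inputData) {
      # Randomly assign each row to one of the folds
      index <- sample(1:numberOfFolds, nrow(inputData), replace = TRUE)
      inputData$index <- index

      # Build the model formula once, e.g. earn ~ race + sex + ed + height + age
      my.equation <- paste(y, paste(x, collapse = '+'), sep = '~')
      formula.1 <- formula(my.equation)

      mse <- vector('numeric', length = numberOfFolds)
      for (n in 1:numberOfFolds) {
        # Fit on every fold except n, then evaluate on fold n
        data.train <- inputData[index != n, ]
        data.test <- inputData[index == n, ]
        model.test <- lm(formula.1, data = data.train)
        predictions <- predict(model.test, newdata = data.test)
        # y is a column name stored as a string, so use [[y]] rather than $y
        mse[[n]] <- mean((data.test[[y]] - predictions)^2)
      }
      return(mse)
    }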
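
    To see the difference between $ and [[ in isolation, here is a minimal sketch. The data frame toy and the vector preds are made-up stand-ins, not the real wages data or real model output; only the indexing behaviour is the point.

    ```r
    # Toy data frame standing in for the wages data (values are made up).
    toy <- data.frame(earn = c(10, 20, 30), height = c(60, 65, 70))
    y <- "earn"

    # `$` takes the name literally, so this looks for a column called "y".
    literal <- toy$y       # NULL: there is no such column

    # `[[` evaluates its argument, so this looks up the column named by y.
    by_value <- toy[[y]]   # same as toy$earn

    # Hypothetical predictions, just to reproduce the arithmetic.
    preds <- c(11, 19, 31)

    # NULL - preds is a zero-length numeric vector, and its mean is NaN.
    bad_mse <- mean((literal - preds)^2)

    # With the real column, the squared errors are all 1.
    good_mse <- mean((by_value - preds)^2)

    print(bad_mse)
    print(good_mse)
    ```

    Note that R raises no error or warning for the missing column here, which is why the original function ran to completion and simply returned NaN for every fold.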