Search code examples
rlinear-regressioncross-validationr-caretstandard-error

R - k-fold cross-validation for linear regression with standard error of estimate


I would like to perform k-fold cross-validation in R for a linear regression model and test the one standard error rule:

https://stats.stackexchange.com/questions/17904/one-standard-error-rule-for-variable-selection

Thus, I need a function which gives me back the cross-validation estimate of the prediction error and the standard error of this estimate (or at least the MSE for each fold, so that I can compute the standard error myself). Many packages have functions which compute the cross-validation error (for example, cv.glm in the boost package), but usually they return only the CV estimate of the prediction error and not its standard error, or the MSE for each fold.

I tried using package DAAG, whose function CVlm should give a richer output than cv.glm. However, I can't seem to make it work! Here is my code:

a=c(0.0056, 0.0088, 0.0148, 0.0247, 0.0392, 0.0556, 0.0632, 0.0686, 0.0786, 0.0855, 0.0937)
b=c(6.0813, 9.5011, 15.5194, 23.9409, 32.8492, 40.8399, 43.8760, 45.5270, 46.7668, 46.1587, 43.4524)
dataset=data.frame(x=a,y=b)
CV.list=CVlm(df=dataset,form.lm = formula(y ~ poly(x,2)), m=5)

I get the hardly informative error

Error in xy.coords(x, y, xlabel, ylabel, log) : 
'x' and 'y' lengths differ 

which doesn't make much sense to me. x and y are the same length (11), so clearly the function is complaining about some other x,y variables it created internally.

I'd gladly accept solutions with other packages (for example caret). Also, it would be great if I could specify a number of repetitions for the k-fold cross-validation.


Solution

  • CVlm doesn't like the poly(x,2) in your formula. You can easily circumvent that by adding the result of poly(x,2) first in your datatable and calling CVlm on these new variables:

    dataset2 <- cbind(dataset,poly(dataset$x,2))
    names(dataset2)[3:4] <- c("p1","p2")
    CV.list=CVlm(df=dataset2,form.lm = formula(y ~ p1+p2))
    

    And as you are interested by the values printed that are unfortunately not saved anywhere you can use something like:

    # captures the printed output
    printOut <- capture.output(CV.list=CVlm(df=dataset2,form.lm = formula(y ~ p1+p2)))
    
    # function to parse the output 
    # to be adapted if necessary for your needs
    GetValues <- function(itemName,printOut){
        line <- printOut[grep(itemName,printOut)]
        items <- unlist(strsplit(line,"[=]|  +"))
        itemsMat <- matrix(items,ncol=2,byrow=TRUE)
        vectVals <- as.numeric(itemsMat[grep(itemName,itemsMat[,1]),2])
        return(vectVals)
    }
    
    # get the Mean square values as a vector
    MS <- GetValues("Mean square",printOut)