Search code examples
rcross-validationnlsnon-linear-regression

Cross-validation for non-linear regression using nls in R


The problem:

I have a dataset inputAll.data. I want to use 80% of the data as model construction input and validate the model on the remaining 20% of data.

I have manually split the dataset into two smaller datasets input80.data and input20.data containing 80% and 20% of the data respectively.

Format of data in my datasets:

Name      xvalues     yvalues
Prog1     0.654219    59.70282
Prog2     0.149516    49.59548
Prog3     0.50577     50.53859
Prog4     0.77783     59.95499
Prog5     0.237923    49.61133
Prog6     0.756063    50.63021
Prog7     0.015625    53.77959

I am using 80% of the data to construct a non-linear regression model using nls.

df = data.frame(input80.data)
yval = df$yvalues
xval = df$xvalues
model1 = nls(formula = yval ~ exp(xval + beta * xval), start = list(beta = 0))
sm1 = summary(model1)
fit1 = fitted.values(model1)

I am taking the remaining 20% data to obtain predicted values. I saved a copy of this data which contains the actual y values in another file called input20Actual.data, but input20.data only contains the x values.

dfNew = data.frame(input20.data)
xpred = dfNew$xvalues
dfVerify = data.frame(input20Actual.data)
yverify = dfVerify$yvalues
xverify = dfVerify$xvalues

obtainedPred = predict(model1, data.frame(xvalues = c(xpred) ))

I am then using a custom function called RMSE to calculate the error between the prediction and the actual value.

RMSE <- function(fitted, actual){
  sqrt(mean((fitted - actual)^2))
}

The error calculation is done by taking each predicted value and comparing it to the actual value that I had stored in input20Actual.data. I am storing the output in a file.

sink("ErrorsOut.txt")
cat("\n\nRMSE:\n")
for (i in 1:13) {
    #There are 13 values to be predicted in input20.data
    corr = obtainedPred[[i]]
    act = yverify[[i]]
    err = RMSE(act, corr)
    cat(err)
    cat(" ")
}
cat("\n")
sink()

The problem is that I have split the input set manually. I would like to automate this, and do the same thing for different splits (different data each time) and obtain an average of the calculated errors.

What I tried:

I have read on StackOverflow about cross-validation in R. My understanding is that it iteratively takes some % of data for model creation and the remaining for testing. If I can use a cross validation function in nls, I don't have to split my input data into two files.

I have searched on SO a lot for a solution. Many answers about cross-validation were for lm. But I specifically require cross-validation for nls. I also read about the caret package, but I have tried to install it and but most of the time I end up getting package installation errors, like the one below:

Warning: dependency ‘plyr’ is not available
package ‘plyr’ is not available (for R version 3.0.2)

So I was hoping there was a direct way to perform cross-validation (in rkward) without installing more packages. Is there a function or API in R that I can use for iteratively creating models and testing them?

Please note that I am a complete newbie to R. Sorry if this is an obvious question.


Solution

  • Using the builtin data frame BOD try the simple model shown in fo below. First use sample to get the indexes of the in-sample rows and run the model on those. predict.nls is then used to get the predicted values using the out-of-sample data with the in-sample model. From that the residual sum of squares (RSS) and other results can be calculated. Each time this is run sample will generate a possibly different set of indexes (provided set.seed is not rerun). This could be packaged in a function and run repeatedly. No packages are used.

    set.seed(123) # for reproducibility
    
    n <- nrow(BOD)
    frac <- 0.8
    ix <- sample(n, frac * n) # indexes of in sample rows
    
    fo <- demand ~ a + Time * b
    fm <- nls(fo, BOD, start = c(a = 0, b = 0), subset = ix) # in sample model
    
    BOD.out <- BOD[-ix, ] # out of sample data
    pred <- predict(fm, new = BOD.out)
    act <- BOD.out$demand
    RSS <- sum( (pred - act)^2 )
    RSS