The problem:
I have a dataset inputAll.data
. I want to use 80% of the data as model construction input and validate the model on the remaining 20% of data.
I have manually split the dataset into two smaller datasets input80.data
and input20.data
containing 80% and 20% of the data respectively.
Format of data in my datasets:
Name xvalues yvalues
Prog1 0.654219 59.70282
Prog2 0.149516 49.59548
Prog3 0.50577 50.53859
Prog4 0.77783 59.95499
Prog5 0.237923 49.61133
Prog6 0.756063 50.63021
Prog7 0.015625 53.77959
I am using 80% of the data to construct a non-linear regression model using nls
.
df = data.frame(input80.data)
yval = df$yvalues
xval = df$xvalues
model1 = nls(formula = yval ~ exp(xval + beta * xval), start = list(beta = 0))
sm1 = summary(model1)
fit1 = fitted.values(model1)
I am taking the remaining 20% data to obtain predicted values. I saved a copy of this data which contains the actual y
values in another file called input20Actual.data
, but input20.data
only contains the x
values.
dfNew = data.frame(input20.data)
xpred = dfNew$xvalues
dfVerify = data.frame(input20Actual.data)
yverify = dfVerify$yvalues
xverify = dfVerify$xvalues
obtainedPred = predict(model1, data.frame(xvalues = c(xpred) ))
I am then using a custom function called RMSE
to calculate the error between the prediction and the actual value.
RMSE <- function(fitted, actual){
sqrt(mean((fitted - actual)^2))
}
The error calculation is done by taking each predicted value and comparing it to the actual value that I had stored in input20Actual.data
. I am storing the output in a file.
sink("ErrorsOut.txt")
cat("\n\nRMSE:\n")
for (i in 1:13) {
#There are 13 values to be predicted in input20.data
corr = obtainedPred[[i]]
act = yverify[[i]]
err = RMSE(act, corr)
cat(err)
cat(" ")
}
cat("\n")
sink()
The problem is that I have split the input set manually. I would like to automate this, and do the same thing for different splits (different data each time) and obtain an average of the calculated errors.
What I tried:
I have read on StackOverflow about cross-validation in R. My understanding is that it iteratively takes some % of data for model creation and the remaining for testing. If I can use a cross validation function in nls
, I don't have to split my input data into two files.
I have searched on SO a lot for a solution. Many answers about cross-validation were for lm
. But I specifically require cross-validation for nls
. I also read about the caret
package, but I have tried to install it and but most of the time I end up getting package installation errors, like the one below:
Warning: dependency ‘plyr’ is not available
package ‘plyr’ is not available (for R version 3.0.2)
So I was hoping there was a direct way to perform cross-validation (in rkward) without installing more packages. Is there a function or API in R that I can use for iteratively creating models and testing them?
Please note that I am a complete newbie to R. Sorry if this is an obvious question.
Using the builtin data frame BOD
try the simple model shown in fo
below. First use sample
to get the indexes of the in-sample rows and run the model on those. predict.nls
is then used to get the predicted values using the out-of-sample data with the in-sample model. From that the residual sum of squares (RSS) and other results can be calculated. Each time this is run sample
will generate a possibly different set of indexes (provided set.seed
is not rerun). This could be packaged in a function and run repeatedly. No packages are used.
set.seed(123) # for reproducibility
n <- nrow(BOD)
frac <- 0.8
ix <- sample(n, frac * n) # indexes of in sample rows
fo <- demand ~ a + Time * b
fm <- nls(fo, BOD, start = c(a = 0, b = 0), subset = ix) # in sample model
BOD.out <- BOD[-ix, ] # out of sample data
pred <- predict(fm, new = BOD.out)
act <- BOD.out$demand
RSS <- sum( (pred - act)^2 )
RSS