I have a train-test function set up in R that takes a set of data, excludes a certain portion of it (to prevent the model from overfitting the data), and then trains a linear model on about half of the remaining data before testing it on the other half.
I should note that the data are based on PCA scores, which is why the linear model includes the seven PCA components.
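For context, the scores data frame is built roughly along these lines (a sketch only; dat stands in for the raw predictor matrix and is not part of the actual script):
# sketch: 'dat' is a placeholder for the raw numeric predictor matrix
pca = prcomp(dat, center = TRUE, scale. = TRUE)
scores = data.frame(y = y, pca$x[, 1:7])  # response plus the first seven PCA components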
library(caret)  # for createDataPartition

splitprob = 0.7
trainindex = createDataPartition(scores$y, p = splitprob, list = FALSE)
trainingset = scores[trainindex, ]
testset = scores[-trainindex, ]

model = glm(y ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7, data = trainingset)
summary(model)

# predict on the held-out test set rather than the training data
prediction = predict(model, newdata = testset, se.fit = TRUE)
Now, what I want to do is run this script multiple times, produce multiple models, and then pick one or more of those models to use for future predictions. Although I have set up the function to run a certain number of times, I do not know how to compare the different models against one another (probably by using the AIC), nor am I sure how to capture the parameters of the model(s) in order to export them to a text file or a .csv file.
I tried implementing the glmulti package, but due to various problems with Java, rJava, and Mac OS X, I have had massive problems getting it to install properly. Could anyone recommend another approach to this problem?
I updated the answer so that different predictors can be included each time.
# build data
set.seed(1)
y = runif(100)
x = matrix(runif(1000), nrow = 100, ncol = 10)
A function for repeated train-test splits, which returns summary statistics over all iterations (it takes the data, the number of iterations, a formula, and the train-test split percentage),
glm_train_test = function(y, x, num_iters, Formula, split_percentage = 0.5) {

  aic_models = matrix(0, nrow = num_iters, ncol = 2)
  rmse_models = matrix(0, nrow = num_iters, ncol = 2)

  for (i in 1:num_iters) {

    # train-test split
    spl = sample(nrow(x), round(nrow(x) * split_percentage), replace = FALSE)
    train = data.frame(y = y[spl], x[spl, ])
    test = data.frame(y = y[-spl], x[-spl, ])

    tmp_formula = as.formula(Formula)
    fit = glm(formula = tmp_formula, data = train)

    # pred_train = predict(fit, newdata = train)
    pred_test = predict(fit, newdata = test)

    aic_models[i, ] = c(i, summary(fit)$aic)
    rmse_models[i, ] = c(i, Metrics::rmse(y[-spl], pred_test))  # requires the 'Metrics' package
  }

  # convert the resulting aic and rmse matrices to data frames
  sort_aic = as.data.frame(aic_models)
  colnames(sort_aic) = c('iteration', 'aic_value')

  sort_rmse = as.data.frame(rmse_models)
  colnames(sort_rmse) = c('iteration', 'rmse_value')

  # summary statistics (plus standard deviation) across all iterations
  tmp_aic = c(summary(sort_aic[, 2]), sd(sort_aic[, 2]))
  names(tmp_aic) = c(names(summary(sort_aic[, 2])), 'std.dev')

  tmp_rmse = c(summary(sort_rmse[, 2]), sd(sort_rmse[, 2]))
  names(tmp_rmse) = c(names(summary(sort_rmse[, 2])), 'std.dev')

  return(list(summary_aic = tmp_aic, summary_rmse = tmp_rmse))
}
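Since you also want to keep one or more fitted models around for future predictions, one possible extension (a sketch only; glm_train_test_keep is a hypothetical variant, not part of the function above) is to store each fit together with its AIC and then pick the minimum,
# sketch: keep each fitted model so the best (lowest-AIC) one can be reused
glm_train_test_keep = function(y, x, num_iters, Formula, split_percentage = 0.5) {
  fits = vector('list', num_iters)
  aics = numeric(num_iters)
  for (i in 1:num_iters) {
    spl = sample(nrow(x), round(nrow(x) * split_percentage), replace = FALSE)
    train = data.frame(y = y[spl], x[spl, ])
    fits[[i]] = glm(as.formula(Formula), data = train)
    aics[i] = summary(fits[[i]])$aic
  }
  best = which.min(aics)  # lowest AIC wins
  list(best_model = fits[[best]], best_aic = aics[best], all_aic = aics)
}
The returned best_model can then be passed straight to predict() on new data.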
first model including only three predictors,
first_mod = glm_train_test(y, x, num_iters = 100, Formula = "y ~ X1 + X2 + X3", split_percentage = 0.5)
example output
first_mod
$summary_aic
Min. 1st Qu. Median Mean 3rd Qu. Max. std.dev
-6.193321 10.527612 13.317441 13.320968 17.019571 26.792904 5.798970
$summary_rmse
Min. 1st Qu. Median Mean 3rd Qu. Max. std.dev
0.23862643 0.26628405 0.27568030 0.27713722 0.28616462 0.33223873 0.01730974
second model, which also includes an interaction,
second_model = glm_train_test(y, x, num_iters = 100, Formula = "y ~ X1 * X4 + X2 + X3", split_percentage = 0.5)
example output for the second model,
second_model
$summary_aic
Min. 1st Qu. Median Mean 3rd Qu. Max. std.dev
-0.04232767 13.37572489 16.92720680 16.81206625 20.79921756 29.67830243 5.96477080
$summary_rmse
Min. 1st Qu. Median Mean 3rd Qu. Max. std.dev
0.23912727 0.27026791 0.28117878 0.28177088 0.29526944 0.32674985 0.01819308
You can use an evaluation metric that is appropriate for your data, and you can also have a look at the difference between AIC and BIC, which is discussed in another Stack Overflow question.
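As for exporting the results, the per-iteration data frames built inside the function (e.g. sort_aic) can be written out with write.csv, assuming you modify the function to also return them or rebuild them outside; coefficients of a fitted glm can be exported the same way. A minimal sketch, where fit stands for any fitted glm object,
# sketch: export per-iteration AIC values and model coefficients to .csv
write.csv(sort_aic, file = 'aic_per_iteration.csv', row.names = FALSE)
write.csv(as.data.frame(coef(summary(fit))), file = 'model_coefficients.csv')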