Tags: python, r, modeling, training-data

How can I create multiple data models, then pick the model(s) that evaluate as the best-fitting?


I have a train-test function set up in R that takes a set of data, excludes a certain portion of it (to guard against the model overfitting the data), trains a linear model on about half of the remaining data, and then tests the model on the other half.

I should note that the data are PCA scores, which is why the linear model includes the seven PCA components.

library(caret)  # for createDataPartition

splitprob = 0.7
trainindex = createDataPartition(scores$y, p = splitprob, list = FALSE)
trainingset = scores[trainindex, ]
testset = scores[-trainindex, ]
model = glm(y ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7, data = trainingset)
summary(model)
prediction = predict(model, newdata = testset, se.fit = TRUE)

Now, what I want to do is run this script multiple times, produce multiple models, and then pick one or more of them to use for future predictions. Although I have set the function up to run a certain number of times, I do not know how to compare the different models against one another (probably by using the AIC), nor how to capture the parameters of the model(s) so that I can export them to a text file or a .csv file.
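For illustration, something along these lines is roughly what I am after (this loop is only a sketch of the idea, reusing the `scores` data frame from my snippet above):

```r
library(caret)  # for createDataPartition

# sketch: fit the model several times, record the AIC and the fitted
# coefficients of each run, then export everything to a .csv file
n_runs = 10
results = vector("list", n_runs)

for (i in 1:n_runs) {
  trainindex = createDataPartition(scores$y, p = 0.7, list = FALSE)
  trainingset = scores[trainindex, ]
  model = glm(y ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7, data = trainingset)
  # one row per run: run id, AIC, then the fitted coefficients
  results[[i]] = c(run = i, aic = AIC(model), coef(model))
}

results = as.data.frame(do.call(rbind, results))
results = results[order(results$aic), ]   # lowest-AIC runs first
write.csv(results, "model_runs.csv", row.names = FALSE)
```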

I tried implementing the glmulti package, but due to various problems with Java, rJava, and Mac OS X, I have had massive problems getting it to install properly. Could anyone recommend another approach to this problem?


Solution

  • I updated the answer so that different predictors can be included each time.

    # build data
    
    set.seed(1)
    y = runif(100)
    x = matrix(runif(1000), nrow = 100, ncol = 10)
    

    A function for repeated train-test splits, which returns summary statistics (AIC and test RMSE) over all iterations; it takes the data, the number of iterations, a formula, and the train-test split percentage:

    glm_train_test = function(y, x, num_iters, Formula, split_percentage = 0.5) {
    
      aic_models = matrix(0, nrow = num_iters, ncol = 2)
      rmse_models = matrix(0, nrow = num_iters, ncol = 2)
    
      for (i in 1:num_iters) {                              # train-test splits
    
        spl = sample(nrow(x), round(nrow(x) * split_percentage), replace = FALSE)
        train = data.frame(y = y[spl], x[spl, ])
        test = data.frame(y = y[-spl], x[-spl, ])
    
        tmp_formula = as.formula(Formula)
        fit = glm(formula = tmp_formula, data = train)
    
        # pred_train = predict(fit, newdata = train)
        pred_test = predict(fit, newdata = test)
    
        aic_models[i, ] = c(i, summary(fit)$aic)
        rmse_models[i, ] = c(i, Metrics::rmse(y[-spl], pred_test))
      }
    
      # convert the resulting aic and rmse matrices to data frames
      sort_aic = as.data.frame(aic_models)
      colnames(sort_aic) = c('iteration', 'aic_value')
    
      sort_rmse = as.data.frame(rmse_models)
      colnames(sort_rmse) = c('iteration', 'rmse_value')
    
      # summary statistics (plus the standard deviation) across iterations
      tmp_aic = c(summary(sort_aic[, 2]), sd(sort_aic[, 2]))
      names(tmp_aic) = c(names(summary(sort_aic[, 2])), 'std.dev')
    
      tmp_rmse = c(summary(sort_rmse[, 2]), sd(sort_rmse[, 2]))
      names(tmp_rmse) = c(names(summary(sort_rmse[, 2])), 'std.dev')
    
      return(list(summary_aic = tmp_aic, summary_rmse = tmp_rmse))
    }
    

    First model, including only three predictors:

    first_mod = glm_train_test(y, x, num_iters = 100, Formula = "y ~ X1 + X2 + X3", split_percentage = 0.5)
    

    Example output:

    first_mod
    
    $summary_aic
         Min.   1st Qu.    Median      Mean   3rd Qu.      Max.   std.dev 
    -6.193321 10.527612 13.317441 13.320968 17.019571 26.792904  5.798970 
    
    $summary_rmse
          Min.    1st Qu.     Median       Mean    3rd Qu.       Max.    std.dev 
    0.23862643 0.26628405 0.27568030 0.27713722 0.28616462 0.33223873 0.01730974 
    

    Second model, which also includes an interaction:

    second_model = glm_train_test(y, x, num_iters = 100, Formula = "y ~ X1 * X4 + X2 + X3", split_percentage = 0.5)
    

    Example output for the second model:

    second_model
    
    $summary_aic
           Min.     1st Qu.      Median        Mean     3rd Qu.        Max.     std.dev 
    -0.04232767 13.37572489 16.92720680 16.81206625 20.79921756 29.67830243  5.96477080 
    
    $summary_rmse
          Min.    1st Qu.     Median       Mean    3rd Qu.       Max.    std.dev 
    0.23912727 0.27026791 0.28117878 0.28177088 0.29526944 0.32674985 0.01819308
    

    Use an evaluation metric that is appropriate for your data; you can also have a look at the difference between AIC and BIC, which is discussed in another Stack Overflow question.
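    For instance, to choose between the two configurations above, one simple (though not the only) rule is to prefer the model with the lower mean out-of-sample RMSE; a minimal sketch using the summaries returned by `glm_train_test()`:

```r
# compare the two models on mean test RMSE (lower is better);
# first_mod and second_model are the outputs of glm_train_test() above
if (first_mod$summary_rmse["Mean"] < second_model$summary_rmse["Mean"]) {
  cat("the first model wins on mean test RMSE\n")
} else {
  cat("the second model wins on mean test RMSE\n")
}

# the same idea works with summary_aic; for BIC, replace
# summary(fit)$aic inside the loop with stats::BIC(fit)
```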