
Tune a GAM model with a for loop


I need to fit a GAM for the variable "Life_expectancy" using three predictors: "Adult_Mortality", "HIV_AIDS" and "Schooling". To tune the model, I need to find the best combination of degrees of freedom for the three smooth terms. To do that, I need to nest three for loops, one for i, one for j and one for k, and run the following command for each combination:

gam.fit <- gam(Life_expectancy ~ s(Adult_Mortality, df = i) + s(HIV_AIDS, df = j) + s(Schooling, df = k), data = train)

then calculate the test error for each combination of i, j, k, and in the end choose the model with the lowest test error. I tried doing this with this code:

test.err <- rep(0, 8)
for (i in 3:10) {
  for (j in 3:10) {
    for (k in 3:10) {
      gam.fit <- gam(Life_expectancy ~ s(Adult_Mortality, df = i) +
                       s(HIV_AIDS, df = j) +
                       s(Schooling, df = k),
                     data = train)
      gam.pred <- predict(gam.fit, test)
      test.err[i - 2] <- mean((test$Life_expectancy - gam.pred)^2)
    }
  }
}

but this only yields 8 test errors, one for each value of i from 3 to 10, because each (j, k) iteration overwrites the same slot. How can I store the test error for every combination of i, j, k?


Solution

  • The code can be modified to store the errors in a three-dimensional array indexed by i, j and k:

    test.err <- array(0, c(8, 8, 8))
    for (i in 3:10) {
      for (j in 3:10) {
        for (k in 3:10) {
          gam.fit <- gam(Life_expectancy ~ s(Adult_Mortality, df = i) +
                           s(HIV_AIDS, df = j) +
                           s(Schooling, df = k),
                         data = train)
          gam.pred <- predict(gam.fit, test)
          test.err[i - 2, j - 2, k - 2] <- mean((test$Life_expectancy - gam.pred)^2)
        }
      }
    }
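Once the array is filled, the winning combination can be read off with `which(..., arr.ind = TRUE)`. A minimal sketch, using a synthetic error array as a stand-in for the `test.err` produced by the loops above:

```r
# Recover the (i, j, k) with the lowest test error from the 8x8x8 array.
# A synthetic random array stands in for the real test.err here.
set.seed(1)
test.err <- array(runif(8^3), c(8, 8, 8))

idx <- which(test.err == min(test.err), arr.ind = TRUE)  # array indices in 1..8
best.df <- idx + 2  # shift back to the df values 3..10 used in the loops
best.df
```

The `+ 2` simply undoes the `i - 2` index shift used when filling the array.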
    

    A couple of notes about the method:

    1. You haven't said which gam function you are using; there are functions of that name in the packages gam and mgcv, and probably others. The latter can estimate appropriate degrees of freedom itself from the training set.
    2. You are choosing the degrees of freedom based on the fit to the test dataset, which to some extent defeats the purpose of keeping separate training and test sets.
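To illustrate note 1: with mgcv, the `s()` terms need no `df` argument, and the smoothness of each term is estimated during fitting. A sketch on invented data (the data frame and coefficients below are made up for the example; the real train set would be used instead):

```r
library(mgcv)  # ships with R as a recommended package

# Invented training data standing in for the question's train set
set.seed(42)
n <- 200
train <- data.frame(
  Adult_Mortality = runif(n, 1, 500),
  HIV_AIDS        = runif(n, 0, 20),
  Schooling       = runif(n, 5, 20)
)
train$Life_expectancy <- 70 - 0.02 * train$Adult_Mortality -
  0.5 * train$HIV_AIDS + 0.8 * train$Schooling + rnorm(n)

# No df argument: mgcv chooses the smoothness of each term (REML here)
fit <- gam(Life_expectancy ~ s(Adult_Mortality) + s(HIV_AIDS) + s(Schooling),
           data = train, method = "REML")
summary(fit)$edf  # estimated effective degrees of freedom per smooth term
```

This replaces the grid search entirely, at the cost of trusting mgcv's smoothness selection rather than the test-set error.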