Tags: r, machine-learning, cross-validation, h2o

Cross-Validation Across Models in h2o in R


I am planning to run glm, lasso, and randomForest across different sets of predictors to see which model combination is best. I am going to be doing v-fold cross-validation. To compare the ML algorithms consistently, the same folds have to be fed into each of the ML algorithms. Correct me if I am wrong here.

How can we achieve that with the h2o package in R? Should I set

  • fold_assignment = Modulo within each algo function, such as h2o.glm(), h2o.randomForest(), etc.?
  • If so, would the training set be split the same way across the ML algos?

If I use fold_assignment = Modulo, what if I also have to stratify on my outcome? Stratification is set through the fold_assignment parameter as well, so I am not sure I can specify Modulo and Stratified at the same time.

Alternatively, if I set the same seed in each of the models, would they all get the same folds as input?
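To make this concrete, here is roughly what I am considering (a sketch only; train, x, and y are placeholders for my frame and variable names):

```r
library(h2o)
h2o.init()

# Same nfolds, fold_assignment, and seed for every algorithm --
# is this enough to guarantee identical folds across models?
glm_fit <- h2o.glm(x = x, y = y, training_frame = train,
                   nfolds = 5, fold_assignment = "Modulo", seed = 1234)
rf_fit  <- h2o.randomForest(x = x, y = y, training_frame = train,
                            nfolds = 5, fold_assignment = "Modulo", seed = 1234)
```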

I have the above questions after reading Chapter 4 of [Practical Machine Learning with H2O by Darren Cook](https://www.oreilly.com/library/view/practical-machine-learning/9781491964590/ch04.html).

Further, consider generalizability with site-level data in a scenario like the one in the quotation below:

For example, if you have observations (e.g., user transactions) from K cities and you want to build models on users from only K-1 cities and validate them on the remaining city (if you want to study the generalization to new cities, for example), you will need to specify the parameter “fold_column” to be the city column. Otherwise, you will have rows (users) from all K cities randomly blended into the K folds, and all K cross-validation models will see all K cities, making the validation less useful (or totally wrong, depending on the distribution of the data). (source)

In that case, since we are defining the folds by a column, the fold assignment would be consistent across all the different models, right?
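If I read that correctly, leave-one-city-out validation would look something like the sketch below (city being a hypothetical categorical column; when fold_column is set, nfolds is not specified):

```r
# Each model holds out one city per fold, and because the folds are
# defined by the data itself, every algorithm sees identical folds
glm_city <- h2o.glm(x = x, y = y, training_frame = train,
                    fold_column = "city")
rf_city  <- h2o.randomForest(x = x, y = y, training_frame = train,
                             fold_column = "city")
```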


Solution

  • Make sure you split the dataset the same way for all ML algos (use the same seed for the train/test split). However, setting the same seed for each model will not necessarily produce the same cross-validation fold assignments. To ensure an apples-to-apples comparison, create a fold column (in Python, .kfold_column() or .stratified_kfold_column()) and specify it via fold_column during training so that every model uses the identical fold assignment.
  • fold_assignment accepts a single value ("AUTO", "Random", "Modulo", or "Stratified"), so Modulo and Stratified cannot be combined; a pre-built stratified fold column gives you both determinism and stratification.
  • When fold_column is set (e.g., to the city column), the folds are by construction identical for every model, so that comparison is consistent as well.
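A minimal R sketch of that recommendation (assuming train is an H2OFrame and x/y hold your predictor and response column names; h2o.kfold_column() is, to my knowledge, the R counterpart of the Python .kfold_column() method):

```r
library(h2o)
h2o.init()

# Build one shared fold-assignment column and attach it to the frame
train$fold <- h2o.kfold_column(train, nfolds = 5, seed = 1234)

# Point every model at the same fold column (do not also set nfolds),
# so the cross-validation splits are identical across algorithms
glm_fit   <- h2o.glm(x = x, y = y, training_frame = train,
                     fold_column = "fold")
lasso_fit <- h2o.glm(x = x, y = y, training_frame = train,
                     fold_column = "fold", alpha = 1)  # alpha = 1 -> lasso
rf_fit    <- h2o.randomForest(x = x, y = y, training_frame = train,
                              fold_column = "fold")

# Compare like with like using the cross-validation metrics
h2o.performance(glm_fit, xval = TRUE)
```

Because the folds live in the data rather than in each model's configuration, the comparison no longer depends on how each algorithm seeds its internal fold assignment.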