
How to train multiple h2o models on a nested data frame?


I was wondering if there is a convenient way to train multiple h2o models from a nested data frame in R. Assume we have a dataset with the following structure, and that I want to train one model for each Species:

library(dplyr)  # provides %>%
data(iris)
iris_nested <- iris %>%
  dplyr::mutate(dataset = dplyr::if_else(sample(1:nrow(iris)) < 100, "train", "val")) %>%
  dplyr::group_by(Species, dataset) %>%
  tidyr::nest() %>%
  tidyr::pivot_wider(names_from = dataset, values_from = data)


Is there a way to load the dataset into h2o and use it without writing a loop that unpacks the nested list? I would like to avoid creating h2o objects for each row.

Edit: For example, to predict Sepal.Length from the other numeric inputs, I would train a single model for row i with:

library(h2o)
h2o.init()   
h2o_train<-as.h2o(iris_nested[["train"]][[i]])
h2o_val<-as.h2o(iris_nested[["val"]][[i]])

h2o_trainedmodel <- h2o.automl(
  x = c("Sepal.Width","Petal.Length","Petal.Width"), 
  y = "Sepal.Length",
  training_frame = h2o_train,
  leaderboard_frame = h2o_val,
  project_name = "run1")

Afterward, I would extract and save the trained model and build a mapping table, so that I know which model belongs to which species.


Solution

  • With purrr you can embed the models in the tibble. If you then want to generate predictions, you will probably need map2, which is a bit more complicated than it should be, I think:

    library(dplyr)
    library(h2o)
    library(purrr)
    
    iris %>%
      dplyr::mutate(dataset = dplyr::if_else(sample(1:nrow(iris)) < 100, "train", "val")) %>%
      dplyr::group_by(Species, dataset) %>%
      tidyr::nest() %>%
      tidyr::pivot_wider(names_from = dataset, values_from = data) %>%
      mutate(model = map(train, ~ h2o.randomForest(
        y = "Sepal.Width",
        x = c("Sepal.Length", "Petal.Width", "Petal.Length"),
        training_frame = as.h2o(.x)
      )))
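
The map2 step and the mapping table asked for in the question could look roughly like the sketch below. It is not part of the original answer: the names predictors, model_dir, iris_models and model_map, the fixed seed, and the choice to save models under tempdir() are all assumptions made for illustration. It also assumes a local h2o cluster (so the R session and the h2o server share a file system) and that every species ends up with at least one validation row.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(h2o)

h2o.init()

set.seed(42)  # assumption: fixed seed so the train/val split is reproducible

# Hypothetical helper values, not from the original answer
predictors <- c("Sepal.Length", "Petal.Width", "Petal.Length")
model_dir <- file.path(tempdir(), "h2o_models")
dir.create(model_dir, showWarnings = FALSE, recursive = TRUE)

iris_models <- iris %>%
  mutate(dataset = if_else(sample(seq_len(n())) < 100, "train", "val")) %>%
  group_by(Species, dataset) %>%
  nest() %>%
  pivot_wider(names_from = dataset, values_from = data) %>%
  ungroup() %>%
  mutate(
    # one random forest per species, as in the answer above
    model = map(train, ~ h2o.randomForest(
      y = "Sepal.Width",
      x = predictors,
      training_frame = as.h2o(.x)
    )),
    # map2 walks the model and val columns in parallel:
    # .x is the model, .y is that species' validation data
    pred = map2(model, val, ~ as.data.frame(h2o.predict(.x, as.h2o(.y)))),
    # h2o.saveModel writes the model to disk and returns its path
    model_path = map_chr(model, ~ h2o.saveModel(.x, path = model_dir, force = TRUE))
  )

# Mapping table: which model (id and file) belongs to which species
model_map <- iris_models %>%
  transmute(Species, model_id = map_chr(model, ~ .x@model_id), model_path)

print(model_map)
```

Note that this still calls as.h2o once per row, so it hides the loop inside map rather than removing it. A saved model can later be restored with h2o.loadModel(model_path).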