I was wondering whether there is a convenient way to train multiple h2o models from a nested data frame in R. Assume we have a dataset with the following structure, and I want to train one model for each Species:
library(dplyr)
data(iris)
iris_nested <- iris %>%
  dplyr::mutate(dataset = dplyr::if_else(sample(1:nrow(iris)) < 100, "train", "val")) %>%
  dplyr::group_by(Species, dataset) %>%
  tidyr::nest() %>%
  tidyr::pivot_wider(names_from = dataset, values_from = data)
Is there a way to load this dataset into h2o and use it without writing a loop to break up the nested list? I would like to avoid creating h2o objects by hand for each row.
Edit: For example, to predict Sepal.Length from the other numeric inputs, I would train a single model for row i with:
library(h2o)
h2o.init()
h2o_train <- as.h2o(iris_nested[["train"]][[i]])
h2o_val <- as.h2o(iris_nested[["val"]][[i]])
h2o_trainedmodel <- h2o.automl(
  x = c("Sepal.Width", "Petal.Length", "Petal.Width"),
  y = "Sepal.Length",
  training_frame = h2o_train,
  leaderboard_frame = h2o_val,
  project_name = "run1")
Afterward, I would extract and save each trained model and generate a mapping table, so that I know which model belongs to which species.
With purrr you can embed the models in the tibble. If you then want to do prediction, you might need to use map2, which is a bit more complicated than it should be, I think:
library(dplyr)
library(h2o)
library(purrr)
h2o.init()  # start or connect to a local h2o cluster before calling as.h2o()
iris %>%
  dplyr::mutate(dataset = dplyr::if_else(sample(1:nrow(iris)) < 100, "train", "val")) %>%
  dplyr::group_by(Species, dataset) %>%
  tidyr::nest() %>%
  tidyr::pivot_wider(names_from = dataset, values_from = data) %>%
  # one random forest per row, stored in a list column next to its data
  mutate(model = map(train, ~ h2o.randomForest(
    y = "Sepal.Width",
    x = c("Sepal.Length", "Petal.Width", "Petal.Length"),
    training_frame = as.h2o(.x))))
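As a rough sketch of the map2 step mentioned above, the pipeline could be extended to predict on each validation set and to build the Species-to-model mapping table the question asks for. The names iris_models, pred, model_id and model_map, the set.seed() call and the ungroup() step are my own additions for illustration, and this still converts each nested data frame with as.h2o() under the hood; it only hides the loop inside purrr.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(h2o)

h2o.init()

set.seed(1)  # assumption: fixed seed so the train/val split is reproducible

iris_models <- iris %>%
  mutate(dataset = if_else(sample(seq_len(n())) < 100, "train", "val")) %>%
  group_by(Species, dataset) %>%
  nest() %>%
  pivot_wider(names_from = dataset, values_from = data) %>%
  ungroup() %>%
  mutate(
    # one model per Species, trained on that row's nested training data
    model = map(train, ~ h2o.randomForest(
      y = "Sepal.Width",
      x = c("Sepal.Length", "Petal.Width", "Petal.Length"),
      training_frame = as.h2o(.x)
    )),
    # map2 pairs each model with its own validation data
    pred = map2(model, val, ~ as.data.frame(h2o.predict(.x, as.h2o(.y)))),
    # h2o's identifier for each model, taken from the model object
    model_id = map_chr(model, ~ .x@model_id)
  )

# mapping table: which model belongs to which Species
model_map <- iris_models %>% select(Species, model_id)
```

If the models also need to be written to disk, something like walk(iris_models$model, ~ h2o.saveModel(.x, path = "models")) should work, where "models" is just a placeholder directory.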