My dataset looks like this:
rownum a b y x
1 | A | a |1 | a
2 | B | a |1 | a
3 | C | a |1 | a
4 | D | a |0 | b
5 | E | a |0 | a
6 | F | a |0 | b
I want to create one H2OFrame per tissue type (the x column), like this:
a:
rownum a b y x
1 | A | a |1 | a
2 | D | a |0 | a
3 | F | a |0 | a
b:
rownum a b y x
1 | B | a |1 | b
2 | C | a |1 | b
3 | E | a |0 | b
I am currently doing this manually, but that does not scale as I add more tissues to the dataset.
I then want to pass each of those H2OFrames to h2o.glm and save each resulting model.
"INSERT x NAME HERE" = h2o.glm(y = "y", x = c("a","b"),
training_frame = ITERATE H2O FRAMES HERE, family = 'poisson')
and then save the model
INSERT x NAME HERE <- h2o.saveModel(object = INSERT x NAME HERE, force = TRUE)
I would appreciate any help or advice you might have. I know I could use interaction terms in a GLM instead, but I would like to fit separate models for now.
Since you did not provide the data directly, I recreated your example as an R data.frame, renaming the columns a, b, and x to genes, samples, and tissue.
library(h2o)
h2o.init()
# Example data as an R data.frame
df <- data.frame(genes = c("A","B","C","D","E","F"),
                 samples = c("a","a","a","a","a","a"),
                 y = c(1,1,1,0,0,0),
                 tissue = c("Muscle","Brain","Brain","Muscle","Brain","Muscle"),
                 stringsAsFactors = TRUE)  # keep text columns categorical (R >= 4.0 defaults to character)
# Convert R data.frame to H2OFrame
hf <- as.h2o(df)
In practice you probably have this data in a CSV file on disk, in which case you would load it like this instead:
# Load data from disk directly into H2O cluster
hf <- h2o.importFile("tissue_samples.csv")
Now that you have the data in an H2OFrame, there are only a few more steps:
# List of unique tissue types
tissue_types <- as.list(h2o.unique(hf$tissue))
# Create list of frames (one for each tissue type)
# lapply always returns a list, so nothing is simplified unexpectedly
frames <- lapply(tissue_types, function(tt) hf[(hf[,"tissue"] == tt),])
# Set up h2o.glm arguments
x <- c("genes", "samples")
y <- "y"
# List of glms (one for each tissue type)
glms <- lapply(frames, function(fr) h2o.glm(x = x, y = y,
                                            family = "poisson", training_frame = fr))
# Save the models
model_names <- sapply(glms, function(m) h2o.saveModel(m, path = "/Users/me/", force = TRUE))
# Look at model names
print(model_names)
# [1] "/Users/me/GLM_model_R_1497937770060_222"
# [2] "/Users/me/GLM_model_R_1497937770060_223"
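The saved files above get auto-generated names, while the question asked for each model to carry its tissue name. Here is a minimal sketch of one way to do that, assuming an H2O version whose h2o.glm accepts a model_id argument. The names tissue_names, glms_by_tissue, and paths, the "glm_" prefix, and the use of tempdir() as the save location are illustrative choices, not part of the original answer.

```r
library(h2o)
h2o.init()

# Same example data as above (text columns kept categorical)
df <- data.frame(genes = c("A","B","C","D","E","F"),
                 samples = c("a","a","a","a","a","a"),
                 y = c(1,1,1,0,0,0),
                 tissue = c("Muscle","Brain","Brain","Muscle","Brain","Muscle"),
                 stringsAsFactors = TRUE)
hf <- as.h2o(df)

x <- c("genes", "samples")
y <- "y"

# Tissue labels as a plain R character vector
tissue_names <- as.character(as.data.frame(h2o.unique(hf$tissue))[, 1])

# Fit one GLM per tissue; model_id gives each model a readable name
glms_by_tissue <- lapply(tissue_names, function(tn) {
  h2o.glm(x = x, y = y, family = "poisson",
          training_frame = hf[hf[, "tissue"] == tn, ],
          model_id = paste0("glm_", tn))
})
names(glms_by_tissue) <- tissue_names

# Save each model; h2o.saveModel returns the path it wrote to
paths <- sapply(glms_by_tissue, h2o.saveModel, path = tempdir(), force = TRUE)
print(paths)
```

Because paths is named by tissue, a single model can be reloaded later with something like h2o.loadModel(paths[["Brain"]]). If tissue labels contain spaces or other unusual characters, you may want to clean them before using them in model_id.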