How to iterate GLMs in H2O


My dataset looks like this:

rownum | a | b | y | x
1      | A | a | 1 | a
2      | B | a | 1 | a
3      | C | a | 1 | a
4      | D | a | 0 | b
5      | E | a | 0 | a
6      | F | a | 0 | b

I want to create many H2OFrames, one per tissue type (column x), like this:

a:

rownum | a | b | y | x
1      | A | a | 1 | a
2      | D | a | 0 | a
3      | F | a | 0 | a

b:

rownum | a | b | y | x
1      | B | a | 1 | b
2      | C | a | 1 | b
3      | E | a | 0 | b

While I am currently doing it manually, that becomes difficult when I add more tissues to the dataset.

I then want to pass each of those H2OFrames to h2o.glm and save each model in turn.

"INSERT x NAME HERE" = h2o.glm(y = "y", x = c("a","b"),
                               training_frame = ITERATE H2O FRAMES HERE, family = 'poisson')

and then save the model

INSERT x NAME HERE <- h2o.saveModel(object = INSERT x NAME HERE, force = TRUE)

I would appreciate any help or advice you might have. I do know about interaction terms in GLM, but would like to do this for now.


Solution

  • Since you did not provide the data directly, I copied your example from above into an R data.frame, using descriptive column names (genes, samples, y, tissue).

    library(h2o)
    h2o.init()
    
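    # Example data as an R data.frame
    # (stringsAsFactors = TRUE so the text columns become categorical in H2O;
    #  since R 4.0 the default is FALSE, which would give string columns)
    df <- data.frame(genes = c("A","B","C","D","E","F"),
                     samples = c("a","a","a","a","a","a"),
                     y = c(1,1,1,0,0,0),
                     tissue = c("Muscle","Brain","Brain","Muscle","Brain","Muscle"),
                     stringsAsFactors = TRUE)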
    
    # Convert R data.frame to H2OFrame
    hf <- as.h2o(df)
    

    However, I assume you have this data in a CSV on your computer, so in reality, what you'd do is this:

    # Load data from disk directly into H2O cluster
    hf <- h2o.importFile("tissue_samples.csv")
    

    Now that you have the data in an H2OFrame, there are only a few more steps:

    # List of unique tissue types
    tissue_types <- as.list(h2o.unique(hf$tissue))
    
    # Create list of frames (one for each tissue type)
    # lapply always returns a list; `tissue` avoids shadowing base::t
    frames <- lapply(tissue_types, function(tissue) hf[(hf[,"tissue"] == tissue),])
    
    # Set up h2o.glm arguments
    x <- c("genes", "samples")
    y <- "y"
    
    # List of glms (one for each tissue type)
    glms <- lapply(frames, function(fr) h2o.glm(x = x, y = y, 
                           family = "poisson", training_frame = fr))
    
    # Save the models
    model_names <- sapply(glms, function(m) h2o.saveModel(m, path = "/Users/me/", force = TRUE))
    
    # Look at model names
    print(model_names)
    # [1] "/Users/me/GLM_model_R_1497937770060_222"
    # [2] "/Users/me/GLM_model_R_1497937770060_223"
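
If you want each model to carry the tissue name, as in the "INSERT x NAME HERE" placeholder from the question, one option is to loop over the tissue labels and set `model_id` yourself. The following is only a minimal sketch under a few assumptions: the `glm_<tissue>` naming scheme and the `out_dir` location are placeholders I made up, and the data is the same toy data.frame as above (built with `stringsAsFactors = TRUE` so that the text columns are categorical in H2O).

```r
library(h2o)
h2o.init()

# Same toy data as above; text columns are created as factors
df <- data.frame(genes = c("A","B","C","D","E","F"),
                 samples = c("a","a","a","a","a","a"),
                 y = c(1,1,1,0,0,0),
                 tissue = c("Muscle","Brain","Brain","Muscle","Brain","Muscle"),
                 stringsAsFactors = TRUE)
hf <- as.h2o(df)

# Pull the unique tissue labels back into R as a plain character vector
tissue_types <- as.character(as.data.frame(h2o.unique(hf$tissue))[[1]])

# Fit one GLM per tissue and store it in a list keyed by tissue name
models <- list()
for (tissue in tissue_types) {
  fr <- hf[hf[, "tissue"] == tissue, ]
  models[[tissue]] <- h2o.glm(x = c("genes", "samples"), y = "y",
                              training_frame = fr, family = "poisson",
                              # hypothetical naming scheme: glm_<tissue>
                              model_id = paste0("glm_", tissue))
}

# Save every model; out_dir is a placeholder, point it at your own directory
out_dir <- tempdir()
model_paths <- sapply(models, h2o.saveModel, path = out_dir, force = TRUE)
print(model_paths)
```

Because `models` and `model_paths` are both named by tissue, you can look up a single model with `models[["Brain"]]`, or reload it later with `h2o.loadModel(model_paths[["Brain"]])`.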