Keeping the ID Key (Or Any Other Column) When Scoring a New Data Set?


This is probably a dumb question, but when I use the H2O predict function in R (h2o.predict), is there a way to tell it to keep one or more columns from the scoring data? Specifically, I want to keep my unique ID key.

As it stands now, I end up taking a really inefficient approach: I assign an index key to the original data set and another to the scores, then merge the scores onto the scoring data set. I'd rather just say "score this data set and keep columns x, y, z as well." Any advice?

Inefficient code:

library(h2o)
library(dplyr)  # for inner_join()

# Use the H2O predict function to score new data
NL2L_SCore_SetScored.hex <- h2o.predict(object = best_gbm, newdata = NL2L_SCore_Set.hex)

# Convert the scores from an H2OFrame to a data frame
NL2L_SCore_SetScored.df <- as.data.frame(NL2L_SCore_SetScored.hex)
# Add an index to the scores so we can merge the two data sets
NL2L_SCore_SetScored.df$ID <- seq.int(nrow(NL2L_SCore_SetScored.df))

# Convert the original scoring set from an H2OFrame to a data frame
NL2L_SCore_Set.df <- as.data.frame(NL2L_SCore_Set.hex)
# Add an index to the original scoring data so we can merge the two data sets
NL2L_SCore_Set.df$ID <- seq.int(nrow(NL2L_SCore_Set.df))

# Then merge by the newly created ID key so I have the scores on my scoring
# data set. Ideally I wouldn't have to create this key at all and could keep
# original columns from the data set, which include the customer ID key.
Full_Scored_Set <- inner_join(NL2L_SCore_Set.df, NL2L_SCore_SetScored.df, by = "ID")

Solution

  • Rather than doing a join, you can simply column-bind the ID column onto the prediction frame, because the rows of the prediction frame are in the same order as the rows of the frame you scored.

    R Example (ignore the fact that I am predicting on the original training set, this is for demonstration purposes only):

    library(h2o)
    h2o.init()
    
    data(iris)
    iris$id <- 1:nrow(iris)  #add ID column
    iris_hf <- as.h2o(iris)  #convert iris to an H2OFrame
    
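    fit <- h2o.gbm(x = 1:4, y = 5, training_frame = iris_hf)  #columns 1-4 are predictors, column 5 (Species) is the response
    pred <- h2o.predict(fit, newdata = iris_hf)
    pred$id <- iris_hf$id  #attach the ID column to the prediction frame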
    head(pred)
    

    Now you have a prediction frame with the ID column:

      predict    setosa   versicolor    virginica id
    1  setosa 0.9989301 0.0005656447 0.0005042210  1
    2  setosa 0.9985183 0.0006462680 0.0008354416  2
    3  setosa 0.9989298 0.0005663071 0.0005038929  3
    4  setosa 0.9989310 0.0005660443 0.0005029535  4
    5  setosa 0.9989315 0.0005649384 0.0005035886  5
    6  setosa 0.9983457 0.0011517334 0.0005025218  6
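
If you want to keep several columns rather than just one, the same idea can be sketched with h2o.cbind(), which binds H2OFrames side by side. This is a minimal sketch on the same iris toy data, not a tested recipe for the frames in the question: the keep_cols vector and the scored_hf / scored_df names are illustrative, and the row-count check is an added safeguard, since column-binding only makes sense when both frames have the same number of rows.

```r
library(h2o)
h2o.init()

# Same toy setup as the example above: iris with an added ID column
data(iris)
iris$id <- 1:nrow(iris)
iris_hf <- as.h2o(iris)

fit  <- h2o.gbm(x = 1:4, y = 5, training_frame = iris_hf)
pred <- h2o.predict(fit, newdata = iris_hf)

# Columns to carry over from the scoring frame (illustrative choice)
keep_cols <- c("id", "Species")

# Guard: column-binding is only valid when both frames have the same row count
stopifnot(nrow(pred) == nrow(iris_hf))

# Bind the kept columns and the predictions side by side, inside H2O
scored_hf <- h2o.cbind(iris_hf[, keep_cols], pred)

# Pull the result into R only once, at the end
scored_df <- as.data.frame(scored_hf)
head(scored_df)
```

For the frames in the question, the equivalent would presumably be to select the customer ID column (and any others you need) from NL2L_SCore_Set.hex and bind it to NL2L_SCore_SetScored.hex before calling as.data.frame(), which avoids converting both frames and joining them in R.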