
Get H2OFrame as an in-memory object instead of a reference to a location in the H2O cluster


We created and trained a model using the H2O libraries, configured H2O in an OpenShift container, and deployed the trained model for real-time inference. It worked well with a single container, but we have to scale up to handle an increase in transaction volume, and we encountered an issue with the stateful nature of the H2OFrame. Please see my sample code.

Step-1: Convert the JSON dictionary into a pandas DataFrame.
Step-2: Convert the pandas DataFrame into an H2OFrame.
Step-3: Run the model with the H2OFrame as input.

Here step-2 returns a handle to data stored in the container. From the H2O documentation: "H2OFrame is similar to pandas’ DataFrame, or R’s data.frame. One of the critical distinction is that the data is generally not held in memory, instead it is located on a (possibly remote) H2O cluster, and thus H2OFrame represents a mere handle to that data." So step-3's request must go to the same container; otherwise the H2OFrame cannot be found and an error is thrown.
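
For illustration, a minimal sketch of that handle behaviour (the connection URL and data are placeholders, not our actual setup):

 import h2o
 import pandas as pd

 h2o.connect(url="http://h2o-container-1:54321")   # placeholder URL of one container

 # Uploading a pandas DataFrame returns only a handle; the data itself lives
 # in that H2O instance's memory, keyed by a frame id.
 hf = h2o.H2OFrame(pd.DataFrame({"x": [1, 2, 3]}))
 print(hf.frame_id)

 # The frame can only be retrieved from the cluster that holds it
 same = h2o.get_frame(hf.frame_id)   # works here, fails against a different container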

Step-1: convert the JSON dictionary to a pandas DataFrame

 ToBeScored = pd.DataFrame([jsonDictionary])

Step-2: convert the pandas DataFrame to an H2OFrame

 ToBeScored_hex = h2o.H2OFrame(ToBeScored)

Step-3: run the model

 outPredections = rf_model.predict(ToBeScored_hex)
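
Putting the three steps together, a self-contained sketch of what each container currently runs (the model path and JSON payload are placeholders):

 import h2o
 import pandas as pd

 h2o.init()   # or h2o.connect() to the H2O instance inside this container

 jsonDictionary = {"feature1": 1.0, "feature2": "A"}   # placeholder request payload
 rf_model = h2o.load_model("/models/rf_model")         # placeholder path to the binary model

 # Step-1: JSON dictionary -> pandas DataFrame
 ToBeScored = pd.DataFrame([jsonDictionary])

 # Step-2: pandas DataFrame -> H2OFrame (the data now lives in this container's H2O cluster)
 ToBeScored_hex = h2o.H2OFrame(ToBeScored)

 # Step-3: score with the trained model; the result is also a handle to cluster-side data
 outPredections = rf_model.predict(ToBeScored_hex)
 print(outPredections.as_data_frame())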

If the H2OFrame could be returned as an in-memory object in step-2, this stateful behaviour could be avoided. Is there any way to do that? Or can the H2O cluster be configured to store the H2OFrame so that it is accessible from any OpenShift container in the cluster?

Useful links
  • H2O's predict() function accepts data only in H2OFrame format: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/model_categories.html#h2o.model.model_base.ModelBase.predict
  • H2OFrame data type: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html

Update (6/19/2019): continuation question following @ErinLeDell's clarification
We upgraded to H2O 3.24 and used a MOJO model. We removed step 2 and replaced step 3 with this function call:

 import h2o as h

 # score a single CSV row against the MOJO on disk
 result = h.mojo_predict_csv(input_csv_path="PredictionDataRow.csv", mojo_zip_path="rf_model.zip",
                             genmodel_jar_path="h2o-genmodel.jar",
                             java_options='-Xmx512m -XX:ReservedCodeCacheSize=256m', verbose=True)

Internally it executed the command below, which initialized a new JVM and started an H2O local server for every call; the H2O local server is initialized only to find the path to java.

java = H2OLocalServer._find_java()   # find the java path, then build the command line below

C:\Program Files (x86)\Common Files\Oracle\Java\javapath\java.exe -Xmx512m -XX:ReservedCodeCacheSize=256m -cp h2o-genmodel.jar hex.genmodel.tools.PredictCsv --mojo C:\Users\admin\Documents\Code\python\rf_model.zip --input PredictionDataRow.csv --output C:\Users\admin\Documents\Code\python\prediction.csv --decimal 
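
For reference, the generated command can be reproduced directly, as a sketch, if java is already on the PATH (mojo_predict_csv instead resolves the path via H2OLocalServer._find_java()):

 import subprocess

 # Rough equivalent of the command line above, with java taken from the PATH
 cmd = ["java", "-Xmx512m", "-XX:ReservedCodeCacheSize=256m",
        "-cp", "h2o-genmodel.jar", "hex.genmodel.tools.PredictCsv",
        "--mojo", "rf_model.zip", "--input", "PredictionDataRow.csv",
        "--output", "prediction.csv", "--decimal"]
 subprocess.run(cmd, check=True)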

Question-1: Is there a way to use an existing JVM rather than spawning a new one for every transaction?
Question-2: Is there a way to pass the java path and avoid the H2O local server initialization? Is H2OLocalServer required for anything other than finding the java path? If it cannot be avoided, is it possible to initialize the local server once and direct new requests to the existing H2O local server instead of starting a new one?


Solution

  • An alternative is to use an H2O MOJO model (instead of a binary model, which needs to exist in H2O cluster memory to make predictions). MOJO models can sit on disk and do not require a running H2O cluster. You can then skip Step 2 and use the h2o.mojo_predict_pandas() function in Step 3.
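
A minimal usage sketch of that approach, reusing the MOJO zip and genmodel jar from the update above (the input row is a placeholder):

 import h2o
 import pandas as pd

 ToBeScored = pd.DataFrame([{"feature1": 1.0, "feature2": "A"}])   # placeholder request payload

 # Scores against the MOJO on disk; no running H2O cluster (and no H2OFrame handle)
 # is involved, so any container that has the two files can serve the request.
 result = h2o.mojo_predict_pandas(dataframe=ToBeScored,
                                  mojo_zip_path="rf_model.zip",
                                  genmodel_jar_path="h2o-genmodel.jar")
 print(result)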