python, h2o, sparkling-water

Adding additional data to each row in an H2OFrame


I am working with a huge H2OFrame (~150 GB, ~200 million rows) that I need to manipulate slightly. More specifically, I need to use the frame's ip column to find the location/city name for each IP and add this information to each of the frame's rows.

Converting the frame to a plain Python object and manipulating it locally is not an option because of its size. What I was hoping to do instead is use my H2O cluster to create a new H2OFrame, city_names, from the original frame's ip column and then merge both frames.

My question is kind of similar to the question posed here, and what I gathered from this question's answer is that there is no way in H2O to do complex manipulations of each of the frame's rows. Is that really the case? H2OFrame's apply function only accepts a lambda without custom methods after all.

One option I thought of was to use Spark/Sparkling Water for this kind of data manipulation and then convert the Spark DataFrame to an H2OFrame for the machine learning operations. However, if possible I would prefer to avoid that and use only H2O, not least because of the overhead such a conversion creates.

So I guess it comes down to this: Is there any way to do this kind of manipulation using only H2O? And if not, is there another option that does not require changing my cluster architecture (i.e. without turning my H2O cluster into a Sparkling Water cluster)?


Solution

  • Yes, that is the case: H2OFrame's apply accepts only a lambda, not a named function. For example, if you try passing a function named tryit, you get the following error, which shows the limitation:

    H2OValueError: Argument `fun` (= <function tryit at 0x108d66410>) does not satisfy the condition fun.__name__ == "<lambda>"
    

    As you already know, Sparkling Water is another option: do all the data munging in Spark first and then push your data into H2O for ML.

    If you want to stick with H2O as it is, your option is to loop through the frame and process the elements yourself. This can be a little time-consuming depending on your data, but it does not require you to change your environment.

    • Create a new H2OFrame by selecting only your "ip" column, and add location, city, and any other columns to it, initially filled with NA.
    • Loop through all the ip values; for each "ip", find the location/city and write the location, city, and other values into those columns.
    • Finally, cbind the new H2OFrame with the original H2OFrame.
    • Check that the "ip" and "ip0" columns match 100% to confirm a proper merge, then remove the duplicate "ip0" column.
    • Remove the extra H2OFrame to save memory.
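The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: `locate` is a hypothetical function you supply (for example, a wrapper around a GeoIP database) that returns a `(city, location)` pair for one IP string, and the new column names `city` and `location` are made up for the example. The sketch is also a variation on the steps rather than a literal transcription: instead of filling NA columns cell by cell, it pulls the "ip" column to the Python client, builds the new columns there, and uploads them as a second frame before the cbind. Each distinct IP is looked up only once. With ~200 million rows, pulling even one column to the client may be too slow or too large, so you would probably need to process the frame in row slices.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def lookup_locations(
    ips: Iterable[str],
    locate: Callable[[str], Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Build city/location columns aligned with the order of `ips`.

    `locate` is a hypothetical user-supplied lookup; it is called once
    per distinct IP and the result is cached for repeated IPs.
    """
    cache: Dict[str, Tuple[str, str]] = {}
    cities: List[str] = []
    locations: List[str] = []
    for ip in ips:
        if ip not in cache:
            cache[ip] = locate(ip)
        city, location = cache[ip]
        cities.append(city)
        locations.append(location)
    return {"city": cities, "location": locations}


def add_location_columns(frame, locate, ip_col="ip"):
    """Return `frame` with city/location columns appended via cbind.

    Assumes a running H2O cluster and that `frame` is an H2OFrame.
    """
    import h2o  # imported here so lookup_locations works without H2O

    # Pull only the ip column to the client as a list of one-element rows.
    ip_rows = frame[ip_col].as_data_frame(use_pandas=False, header=False)
    ips = [row[0] for row in ip_rows]

    # Upload the new columns as their own frame (dict: column name -> values).
    extra = h2o.H2OFrame(lookup_locations(ips, locate))

    # cbind pairs rows by position, so both frames must have the same
    # number of rows in the same order.
    return frame.cbind(extra)
```

Calling `add_location_columns(my_frame, my_locate)` would return a new frame and leave the original untouched. Because the uploaded frame here does not repeat the "ip" column, there is no duplicate "ip0" column to compare; if you want that sanity check, include the IPs in the uploaded frame as well and compare the two columns before dropping one.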