machine-learning h2o sparkling-water

Represent a list of items in input CSV for H2O


How do I represent a set/list of items in the input data (data frame) for H2O?

I'm using Sparkling Water 1.6.5 with H2O Flow. My input data (columns in the CSV file) looks like this:

age: numeric
gender: enum
hobbies: ?
sports: ?

hobbies and sports are lists/sets with a limited number of possible entries (~20 each). H2O does not seem to have a suitable data type for this. How do I export these into a CSV file that can be processed by H2O Flow?


Solution

  • If you were just recording their main hobby, or main sport, then it would be a single enum column, e.g. hobbies, with 20 levels. You would simply write it as a string field in your csv file, and H2O would read it.

    But I think what you are after is where each person has 0+ choices from 20 hobbies? In that case you need to have 20 columns in your CSV file, one per hobby; each will be a 2-value enum. It doesn't matter what the two values are: Y/N, T/F, Y/blank, hobby-name/blank, etc. Your CSV file might look like this:

    name,gender,football?,running?,data mining?,sleeping?
    Tom,M,Y,,,Y
    Dick,M,,,Y,
    Suzy,F,,Y,Y,
    

    Tom likes football and sleeping, Dick lives for data mining and nothing else, and Suzy is into running and data mining.

    By the way, if using deep learning then it will end up with the same network configuration: a single 20-level enum input will be converted into 20 binary input nodes.
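
If your data currently holds each person's hobbies as a set, the one-column-per-hobby layout described above could be produced with something like the following. This is only a minimal sketch using Python's standard `csv` module: the function names `expand_multi_valued` and `to_csv` are made up for illustration, and prefixing each new column with the original column name (e.g. `hobbies_football`) is my own assumption, added so that a hobby and a sport with the same name would not collide. It uses the Y/blank encoding from the example.

```python
import csv
import io


def expand_multi_valued(rows, list_columns):
    """Replace each set-valued column with one Y/blank column per distinct item.

    Hypothetical helper: `rows` is a list of dicts, `list_columns` names the
    columns whose values are sets (e.g. "hobbies", "sports").
    """
    # Collect the distinct items seen in each set-valued column, sorted for a stable header.
    levels = {
        col: sorted({item for row in rows for item in row[col]})
        for col in list_columns
    }
    expanded = []
    for row in rows:
        # Copy the ordinary columns (age, gender, ...) unchanged.
        out = {k: v for k, v in row.items() if k not in list_columns}
        for col in list_columns:
            for item in levels[col]:
                # Assumed naming scheme: "<column>_<item>", value "Y" or blank.
                out[f"{col}_{item}"] = "Y" if item in row[col] else ""
        expanded.append(out)
    return expanded


def to_csv(rows):
    """Serialize a non-empty list of dicts to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


people = [
    {"name": "Tom", "gender": "M", "hobbies": {"football", "sleeping"}},
    {"name": "Dick", "gender": "M", "hobbies": {"data mining"}},
    {"name": "Suzy", "gender": "F", "hobbies": {"running", "data mining"}},
]

csv_text = to_csv(expand_multi_valued(people, ["hobbies"]))
print(csv_text)
```

Write `csv_text` to a file and import it in Flow as usual; each generated column should then be parsed as a 2-value enum (or you can set the column type to enum at parse time if it is guessed differently). Passing `["hobbies", "sports"]` would expand both columns in the same pass.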