
H2O model wrongly treating a field as numerical when it was trained with enum type?


I'm having a problem where an H2O DRF model treats a field as an int even though the field was set to enum when the model was trained.

When I use the H2O tree API to examine individual trees in a trained DRF model, some fields that were explicitly set as enum at training time (i.e., the pandas DataFrame was converted to an H2OFrame with certain fields assigned a type through the column_types map parameter) appear to be treated as ints. For example:

root_node.features
> observe that the feature being examined for this node is one of the features set to be categorical enum by the H2OFrame that the model was trained on
tree.root_node.features
> some_categorical
tree.root_node.levels
> []
root_node.threshold
> some number

More compactly

print(tree.root_node)

Node ID 0 
Left child node ID = 1 Right child node ID = 2 
Splits on column some_categorical 
Split threshold < 2562.5 to the left node, >= 2562.5 to the right node 
NA values go to the LEFT

yet for other nodes (for the same model) we (correctly) see

tree.root_node.features
> some_other_categorical
tree.root_node.levels
> ['cat1', ..., 'catn']
root_node.threshold
> na

Initially I thought this only appeared to be treated as an int because of how categorical values are represented internally in H2O:

enum or Enum: Leave the dataset as is, internally map the strings to integers, and use these integers to make splits - either via ordinal nature when nbins_cats is too small to resolve all levels or via bitsets that do a perfect group split. Each category is a separate category; its name (or number) is irrelevant. For example, after the strings are mapped to integers for Enum, you can split {0, 1, 2, 3, 4, 5} as {0, 4, 5} and {1, 2, 3}.
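
The two kinds of split described in that passage can be sketched in plain Python. This is only an illustration of the idea, not H2O's internal code: the level names, the sorted ordering of the codes, and both helper functions are assumptions made up for the example.

```python
# Illustrative only: plain Python, not H2O internals.
levels = ["a", "b", "c", "d", "e", "f"]  # hypothetical category names

# Map each string to an integer code (sorted order is an assumption here).
code_of = {name: i for i, name in enumerate(sorted(levels))}


def bitset_split(code, left_codes):
    """Group split: any subset of codes may go left; their order is irrelevant."""
    return "left" if code in left_codes else "right"


def ordinal_split(code, threshold):
    """Ordinal split: codes below the threshold go left, the rest go right."""
    return "left" if code < threshold else "right"


# The {0, 4, 5} versus {1, 2, 3} grouping from the quoted documentation.
group_result = {name: bitset_split(code, {0, 4, 5}) for name, code in code_of.items()}

# A threshold split on the same codes, similar in shape to "< 2562.5".
ordinal_result = {name: ordinal_split(code, 2.5) for name, code in code_of.items()}

print(group_result)
print(ordinal_result)
```

A group split would be reported as a list of levels, while an ordinal split can only be reported as a number, which matches the two kinds of node output shown above.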

but the informational output shows a numeric threshold (< / >=) and no levels for deciding the left/right direction, which suggests there is some other problem here.

Examining the column_types map used in the pandas-to-H2OFrame conversion, and printing the types before training the model, shows that the appropriate fields are being set as enum, so this output is confusing. Does anyone know what could be going on, or what other debugging steps I could try?
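
For reference, the type check described above could look roughly like the sketch below. It assumes only that the frame exposes a `types` mapping of column name to type string, as `H2OFrame.types` does; the helper `find_type_mismatches` and the stand-in frame object are hypothetical, so the snippet runs without an H2O cluster.

```python
from types import SimpleNamespace


def find_type_mismatches(frame, expected_types):
    """Return {column: (expected, actual)} for columns whose type differs.

    `frame` is anything with a `types` dict (for example an H2OFrame).
    """
    actual = frame.types
    return {
        col: (want, actual.get(col))
        for col, want in expected_types.items()
        if actual.get(col) != want
    }


# Hypothetical stand-in for an H2OFrame, used so the example is self-contained.
fake_frame = SimpleNamespace(
    types={"some_categorical": "int", "some_other_categorical": "enum"}
)
column_types = {"some_categorical": "enum", "some_other_categorical": "enum"}

mismatches = find_type_mismatches(fake_frame, column_types)
print(mismatches)
```

An empty result on the real training frame would mean the types were applied as intended, which is what I am seeing.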


Solution

  • This is not a bug in the algorithm (the splits are still correct) but in the way H2O-3 represents splits in the MOJO Tree Visualizer and the tree API. I've created a JIRA ticket that you can track here (or add to), which should make the splits shown by the MOJO Tree Visualizer and the tree API less confusing (i.e., using numeric splits or showing the list of categorical levels instead of both). The numeric splits you see correspond to our internal method for doing categorical splits.
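
If the reported threshold applies to the internal integer codes of the levels, as the answer suggests, a numeric split on a categorical column can be read back as two lists of levels. The following is a minimal sketch under that assumption; the helper name and the five-level domain are hypothetical. In a real session the domain would presumably come from the training frame, for example `frame["some_categorical"].levels()[0]`.

```python
def decode_threshold_split(domain, threshold):
    """Split a list of levels by a numeric threshold on their integer codes.

    Assumes level i in `domain` has internal code i (an assumption, not
    something confirmed by the tree API output).
    """
    left = [lvl for code, lvl in enumerate(domain) if code < threshold]
    right = [lvl for code, lvl in enumerate(domain) if code >= threshold]
    return left, right


# Hypothetical domain; the real column would need more than 2562 levels
# for a threshold such as 2562.5 to separate anything.
domain = ["cat0", "cat1", "cat2", "cat3", "cat4"]
left, right = decode_threshold_split(domain, 2.5)

print("left:", left)
print("right:", right)
```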