I'm having a problem where an H2O DRF model treats a field as an int even though the field's type was set to enum when the model was trained.
When using the H2O tree API to examine individual trees in a trained DRF model, I can see that some fields that were explicitly set as enum when the model was trained (i.e., the pandas DataFrame was converted to an H2OFrame with certain fields set to a particular type via a column_types map parameter) appear to be treated as ints when doing something like:
root_node.features
> observe that the feature being examined for this node is one of the features set to be categorical enum by the H2OFrame that the model was trained on
tree.root_node.features
> some_categorical
tree.root_node.levels
> []
root_node.threshold
> some number
More compactly:
print(tree.root_node)
Node ID 0
Left child node ID = 1 Right child node ID = 2
Splits on column some_categorical
Split threshold < 2562.5 to the left node, >= 2562.5 to the right node
NA values go to the LEFT
Yet for other nodes in the same model we correctly see:
tree.root_node.features
> some_other_categorical
tree.root_node.levels
> ['cat1', ..., 'catn']
root_node.threshold
> na
Initially I thought this only appeared to be treated as an int because of how categorical values are represented internally in H2O:
enum or Enum: Leave the dataset as is, internally map the strings to integers, and use these integers to make splits - either via ordinal nature when nbins_cats is too small to resolve all levels or via bitsets that do a perfect group split. Each category is a separate category; its name (or number) is irrelevant. For example, after the strings are mapped to integers for Enum, you can split {0, 1, 2, 3, 4, 5} as {0, 4, 5} and {1, 2, 3}.
However, the informational output shows a numeric threshold (< to the left, >= to the right) and no levels for determining the left-right direction, so there seems to be some other problem here.
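To make the two split styles from the quoted passage concrete, here is a minimal pure-Python sketch. It does not call H2O; the level names, the helper functions, and the assumption that a level's integer code is its index in the level list are all illustrative, not taken from H2O's implementation.

```python
# Hypothetical levels; H2O's real level order for a column may differ.
levels = ["a", "b", "c", "d", "e", "f"]

# Map each level string to an integer code (here: its index in the list).
codes = {level: i for i, level in enumerate(levels)}


def bitset_split(code, left_codes):
    """Group split: go left if the code is in an arbitrary set of codes."""
    return "left" if code in left_codes else "right"


def ordinal_split(code, threshold):
    """Ordinal split: go left if the code is below a numeric threshold."""
    return "left" if code < threshold else "right"


# The quoted example: {0, 4, 5} on one side and {1, 2, 3} on the other.
group = {lvl: bitset_split(c, {0, 4, 5}) for lvl, c in codes.items()}

# An ordinal split can only cut the code range into two contiguous pieces.
ordinal = {lvl: ordinal_split(c, 2.5) for lvl, c in codes.items()}

print(group)
print(ordinal)
```

A bitset split would be expected to show up as a list of levels, while an ordinal split would show up as a single number, which matches the two kinds of node output seen above.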
Examining the column_types map used in the pandas-to-H2OFrame conversion, and printing the types before training the model, shows that the appropriate fields are being set as enum, so this output is confusing. Does anyone know of other debugging steps that could be taken here, or what could be going on?
This is not a bug in the algorithm (the splits are still correct) but in the way H2O-3 represents splits in the MOJO Tree Visualizer and the tree API. I've created a JIRA ticket that you can track here (or add to), which should make the splits shown by the MOJO Tree Visualizer and the tree API less confusing (i.e., showing either numeric splits or the list of categorical levels, instead of both). The numeric splits you see correspond to our internal method for doing categorical splits.
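As a rough way to read such a node, the sketch below turns a numeric threshold back into two groups of level names. It assumes, without confirmation from H2O, that the threshold is compared against each level's index in the column's level list. The helper levels_by_direction and the 5000 generated level names are hypothetical stand-ins; in practice the level list would come from the training frame.

```python
def levels_by_direction(levels, threshold):
    """Partition levels by comparing each level's index to the threshold.

    Assumption (not confirmed by H2O): the reported threshold applies to
    the index of each level in `levels`.
    """
    left = [lvl for i, lvl in enumerate(levels) if i < threshold]
    right = [lvl for i, lvl in enumerate(levels) if i >= threshold]
    return left, right


# Hypothetical column with 5000 levels, standing in for some_categorical.
levels = [f"cat{i}" for i in range(5000)]

# Threshold taken from the node printed in the question.
left, right = levels_by_direction(levels, 2562.5)

print(len(left), len(right))
print(left[-1], right[0])
```

Under that assumption, a node that prints only a threshold still describes a categorical split: every level whose index is below the threshold goes left and the rest go right.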