python csv parquet h2o

H2O parses file types differently


I am seeing different behaviour depending on the file type when H2O parses the same data. I am using a small dataset (200 rows, 34 columns) that is available in both Parquet and CSV format.

When I parse the CSV file, I can see that the boolean columns (male/female) are correctly identified as enum:

print(h2o_frame_csv.types)

{'C1': 'int', 'userId': 'int', 'itemId': 'int', 'rating': 'int', 'timestamp': 'int', 'movieId': 'int', 'movieTitle': 'enum', 'releaseDate': 'time', 'videoReleaseDate': 'int', 'imdbUrl': 'enum', 'unknown': 'int', 'Action': 'int', 'Adventure': 'int', 'Animation': 'int', 'Childrens': 'int', 'Comedy': 'int', 'Crime': 'int', 'Documentary': 'int', 'Drama': 'int', 'Fantasy': 'int', 'FilmNoir': 'int', 'Horror': 'int', 'Musical': 'int', 'Mystery': 'int', 'Romance': 'int', 'SciFi': 'int', 'Thriller': 'int', 'War': 'int', 'Western': 'int', 'age': 'int', 'occupation': 'enum', 'zipCode': 'int', 'male': 'enum', 'female': 'enum'}

However, when I use the Parquet version of the file, the same columns are treated as int:

print(h2o_frame_parquet.types)

{'Unnamed: 0': 'int', 'userId': 'int', 'itemId': 'int', 'rating': 'int', 'timestamp': 'int', 'movieId': 'int', 'movieTitle': 'enum', 'releaseDate': 'time', 'videoReleaseDate': 'int', 'imdbUrl': 'enum', 'unknown': 'int', 'Action': 'int', 'Adventure': 'int', 'Animation': 'int', 'Childrens': 'int', 'Comedy': 'int', 'Crime': 'int', 'Documentary': 'int', 'Drama': 'int', 'Fantasy': 'int', 'FilmNoir': 'int', 'Horror': 'int', 'Musical': 'int', 'Mystery': 'int', 'Romance': 'int', 'SciFi': 'int', 'Thriller': 'int', 'War': 'int', 'Western': 'int', 'age': 'int', 'occupation': 'enum', 'zipCode': 'int', 'male': 'int', 'female': 'int', '__index_level_0__': 'int'}

This becomes an issue when trying to train a classifier. Some metrics are not available because H2O treats the problem as regression rather than binomial classification. See below:

print(f"For {file_type} dataset the metric class is {type(xgb.model_performance(xval=True))}")

For csv dataset the metric class is <class 'h2o.model.metrics_base.H2OBinomialModelMetrics'>

For parquet dataset the metric class is <class 'h2o.model.metrics_base.H2ORegressionModelMetrics'>

What is the reason for treating boolean values as numeric (int) when parsing the parquet file? Are booleans not considered categorical enums like in the CSV file?


Solution

  • [Revised based on comments and software updates from @MichalKurka below.]

    Parquet files include metadata about the column type. H2O-3 honors the metadata.

    CSV files carry no type metadata, so H2O-3 guesses each column's type from its values.

    In H2O-3 versions 3.28.0.1 and higher, columns in a parquet dataset with a boolean type are treated as an enum value (aka categorical). Prior versions of H2O-3 treated a parquet boolean column as a numeric value.
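On an H2O-3 version older than 3.28.0.1, or whenever a 0/1 column comes in as int, one possible workaround is to convert the affected columns to enum after import with `asfactor()`. The sketch below is only an illustration. The helper names `ensure_categorical` and `load_with_categoricals` are hypothetical, and it assumes the columns to convert are `male` and `female` as in the question.

```python
def ensure_categorical(frame, columns):
    """Convert the named columns of an H2OFrame to enum, in place.

    Columns that H2O already parsed as enum are left untouched.
    """
    for col in columns:
        if frame.types[col] != "enum":
            # asfactor() returns the column as a categorical (enum) column
            frame[col] = frame[col].asfactor()
    return frame


def load_with_categoricals(path, columns):
    """Import a file into H2O and force the given columns to enum.

    `path` and `columns` are supplied by the caller; nothing here is
    specific to the dataset in the question.
    """
    import h2o  # imported lazily so the helper above has no hard dependency

    h2o.init()
    frame = h2o.import_file(path)
    return ensure_categorical(frame, columns)
```

Calling `load_with_categoricals("ratings.parquet", ["male", "female"])` (the path is made up) should give a frame whose `types` report `enum` for both columns regardless of file format. If the response column is one of them, H2O will then train a binomial model and return binomial metrics.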