pandas, dask, pyarrow, fastparquet

Converting NaN floats to other types in Parquet format


I am currently processing a batch of CSV files and converting them to Parquet. I use these with Hive and query the files directly. I would like to switch to Dask for my data processing. The data I am reading has optional columns, some of which are Boolean. I know pandas does not support optional bool types at this time, but is there any way to tell either fastparquet or PyArrow what type a field should be? I am fine with the data being float64 in my DataFrame, but it can't be stored that way in my Parquet store because the existing files already use an optional Boolean type.


Solution

  • Try the fastparquet engine with the following keyword argument:

    object_encoding={'bool_col': 'bool'}
    

    Also, pandas now supports boolean columns with missing values through the nullable "boolean" extension dtype, although it is not yet the default. Columns of that type should work directly.

    Example

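    import fastparquet as fp
    import pandas as pd
    # Mixing ints with the string 'nan' makes 'a' an object-dtype column
    df = pd.DataFrame({'a': [0, 1, 'nan']})
    fp.write('out.parq', df, object_encoding={'a': 'bool'})
    # Same call after casting 'a' to float64; this overwrites out.parq
    fp.write('out.parq', df.astype(float), object_encoding={'a': 'bool'})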
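
For the nullable boolean extension type mentioned above, here is a minimal sketch of converting a float64 column with NaNs before writing. It assumes pandas 1.0 or later and that the column only contains 0.0, 1.0, and NaN; the column name "flag" is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical float64 column: 0.0/1.0 flags with NaN marking missing values
df = pd.DataFrame({"flag": [0.0, 1.0, np.nan]})

# Cast to the nullable extension dtype; NaN becomes pd.NA.
# Values other than 0 and 1 would raise a TypeError here.
df["flag"] = df["flag"].astype("boolean")

print(df["flag"].dtype)
print(df["flag"].isna().tolist())
```

After this cast the column is no longer float64, so a later to_parquet call should be able to store it as an optional boolean rather than a double. It is worth checking the schema of the written file against the existing Hive tables before relying on it.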