python, pandas, types, null, parquet

Retain None in pandas DataFrame (in spite of astype() and to_parquet())


How can I force a pandas DataFrame to retain None values, even when using astype()?

Details

Since the pd.DataFrame constructor accepts only a single dtype for the whole frame rather than one dtype per column, I fix the column types (required for to_parquet()) with the following function:

import numpy as np
import pandas as pd

def _typed_dataframe(data: list) -> pd.DataFrame:
    # Target dtype for each column
    typing = {
        'name': str,
        'value': np.float64,
        'info': str,
        'scale': np.int8,
    }
    result = pd.DataFrame(data)
    # Cast one column at a time
    for label in result.keys():
        result[label] = result[label].astype(typing[label])
    return result

Unfortunately, result['info'] = result['info'].astype(str) transforms all None values in info to "None" strings. How can I prevent this, i.e. retain the None values?

To be more precise: None values in data become np.nan in the result DataFrame, which become "nan" by astype(str), which become "None" when extracted from result.
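The effect described above can be reproduced with a small sketch. The sample rows are hypothetical, and the exact text that replaces the missing entry (for example "None" or "nan") depends on how pandas stored it and on the pandas version, so the snippet prints the result instead of assuming it.

```python
import pandas as pd

# Hypothetical sample rows; the second row has no 'info' text.
data = [
    {'name': 'a', 'info': 'first'},
    {'name': 'b', 'info': None},
]

frame = pd.DataFrame(data)
print(frame['info'].isna().tolist())   # missing entries before the cast

# Cast to str as in the question, then check what became of the missing entry.
as_text = frame['info'].astype(str)
print(as_text.tolist())
print(as_text.isna().tolist())         # missing entries after the cast
```

If the second list no longer flags the missing entry, the cast has replaced it with an ordinary string, which is the problem the question is about.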


Solution

  • Following @frosty's comment, we can use the alternative

        typing = {
            'name': str,
            'value': np.float64,
            'info': pd.StringDtype(),
            'scale': np.int8,
        }    
    

    However, this requires pandas 1.0.0 or later, the release that introduced pd.StringDtype.


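    As a better solution, you can replace

    for label in result.keys():
        result[label] = result[label].astype(typing[label])
    

    by

    result = result.astype(typing)
    

    Note that calling result.astype(typing) without assigning its return value has no visible effect, because astype() returns a new DataFrame instead of modifying result in place. DataFrame.astype() does accept a mapping from column name to dtype.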
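Putting the two suggestions together, a minimal sketch of the function could look like this. It assumes pandas 1.0.0 or later; the sample rows are hypothetical, and the leading underscore is dropped from the function name only for this illustration.

```python
import numpy as np
import pandas as pd


def typed_dataframe(data: list) -> pd.DataFrame:
    # Column-to-dtype mapping from the solution; 'info' uses the nullable
    # string dtype so that missing entries stay missing.
    typing = {
        'name': str,
        'value': np.float64,
        'info': pd.StringDtype(),
        'scale': np.int8,
    }
    # astype() returns a new DataFrame, so its result has to be returned
    # (or assigned) rather than discarded.
    return pd.DataFrame(data).astype(typing)


# Hypothetical sample rows; the second row has no 'info' text.
rows = [
    {'name': 'a', 'value': 1.5, 'info': 'first', 'scale': 1},
    {'name': 'b', 'value': 2.0, 'info': None, 'scale': 2},
]
frame = typed_dataframe(rows)
print(frame.dtypes)
print(frame['info'].isna().tolist())
```

With pd.StringDtype() the missing entry is reported by isna() and is represented as pd.NA rather than as the Python object None. Unlike the original loop, which only casts the columns that are present, astype() with a mapping raises a KeyError if the mapping names a column that the DataFrame does not contain.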