Search code examples
pythonpandasdataframeparquetpyarrow

Can I store a Parquet file with a dictionary column having mixed types in their values?


I am trying to store a Python Pandas DataFrame as a Parquet file, but I am experiencing some issues. One of the columns of my Pandas DF contains dictionaries as such:

import pandas as pandas

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        { "Field": "Value" },
        { "Field": "Value2" },
        { "Field": "Value3" }
    ]
})

df.to_parquet("test.parquet")

Now, that works perfectly fine, the problem is when one of the nested values of the dictionary has a different type than the rest. For instance:

import pandas as pandas

df = pd.DataFrame({
    "ColA": [1, 2, 3],
    "ColB": ["X", "Y", "Z"],
    "ColC": [
        { "Field": "Value" },
        { "Field": "Value2" },
        { "Field": ["Value3"] }
    ]
})

df.to_parquet("test.parquet")

This throws the following error:

ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column ColC with type object')

Notice how, for the last row of the DF, the Field property of the ColC dictionary is a list instead of a string.

Is there any workaround to be able to store this DF as a Parquet file?


Solution

  • ColC is a UDT (user defined type) with one field called Field of type Union of String, List of String.

    In theory arrow supports it, but in practice it has a hard time figuring out what the type of ColC is. Even if you were providing the schema of your data frame explicitly, it wouldn't work because this type of conversion (converting unions from pandas to arrow/parquet) isn't supported yet.

    union_type = pa.union(
        [pa.field("0",pa.string()), pa.field("1", pa.list_(pa.string()))],
        'dense'
    )
    col_c_type = pa.struct(
        [
            pa.field('Field', union_type)
        ]
    )
    
    schema=pa.schema(
        [
            pa.field('ColA', pa.int32()),
            pa.field('ColB', pa.string()),
            pa.field('ColC', col_c_type),
        ]
    )
    
    df = pd.DataFrame({
        "ColA": [1, 2, 3],
        "ColB": ["X", "Y", "Z"],
        "ColC": [
            { "Field": "Value" },
            { "Field": "Value2" },
            { "Field": ["Value3"] }
        ]
    })
    
    pa.Table.from_pandas(df, schema)
    

    This gives you this error:

    ('Sequence converter for type union[dense]<0: string=0, 1: list<item: string>=1> not implemented', 'Conversion failed for column ColC with type object'

    Even if you create the arrow table manually it won't be able to convert it to parquet (again, union are not supported).

    import io
    import pyarrow.parquet as pq
    
    col_a = pa.array([1, 2, 3], pa.int32())
    col_b = pa.array(["X", "Y", "Z"], pa.string())
    
    xs = pa.array(["Value", "Value2", None], type=pa.string())
    ys = pa.array([None, None, ["value3"]], type=pa.list_(pa.string()))
    types = pa.array([0, 0, 1], type=pa.int8())
    
    col_c = pa.UnionArray.from_sparse(types, [xs, ys])
    
    table = pa.Table.from_arrays(
        [col_a, col_b, col_c],
        schema=pa.schema([
            pa.field('ColA', col_a.type),
            pa.field('ColB', col_b.type),
            pa.field('ColC', col_c.type),
        ])
    )
    
    with io.BytesIO() as buffer:
        pq.write_table(table, buffer)
    
    Unhandled type for Arrow to Parquet schema conversion: sparse_union<0: string=0, 1: list<item: string>=1>
    

    I think your only option for now it to use a struct where fields have got different names for string value and list of string values.

    df = pd.DataFrame({
        "ColA": [1, 2, 3],
        "ColB": ["X", "Y", "Z"],
        "ColC": [
            { "Field1": "Value" },
            { "Field1": "Value2" },
            { "Field2": ["Value3"] }
        ]
    })
    
    df.to_parquet('/tmp/hello')