
PyArrow: Store list of dicts in parquet using nested types


I want to store the following pandas data frame in a parquet file using PyArrow:

import pandas as pd
df = pd.DataFrame({'field': [[{}, {}]]})

The type of the field column is list of dicts:

      field
0  [{}, {}]

I first define the corresponding PyArrow schema:

import pyarrow as pa
schema = pa.schema([pa.field('field', pa.list_(pa.struct([])))])

Then I use from_pandas():

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

This throws the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "table.pxi", line 930, in pyarrow.lib.Table.from_pandas
  File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 371, in dataframe_to_arrays
    convert_types)]
  File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 370, in <listcomp>
    for c, t in zip(columns_to_convert,
  File "/anaconda3/lib/python3.6/site-packages/pyarrow/pandas_compat.py", line 366, in convert_column
    return pa.array(col, from_pandas=True, type=ty)
  File "array.pxi", line 177, in pyarrow.lib.array
  File "error.pxi", line 77, in pyarrow.lib.check_status
  File "error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Unknown list item type: struct<>

Am I doing something wrong or is this not supported by PyArrow?

I use pyarrow 0.9.0, pandas 0.23.4, Python 3.6.


Solution

  • According to this Jira issue, reading and writing nested Parquet data with a mix of struct and list nesting levels was implemented in version 2.0.0.

    The following example demonstrates the implemented functionality by doing a round trip: pandas data frame -> parquet file -> pandas data frame. The PyArrow version used is 3.0.0.

    The initial pandas data frame has one field of type list of dicts and a single row:

                      field
    0  [{'a': 1}, {'a': 2}]
    

    Example code:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet
    
    df = pd.DataFrame({'field': [[{'a': 1}, {'a': 2}]]})
    schema = pa.schema(
        [pa.field('field', pa.list_(pa.struct([('a', pa.int64())])))])
    table_write = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
    pyarrow.parquet.write_table(table_write, 'test.parquet')
    table_read = pyarrow.parquet.read_table('test.parquet')
    table_read.to_pandas()
    

    The output data frame is the same as the input data frame, as it should be:

                      field
    0  [{'a': 1}, {'a': 2}]
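
    Rather than comparing the printed output by eye, the round trip can also be checked programmatically. The following is a minimal sketch of that idea; the buffer-based I/O and the variable names (`buf`, `df_read`, `row`) are illustrative choices, not part of the original answer. Note that `to_pandas()` returns the nested cells as sequences of dicts, so the check compares individual values:

    ```python
    import io

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet

    # Same data and schema as in the example above.
    df = pd.DataFrame({'field': [[{'a': 1}, {'a': 2}]]})
    schema = pa.schema(
        [pa.field('field', pa.list_(pa.struct([('a', pa.int64())])))])
    table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

    # Round trip through an in-memory buffer instead of a file on disk.
    buf = io.BytesIO()
    pyarrow.parquet.write_table(table, buf)
    buf.seek(0)
    df_read = pyarrow.parquet.read_table(buf).to_pandas()

    # Each cell comes back as a sequence of dicts with the original values.
    row = list(df_read['field'][0])
    print(row[0]['a'], row[1]['a'])
    ```

    The same pattern works with a file path in place of the buffer, as in the example above.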