Tags: python, python-3.x, parquet, pyarrow

How to provide parquet schema while writing parquet file using PyArrow


I have raw input CSV data in which every field is a string. I want to convert this CSV to Parquet, and I want to apply a custom schema to the data when writing it. I am using PyArrow for the CSV to Parquet conversion.

How can I provide a custom schema while writing the file to parquet using PyArrow?

Here is the code I used:

import pyarrow as pa 
import pyarrow.parquet as pq

# records is a list of lists containing the rows of the csv
table = pa.Table.from_pylist(records)
pq.write_table(table,"sample.parquet")

Solution

  • Could you give an example of records? If I try to use a list of lists as suggested, it fails:

    >>> pa.Table.from_pylist([["1", "2"], ["first", "second"]])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "pyarrow/table.pxi", line 3682, in pyarrow.lib.Table.from_pylist
        return _from_pylist(cls=Table,
      File "pyarrow/table.pxi", line 5199, in pyarrow.lib._from_pylist
        names = list(mapping[0].keys())
    AttributeError: 'list' object has no attribute 'keys'
    

    Based on the documentation, I would expect records to be a list of dicts:

    data = [{'strs': '', 'floats': 4.5},
            {'strs': 'foo', 'floats': 5},
            {'strs': 'bar', 'floats': None}]
    table = pa.Table.from_pylist(data)
    

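    You can pass a schema when building the table with from_pylist. In the example below, the key 'b' is not in the schema and is dropped, while the schema fields 'c' and 'd' are absent from the rows and are filled with nulls: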

    schema = pa.schema([('a', pa.int64()),
                        ('c', pa.int32()),
                        ('d', pa.int16())
                        ])
    table = pa.Table.from_pylist(
        [{'a': 1, 'b': 3}, {'a': 2, 'b': 4}, {'a': 3, 'b': 5}],
        schema=schema
    )
    data = [{'a': 1, 'c': None, 'd': None},
            {'a': 2, 'c': None, 'd': None},
            {'a': 3, 'c': None, 'd': None}]
    assert table.schema == schema
    assert table.to_pylist() == data