Tags: python, pyspark, parquet, pyarrow

pyarrow.lib.Schema vs. pyarrow.parquet.Schema


When I try to load a Parquet file spread across many partitions, some of the partition schemas are inferred incorrectly: columns with missing data get filled with nulls and typed as null. I would think specifying the schema in pyarrow.parquet.ParquetDataset would fix this, but I don't know how to construct a schema of the correct pyarrow.parquet.Schema type. Some example code:

import pyarrow as pa
import pyarrow.parquet as pq

test_schema = pa.schema([pa.field('field1', pa.string()), pa.field('field2', pa.float64())])
paths = ['test_root/partition1/file1.parquet', 'test_root/partition2/file2.parquet']
dataset = pq.ParquetDataset(paths, schema=test_schema)

And the error:

AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema'

But I can't find any documentation on how to construct a pyarrow.parquet.Schema as referenced in the docs (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html); I have only been able to make a pyarrow.lib.Schema, which gives the above error.


Solution

  • There is not yet an API to construct a Parquet schema in Python. You can use one that you read from a particular file, though (see pq.ParquetFile(...).schema, and the sketch after this answer).

    Could you open an issue on the Arrow JIRA project to request the feature to construct Parquet schemas in Python?

    https://issues.apache.org/jira
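
As a sketch of that workaround, assuming the legacy pyarrow version the question targets, and that one of the files from the question (test_root/partition1/file1.parquet) already exists with the desired columns:

import pyarrow.parquet as pq

# Read the Parquet-level schema (pyarrow.parquet.Schema) from an existing file
pq_schema = pq.ParquetFile('test_root/partition1/file1.parquet').schema

# ParquetDataset calls .to_arrow_schema() on whatever is passed as schema=,
# which is why a pyarrow.lib.Schema raises the AttributeError above; a schema
# read from a file has that method.
paths = ['test_root/partition1/file1.parquet', 'test_root/partition2/file2.parquet']
dataset = pq.ParquetDataset(paths, schema=pq_schema)

# If the Arrow-level schema is needed elsewhere, it can be recovered from it:
arrow_schema = pq_schema.to_arrow_schema()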