
How to read columns of a hive partitioned parquet file in python?



Normally you read the schema of a parquet file like this:
import pyarrow.parquet as pq
sch = pq.read_schema(path+filename, memory_map=True)

But this does not work for hive-partitioned datasets.
I tried adding the

partitioning='hive'

option, but read_schema does not support it.
How do I get the columns / schema of such a dataset?


Solution

  • You can use pyarrow.parquet.ParquetDataset.schema:

    import pyarrow as pa
    import pyarrow.parquet as pq
    
    table = pa.table(
        {
            "col1": pa.array(['a', 'a', 'b'], pa.string()),
            "col2": pa.array([1, 2, 3], pa.int32()),
        }
    )
    
    pq.write_to_dataset(
        table,
        "./dataset",
        partition_cols=['col1']  # hive-style partitioning on col1
    )
    
    schema = pq.ParquetDataset("./dataset").schema
    

    But you may be in for surprises: write_to_dataset doesn't write any metadata file, so ParquetDataset has to guess the schema from the first parquet file it can find. It also has a hard time figuring out the types of the partition columns, since their values are encoded in the directory names rather than stored in the files.