import pyarrow.parquet as pq
sch = pq.read_schema(path+filename, memory_map=True)
But this does not work for hive-partitioned datasets. I tried adding the
partitioning='hive'
option, but it is not implemented for read_schema.
How do I get the columns / schema of such a dataset?
You can use pyarrow.parquet.ParquetDataset.schema:
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and write it out as a hive-partitioned
# dataset, partitioned on col1.
table = pa.table(
    {
        "col1": pa.array(['a', 'a', 'b'], pa.string()),
        "col2": pa.array([1, 2, 3], pa.int32()),
    }
)
pq.write_to_dataset(
    table,
    "./dataset",
    partition_cols=['col1'],
)

# The unified schema of the dataset, including the partition column.
schema = pq.ParquetDataset("./dataset").schema
But you may be in for surprises, because write_to_dataset doesn't write any dataset-level metadata. ParquetDataset therefore has to guess the schema from the first parquet file it can find, and it also has a hard time figuring out the types of the partition columns, which it can only infer from the directory names.
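If you need the partition column types to come out right, one option is to describe the partitioning explicitly through the pyarrow.dataset API instead. A minimal sketch, assuming a reasonably recent pyarrow and the ./dataset directory written above:

import pyarrow as pa
import pyarrow.dataset as ds

# Declare the partition column and its type explicitly, instead of
# letting pyarrow infer it from the directory names.
part = ds.partitioning(
    pa.schema([("col1", pa.string())]),
    flavor="hive",
)

dataset = ds.dataset("./dataset", format="parquet", partitioning=part)
print(dataset.schema)

Passing partitioning="hive" instead of an explicit schema also works, but then the partition column types are still inferred; with an explicit schema you get exactly the types you declared.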