I am trying to use pyarrow to partition and write parquet files:
!pip install pyarrow==13.0.0
import pyarrow as pa
table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
                  'n_legs': [2, 2, 4, 4, 5, 100],
                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
                             "Brittle stars", "Centipede"]})
import pyarrow.parquet as pq
pq.write_to_dataset(table, root_path='dataset_name_3',
                    partition_cols=['year'])
p_files = pq.ParquetDataset('dataset_name_3', use_legacy_dataset=False).files
import pandas as pd
pd.read_parquet(path=p_files[0])
Output:
   n_legs         animal
0       5  Brittle stars
As shown in the output above, reading one of the partition files returns only two columns, n_legs and animal. The column I partitioned on, year, gets dropped. Any suggestions to fix this?
You are saving the table as a partitioned dataset but then reading a single parquet file. That single file holds only part of the dataset and therefore does not contain all the data: the values of the partition column are not stored inside the files at all, but in the names of the partition directories:
ls dataset_name_3
'year=2019' 'year=2020' 'year=2021' 'year=2022'
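If you really do want to read an individual file, the partition value can be recovered from its path. A minimal sketch (not from the original answer; assumes pyarrow 13 and the hive-style year=<value> directories that write_to_dataset produces):

import os
import pandas as pd
import pyarrow.parquet as pq

# File paths look like 'dataset_name_3/year=2019/<filename>.parquet'
p_files = pq.ParquetDataset('dataset_name_3', use_legacy_dataset=False).files

def read_partition_file(path):
    # The file itself only stores n_legs and animal ...
    df = pd.read_parquet(path)
    # ... so restore the partition column from the 'year=<value>' directory name.
    key, value = os.path.basename(os.path.dirname(path)).split('=', 1)
    df[key] = int(value)  # partition values are strings in the path; cast as needed
    return df

print(read_partition_file(p_files[0]))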
If you use ParquetDataset as intended, reading the dataset as a whole rather than only using it to get file names, the partition column is there:
>>> ds = pq.ParquetDataset('dataset_name_3')
>>> ds.read().to_pandas()
   n_legs         animal  year
0       5  Brittle stars  2019
1       2       Flamingo  2020
2       4            Dog  2021
3     100      Centipede  2021
4       2         Parrot  2022
5       4          Horse  2022
See the docs for ParquetDataset or the more general pyarrow.dataset API for details and examples.
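As a rough illustration of that more general API (a sketch, assuming pyarrow 13 and the hive-style partitioning written by write_to_dataset):

import pyarrow.dataset as ds

# Discover the partitioned dataset; partitioning='hive' tells pyarrow to parse
# the year=<value> directory names back into a 'year' column.
dataset = ds.dataset('dataset_name_3', format='parquet', partitioning='hive')

# Read everything, or push a filter down so that only the matching
# partition directories are read at all.
print(dataset.to_table().to_pandas())
print(dataset.to_table(filter=ds.field('year') == 2021).to_pandas())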