
pyarrow write dataset drops partition columns


I am trying to use pyarrow to partition and write Parquet files:

!pip install pyarrow==13.0.0

import pyarrow as pa
table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
                  'n_legs': [2, 2, 4, 4, 5, 100],
                  'animal': ["Flamingo", "Parrot", "Dog", "Horse",
                             "Brittle stars", "Centipede"]})

import pyarrow.parquet as pq
pq.write_to_dataset(table, root_path='dataset_name_3',
                    partition_cols=['year'])
p_files = pq.ParquetDataset('dataset_name_3', use_legacy_dataset=False).files

import pandas as pd
pd.read_parquet(path=p_files[0])

Output:

   n_legs   animal
0   5      Brittle stars

As shown in the output above, reading one of the partition files returns only two columns, n_legs and animal. The year column, which I partitioned on, gets dropped.

Any suggestions to fix this?


Solution

  • You are saving the table as a partitioned dataset but reading a single parquet file. A single parquet file is only part of the dataset and thus does not hold all the data. The partition column is not gone, though: its values are encoded in the names of the partition directories:

    ls dataset_name_3                                                                 
    'year=2019'  'year=2020'  'year=2021'  'year=2022'
    

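    Because those directory names follow the Hive-style `key=value` convention, pointing pandas at the dataset *root* (rather than at one file inside it) is enough to restore the column. A minimal sketch that rebuilds the same dataset in a temporary directory (the temp-dir setup is just for a self-contained demo):

    ```python
    import tempfile

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Rebuild the same partitioned dataset in a scratch directory
    table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
                      'n_legs': [2, 2, 4, 4, 5, 100],
                      'animal': ["Flamingo", "Parrot", "Dog", "Horse",
                                 "Brittle stars", "Centipede"]})

    root = tempfile.mkdtemp()
    pq.write_to_dataset(table, root_path=root, partition_cols=['year'])

    # Reading the dataset root recovers 'year' from the 'year=...' dir names
    df = pd.read_parquet(root)
    print(sorted(df.columns))  # ['animal', 'n_legs', 'year']
    ```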
    If you read the dataset as intended, instead of only using it to list file names, the partition column is there:

    >>> ds = pq.ParquetDataset('dataset_name_3')
    >>> ds.read().to_pandas()
       n_legs         animal  year
    0       5  Brittle stars  2019
    1       2       Flamingo  2020
    2       4            Dog  2021
    3     100      Centipede  2021
    4       2         Parrot  2022
    5       4          Horse  2022
    

    See the docs for ParquetDataset, or the more general pyarrow.dataset API, for details and examples.
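
    The more general `pyarrow.dataset` API does the same hive-partition discovery and also lets you filter on the partition column, pruning whole directories. A sketch under the same assumptions (writing to a temporary directory purely for a self-contained demo):

    ```python
    import tempfile

    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({'year': [2020, 2022, 2021, 2022, 2019, 2021],
                      'n_legs': [2, 2, 4, 4, 5, 100],
                      'animal': ["Flamingo", "Parrot", "Dog", "Horse",
                                 "Brittle stars", "Centipede"]})

    root = tempfile.mkdtemp()
    # write_dataset with hive partitioning produces the same 'year=...' layout
    ds.write_dataset(
        table, root, format='parquet',
        partitioning=ds.partitioning(pa.schema([('year', pa.int64())]),
                                     flavor='hive'))

    # 'hive' partitioning parses year=... back into a real column
    dataset = ds.dataset(root, format='parquet', partitioning='hive')
    result = dataset.to_table().to_pandas()
    print('year' in result.columns)  # True

    # A filter on the partition column only touches matching directories
    old = dataset.to_table(filter=ds.field('year') < 2021).to_pandas()
    ```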