
Pyarrow: How to specify the dtype of partition keys in partitioned parquet datasets?


I would like to create a partitioned pyarrow dataset with strings as partition keys:

import pandas as pd
import pyarrow as pa
from pyarrow import parquet as pq

data = {'key': ['001', '001', '002', '002'],
        'value_1': [10, 20, 100, 200],
        'value_2': ['a', 'b', 'a', 'b']}

df = pd.DataFrame(data)
tbl = pa.Table.from_pandas(df)

# write data to partitioned dataset
pq.write_to_dataset(tbl, 'partitioned_data', partition_cols=['key'])

print(pa.__version__)
print(df)
print(tbl)
4.0.0

   key  value_1 value_2
0  001       10       a
1  001       20       b
2  002      100       a
3  002      200       b

pyarrow.Table
key: string
value_1: int64
value_2: string

The partitioned dataset appears as follows in the filesystem:

partitioned_data
├── key=001
│   └── 9acb170b99d14f1eba72af3697c71b8c.parquet
└── key=002
    └── 836f365800f0449b956eb35de67bbc8c.parquet

Keys and values appear as folder names for the different partitions as expected. Now I import the partitioned data again:

# read the partitioned data and convert to DataFrame
imported_tbl = pq.read_table('partitioned_data')
imported_df = imported_tbl.to_pandas()

print(imported_tbl)
print(imported_df)
pyarrow.Table
value_1: int64
value_2: string
key: dictionary<values=int32, indices=int32, ordered=0>


   value_1 value_2 key
0       10       a   1
1       20       b   1
2      100       a   2
3      200       b   2

In the imported data, the dtype of 'key' has changed from string to dictionary&lt;values=int32, indices=int32, ordered=0&gt;, resulting in incorrect values. In particular, the leading zeros are lost ('001' becomes 1).

Is there any way to specify the dtype of the key, either on export or import to preserve the values?


Solution

  • With this file structure, no explicit metadata (or schema information) about the partition keys is stored anywhere, so pq.read_table has to guess the type. In your case, despite the leading zeros, it cannot tell that the key is a string and infers an integer instead.

    You can use the dataset API to provide the partitioning schema explicitly:

    import pyarrow.dataset  # makes the pa.dataset namespace available

    my_partitioning = pa.dataset.partitioning(pa.schema([pa.field("key", pa.string())]), flavor='hive')
    my_data_set = pa.dataset.dataset("partitioned_data", partitioning=my_partitioning)
    table = my_data_set.to_table()
    table.to_pandas()
    
    |    |   value_1 | value_2   | key   |
    |---:|----------:|:----------|:------|
    |  0 |        10 | a         | 001   |
    |  1 |        20 | b         | 001   |
    |  2 |       100 | a         | 002   |
    |  3 |       200 | b         | 002   |