I would like to create a partitioned pyarrow dataset with strings as partition keys:
import pandas as pd
import pyarrow as pa
from pyarrow import parquet as pq
data = {'key': ['001', '001', '002', '002'],
'value_1': [10, 20, 100, 200],
'value_2': ['a', 'b', 'a', 'b']}
df = pd.DataFrame(data)
tbl = pa.Table.from_pandas(df)
# write data to partitioned dataset
pq.write_to_dataset(tbl, 'partitioned_data', partition_cols=['key'])
print(pa.__version__)
print(df)
print(tbl)
4.0.0
key value_1 value_2
0 001 10 a
1 001 20 b
2 002 100 a
3 002 200 b
pyarrow.Table
key: string
value_1: int64
value_2: string
The partitioned dataset appears as follows in the filesystem:
partitioned_data
├── key=001
│ └── 9acb170b99d14f1eba72af3697c71b8c.parquet
└── key=002
└── 836f365800f0449b956eb35de67bbc8c.parquet
Keys and values appear as folder names for the different partitions as expected. Now I import the partitioned data again:
# read the partitioned data and convert to DataFrame
imported_tbl = pq.read_table('partitioned_data')
imported_df = imported_tbl.to_pandas()
print(imported_tbl)
print(imported_df)
pyarrow.Table
value_1: int64
value_2: string
key: dictionary<values=int32, indices=int32, ordered=0>
value_1 value_2 key
0 10 a 1
1 20 b 1
2 100 a 2
3 200 b 2
In the imported data, the dtype of 'key' has changed from string to dictionary<values=int32, indices=int32, ordered=0>, resulting in incorrect values. In particular, the leading zeros are lost ('001' becomes 1).
Is there any way to specify the dtype of the key, either on export or import, to preserve the values?
With this file structure, there is no explicit metadata (or schema information) about the partition keys stored anywhere, so pq.read_table has to guess the type. In your case it cannot tell that 'key' is a string (the leading zeros give no hint during parsing) and infers that it is an integer.
You can use the dataset API to provide explicit information about the partitioning:
import pyarrow.dataset as ds

# declare the partition column's type explicitly instead of letting it be inferred
my_partitioning = ds.partitioning(pa.schema([pa.field("key", pa.string())]), flavor='hive')
my_data_set = ds.dataset("partitioned_data", partitioning=my_partitioning)
table = my_data_set.to_table()
table.to_pandas()
|    | value_1   | value_2   | key   |
|---:|----------:|:----------|:------|
|  0 | 10        | a         | 001   |
|  1 | 20        | b         | 001   |
|  2 | 100       | a         | 002   |
|  3 | 200       | b         | 002   |