Tags: parquet, pyarrow, apache-arrow

How to change column datatype with pyarrow


I am reading a set of arrow files and am writing them to a parquet file:

import pathlib
from pyarrow import parquet as pq
from pyarrow import feather
import pyarrow as pa

base_path = pathlib.Path('../mydata')

fields = [
    pa.field('value', pa.int64()),
    pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
]
schema = pa.schema(fields)

with pq.ParquetWriter('sample.parquet', schema) as pqwriter:
    for file_path in base_path.glob('*.arrow'):
        table = feather.read_table(file_path)
        pqwriter.write_table(table)

My problem is that the code field in the arrow files is defined with an int8 index instead of int32. The range of int8, however, is insufficient, so I defined the schema above with an int32 index for the code field in the parquet file.

However, writing the arrow tables to parquet now fails because the schemas do not match.

How can I change the datatype of the arrow column? I checked the pyarrow API and did not find a way to change the schema. Can this be done without roundtripping to pandas?


Solution

  • Arrow ChunkedArray has a cast function, but unfortunately it doesn't support this particular cast:

    >>> table['code'].cast(pa.dictionary(pa.int32(), pa.uint64(), ordered=False))
    Unsupported cast from dictionary<values=uint64, indices=int8, ordered=0> to dictionary<values=uint64, indices=int32, ordered=0> (no available cast function for target type)
    

    Instead you can cast the column to plain pa.uint64() and then dictionary-encode it, which produces int32 indices by default:

    >>> table['code'].cast(pa.uint64()).dictionary_encode().type
    DictionaryType(dictionary<values=uint64, indices=int32, ordered=0>)
    

    Here's a self-contained example:

    import pyarrow as pa
    
    source_schema = pa.schema([
        pa.field('value', pa.int64()),
        pa.field('code', pa.dictionary(pa.int8(), pa.uint64(), ordered=False)),
    ])
    
    source_table = pa.Table.from_arrays([
        pa.array([1, 2, 3], pa.int64()),
        pa.array([1, 2, 1000], pa.dictionary(pa.int8(), pa.uint64(), ordered=False)),
    ], schema=source_schema)
    
    destination_schema = pa.schema([
        pa.field('value', pa.int64()),
        pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
    ])
    
    destination_data = pa.Table.from_arrays([
        source_table['value'],
        source_table['code'].cast(pa.uint64()).dictionary_encode(),
    ], schema=destination_schema)
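
    Applied back to the question's writer loop, the same re-encoding can be done per table before writing. A minimal sketch, assuming the target schema from the question; the helper name `widen_code_index` is mine, not part of pyarrow:

    ```python
    import pyarrow as pa

    # Target schema: 'code' uses an int32 dictionary index.
    schema = pa.schema([
        pa.field('value', pa.int64()),
        pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
    ])

    def widen_code_index(table: pa.Table) -> pa.Table:
        """Re-encode 'code' so its dictionary index widens from int8 to int32."""
        # Decode to plain uint64, then dictionary-encode (int32 indices by default).
        recoded = table['code'].cast(pa.uint64()).dictionary_encode()
        idx = table.schema.get_field_index('code')
        return table.set_column(idx, schema.field('code'), recoded)

    # In the original loop this becomes:
    #     pqwriter.write_table(widen_code_index(feather.read_table(file_path)))
    ```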