How can I change the name of a column in a parquet file using Pyarrow?

I have several hundred parquet files created with PyArrow. Some of those files, however, have a field/column with a slightly different name (we'll call it Orange) than the original column (call it Sporange), because a variant of the query was used to generate them. Otherwise, the files are identical (all the other fields, and all the data). In a database world, I'd do an ALTER TABLE and rename the column. However, I don't know how to do that with parquet/PyArrow.

Is there a way to rename the column in the file, rather than having to regenerate or duplicate the file?

Alternatively, can I read it (with read_table or ParquetFile, I assume), change the column in the object (unsure how to do that), and write it out?

I see "rename_columns", but I'm unsure how it works; when I tried calling it by itself, I got "rename_columns is not defined".

rename_columns(self, names) Create new table with columns renamed to provided names.

Many thanks!

Solution

I suspect you are using a version of pyarrow that doesn't support rename_columns. Can you run pa.__version__ to check?

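Otherwise, what you want to do is straightforward. As far as I know, pyarrow cannot rename a column in place inside an existing file, so the approach is to read the table, rename the column, and write it back out. In the example below I rename column b to c: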

import pyarrow as pa
import pyarrow.parquet as pq

col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())

table = pa.Table.from_arrays(
    [col_a, col_b],
    schema=pa.schema([
        pa.field('a', col_a.type),
        pa.field('b', col_b.type),
    ])
)

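# Write the table to disk, then read it back as you would an existing file.
pq.write_table(table, '/tmp/original')
original = pq.read_table('/tmp/original')
# rename_columns takes one name per column, in column order, and returns a new table.
renamed = original.rename_columns(['a', 'c'])
pq.write_table(renamed, '/tmp/renamed')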