I have several hundred Parquet files created with PyArrow. Some of those files, however, have a field/column with a slightly different name (we'll call it Orange) than the original column (call it Sporange), because a variant of the query was used to create them. Otherwise, the files (all the other fields, and all the data) are identical. In a database world, I'd do an ALTER TABLE and rename the column. However, I don't know how to do that with Parquet/PyArrow.
Is there a way to rename the column in the file, rather than having to regenerate or duplicate the file?
Alternatively, can I read it (with read_table or ParquetFile, I assume), change the column in the object (I'm unsure how to do that), and write it out?
I see "rename_columns", but I'm unsure how it works; when I tried calling it by itself, I got "rename_columns is not defined".
rename_columns(self, names) Create new table with columns renamed to provided names.
Many thanks!
I suspect you are using a version of pyarrow that doesn't support rename_columns. Can you run pa.__version__ to check?
Otherwise, what you want to do is straightforward. In the example below, I rename column b to c:
import pyarrow as pa
import pyarrow.parquet as pq
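# Build a small two-column table to work with.
col_a = pa.array([1, 2, 3], pa.int32())
col_b = pa.array(["X", "Y", "Z"], pa.string())
table = pa.Table.from_arrays(
    [col_a, col_b],
    schema=pa.schema([
        pa.field('a', col_a.type),
        pa.field('b', col_b.type),
    ])
)
pq.write_table(table, '/tmp/original')

# Read the file back, rename column b to c, and write the result to a new file.
# rename_columns takes the full list of names, in column order.
original = pq.read_table('/tmp/original')
renamed = original.rename_columns(['a', 'c'])
pq.write_table(renamed, '/tmp/renamed')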
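
For the case in the question, where only some files contain Orange and the other column names should stay as they are, something like the following might work. It is only a sketch under a few assumptions: the helper names names_with_rename and rename_column_in_file are made up for illustration, the src and dst paths are yours to choose, and each file is small enough to read into memory. Note that rename_columns is a method on the table object, so calling it as a bare function raises a "not defined" error. As far as I know there is no in-place rename, so the file is read and written out again.

```python
def names_with_rename(names, old, new):
    """Return a copy of `names` with every occurrence of `old` replaced by `new`."""
    return [new if name == old else name for name in names]


def rename_column_in_file(src, dst, old="Orange", new="Sporange"):
    """Write `src` to `dst` with column `old` renamed to `new`.

    Hypothetical helper: returns True if a rename happened, and False if
    `old` was not present (in which case nothing is written).
    """
    # Imported here so the pure-Python helper above can be used on its own.
    import pyarrow.parquet as pq

    table = pq.read_table(src)
    if old not in table.column_names:
        return False
    # rename_columns expects the full list of names, in column order.
    new_names = names_with_rename(table.column_names, old, new)
    pq.write_table(table.rename_columns(new_names), dst)
    return True
```

You could then loop over your files and call rename_column_in_file(path, new_path) for each one. Writing to a separate destination keeps the originals intact until you have checked the output.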