
How can I change the name of a column in a parquet file using Pyarrow?


I have several hundred Parquet files created with PyArrow. In some of them, however, one field/column has a slightly different name (we'll call it Orange) from the original column (call it Sporange), because those files were generated with a variant of the query. Otherwise the files are identical: same remaining fields, same data. In a database world, I'd do an ALTER TABLE and rename the column. However, I don't know how to do that with Parquet/PyArrow.

Is there a way to rename the column in the file, rather than having to regenerate or duplicate the file?

Alternatively, can I read it (with read_table or ParquetFile, I assume), change the column name in the resulting object (I'm unsure how to do that), and write it back out?

I see "rename_columns", but I'm unsure how it works. When I tried calling it by itself, I got "rename_columns is not defined".

rename_columns(self, names) Create new table with columns renamed to provided names.

Many thanks!


Solution

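  • I suspect you are using a version of pyarrow that doesn't support rename_columns. Can you run pa.__version__ to check? Note also that rename_columns is a method of pyarrow.Table, so it has to be called on a table object (table.rename_columns([...])); calling it as a bare function gives "rename_columns is not defined" regardless of version.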

    Otherwise, what you want to do is straightforward. In the example below, I rename column b to c:

    import pyarrow as pa
    import pyarrow.parquet as pq
    
    col_a = pa.array([1, 2, 3], pa.int32())
    col_b = pa.array(["X", "Y", "Z"], pa.string())
    
    table = pa.Table.from_arrays(
        [col_a, col_b],
        schema=pa.schema([
            pa.field('a', col_a.type),
            pa.field('b', col_b.type),
        ])
    )
    
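    # Write the table, read it back, then rename the columns.
    # rename_columns takes the full list of new names, in column order.
    pq.write_table(table, '/tmp/original')
    original = pq.read_table('/tmp/original')
    renamed = original.rename_columns(['a', 'c'])
    # Writing produces a new file; the original file is left unchanged.
    pq.write_table(renamed, '/tmp/renamed')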