
ignore columns not present in parquet with pyarrow in pandas


I am trying to read a parquet file with pyarrow==1.0.1 as the engine.

Given:

columns = ['a','b','c']    
pd.read_parquet(x, columns=columns, engine="pyarrow")

if file x does not contain column c, it raises:

    /opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
    /opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset._scanner()
    /opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.from_dataset()
    /opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._populate_builder()
    /opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

    ArrowInvalid: Field named 'c' not found or not unique in the schema.

There is no argument to ignore the error and simply read the missing columns as NaN.

The error handling is also pretty bad.

pyarrow.lib.ArrowInvalid("Field named 'c' not found or not unique in the schema.")

It is pretty hard to extract the field name that was missing, so that it can be removed from the columns passed in the next try.
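The only way to recover the field name is to parse the message text, which is brittle (the message format is not guaranteed to stay stable across pyarrow versions), e.g.:

```python
import re

# Message text copied verbatim from the traceback above.
msg = "Field named 'c' not found or not unique in the schema."

# Pull the quoted field name out of the message; brittle by design.
match = re.search(r"Field named '([^']+)'", msg)
missing = match.group(1) if match else None
# missing is now 'c' and could be dropped from `columns` before retrying.
```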

Is there a way to do this?


Solution

  • You can read the metadata from your parquet file to figure out which columns are available.

    Bear in mind though that pandas won't be able to guess the type of the missing column (c in the example below), which may cause issues when you concatenate tables later.

    import pandas as pd
    import pyarrow.parquet as pq
    
    all_columns = ['a', 'b', 'c']
    
    # Write a sample file that only contains columns 'a' and 'b'.
    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'z']})
    file_name = '/tmp/my_df.pq'
    df.to_parquet(file_name)
    
    # Inspect the file's schema and keep only the columns actually present.
    parquet_file = pq.ParquetFile(file_name)
    columns_in_file = [c for c in all_columns if c in parquet_file.schema.names]
    
    # Read the available columns, then reindex to add the missing ones as NaN.
    df = (
        parquet_file
            .read(columns=columns_in_file)
            .to_pandas()
            .reindex(columns=all_columns)
    )
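To address the dtype caveat above, the missing column can be cast explicitly after the reindex. A minimal pandas-only sketch, where `expected_dtypes` is a hypothetical mapping you would maintain yourself:

```python
import pandas as pd

all_columns = ['a', 'b', 'c']
# Hypothetical dtype map for the full column set, known in advance.
expected_dtypes = {'a': 'int64', 'b': 'object', 'c': 'float64'}

# Stand-in for the frame read from the parquet file (column 'c' missing).
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'z']})

# reindex adds 'c' filled with NaN; astype then pins every column to its
# expected dtype, so later concatenations don't mix types.
df = df.reindex(columns=all_columns).astype(expected_dtypes)
```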