
ignore columns not present in parquet with pyarrow in pandas


I am trying to read a parquet file with pyarrow==1.0.1 as the engine.

Given:

columns = ['a','b','c']    
pd.read_parquet(x, columns=columns, engine="pyarrow")

if file x does not contain column c, it raises:

    /opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset.to_table()
    /opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Dataset._scanner()
    /opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset.Scanner.from_dataset()
    /opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/_dataset.pyx in pyarrow._dataset._populate_builder()
    /opt/anaconda3/.../lib/python3.6/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

    ArrowInvalid: Field named 'c' not found or not unique in the schema.

There is no argument to ignore the error and simply read the missing columns as NaN.

The error handling is also pretty bad.

pyarrow.lib.ArrowInvalid("Field named 'c' not found or not unique in the schema.")

It is pretty hard to extract the field name that was missing, so that it can be removed from the columns passed in the next try.
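The only way to recover the field name is to parse the message text, which is brittle (the message format is not guaranteed to stay stable across pyarrow versions), e.g.:

```python
import re

# Message text copied verbatim from the traceback above.
msg = "Field named 'c' not found or not unique in the schema."

# Pull the quoted field name out of the message; brittle by design.
match = re.search(r"Field named '([^']+)'", msg)
missing = match.group(1) if match else None
# missing is now 'c' and could be dropped from `columns` before retrying.
```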

Is there a way to do this?


Solution

  • You can read the metadata from your parquet file to figure out which columns are available.

    Bear in mind though that pandas won't be able to guess the type of the missing column (c in the example below), which may cause issues when you concatenate tables later.

    import pandas as pd
    import pyarrow.parquet as pq
    
    all_columns = ['a', 'b', 'c']
    
    # Write a sample file that only contains columns 'a' and 'b'.
    df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'z']})
    file_name = '/tmp/my_df.pq'
    df.to_parquet(file_name)
    
    # Inspect the file's schema and keep only the columns actually present.
    parquet_file = pq.ParquetFile(file_name)
    columns_in_file = [c for c in all_columns if c in parquet_file.schema.names]
    
    # Read the available columns, then reindex to add the missing ones as NaN.
    df = (
        parquet_file
            .read(columns=columns_in_file)
            .to_pandas()
            .reindex(columns=all_columns)
    )
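To address the dtype caveat above, the missing column can be cast explicitly after the reindex. A minimal pandas-only sketch, where `expected_dtypes` is a hypothetical mapping you would maintain yourself:

```python
import pandas as pd

all_columns = ['a', 'b', 'c']
# Hypothetical dtype map for the full column set, known in advance.
expected_dtypes = {'a': 'int64', 'b': 'object', 'c': 'float64'}

# Stand-in for the frame read from the parquet file (column 'c' missing).
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['foo', 'bar', 'z']})

# reindex adds 'c' filled with NaN; astype then pins every column to its
# expected dtype, so later concatenations don't mix types.
df = df.reindex(columns=all_columns).astype(expected_dtypes)
```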