Tags: python, pandas, parquet

Purpose of "pandas metadata" in Parquet file


If you write a pandas DataFrame to a Parquet file (using the .to_parquet(..) method), pandas adds a block of its own metadata to the Parquet footer. This is documented on the pandas site. The metadata includes things like index names and ranges, column names and datatypes, etc.

My question is: why is this useful for pandas? The column name and datatype information is already in the Parquet schema.

I'm putting some business/domain-specific metadata into the Parquet file, and I suppose the "real question" is whether I should retain the pandas metadata.


Solution

  • According to the doc:

    So that a pandas.DataFrame can be faithfully reconstructed, we store a pandas metadata key in the FileMetaData

    There are some pandas concepts that Parquet doesn't support out of the box, for example the index or the categorical dtype.

    Take this table for example:

    import pandas as pd
    
    df = pd.concat(
        [
            pd.Series([1, 2], name='id'),
            pd.Series(['a', 'b'], name='value')
        ],
        axis=1
    ).set_index('id')
    
    | id | value |
    |---:|:------|
    |  1 | a     |
    |  2 | b     |

    If you save it to parquet and load it back:

    import pyarrow.parquet as pq

    df.to_parquet("hello.parquet")

    # Read the file as an Arrow table, then convert it back to pandas
    table = pq.read_table('hello.parquet')
    table_as_df = table.to_pandas()
    

    You get the same dataframe back:

    | id | value |
    |---:|:------|
    |  1 | a     |
    |  2 | b     |

    And you can check with:

    pd.testing.assert_frame_equal(table_as_df, df)
    

    But if you strip the metadata:

    table_as_df_no_metadata = table.replace_schema_metadata({}).to_pandas()
    

    The index becomes a normal column:

    |    | value   |   id |
    |---:|:--------|-----:|
    |  0 | a       |    1 |
    |  1 | b       |    2 |

    And the dataframes are not the same anymore:

    pd.testing.assert_frame_equal(table_as_df_no_metadata, df)
    

    Throws:

    AssertionError: DataFrame are different
    
    DataFrame shape mismatch
    [left]:  (2, 2)
    [right]: (2, 1)
    

    As for the "real question" of whether you should retain the pandas metadata: it depends. If you only use concepts that map one-to-one between Parquet and pandas (e.g. you don't use an index), then you should be fine stripping the pandas metadata. If not, you may get some bad surprises.