python, pandas, dask, parquet, pyarrow

Why can PyArrow read an additional index column while a pandas DataFrame cannot?


I have the following code:

import pandas as pd
import dask.dataframe as da
from pyarrow.parquet import ParquetFile


df = pd.DataFrame([1, 2, 3], columns=["value"])

my_dataset = da.from_pandas(df, chunksize=3)
save_dir = './local/'
my_dataset.to_parquet(save_dir)


pa = ParquetFile("./local/part.0.parquet")
print(pa.schema.names)

df2 = pd.read_parquet("./local/part.0.parquet")
print(df2.columns)

The output is:

['value', '__null_dask_index__']
Index(['value'], dtype='object')

Just curious: why did the pandas DataFrame ignore the __null_dask_index__ column name? Or is __null_dask_index__ not considered a column?


Solution

  • pandas does read __null_dask_index__, but it (correctly) uses it as the index, so it doesn't show up in the list of columns. To see this clearly, specify a custom index (e.g. 4, 5, 6) and then inspect the head of the df2 dataframe:

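    from pandas import DataFrame, read_parquet
    from dask.dataframe import from_pandas
    from pyarrow.parquet import ParquetFile
    
    
    # custom index so the index values are easy to spot in the output
    df = DataFrame([1, 2, 3], columns=["value"], index=[4, 5, 6])
    
    # chunksize=2 splits the three rows across more than one part file
    my_dataset = from_pandas(df, chunksize=2)
    save_dir = './local/'
    my_dataset.to_parquet(save_dir)
    
    
    # the raw parquet schema lists every stored column, index included
    pa = ParquetFile("./local/part.0.parquet")
    print(pa.schema.names)
    
    # pandas restores the stored index column as the dataframe index
    df2 = read_parquet("./local/part.0.parquet")
    print(df2.head())
    #                      value
    # __null_dask_index__       
    # 4                        1
    # 5                        2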
    
    

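    The parquet files created by dask and pandas (via pyarrow or fastparquet) contain a special metadata area specifying column and index attributes for use by pandas/dask. pandas and dask consult that metadata and restore __null_dask_index__ as the index, whereas ParquetFile.schema.names simply lists every column physically stored in the file and does not apply the metadata by itself.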