pandas, amazon-s3, parquet, pyarrow, apache-arrow

Fastest method for reading Parquet from S3


I have a Parquet file in AWS S3. I would like to read it into a Pandas DataFrame. There are two ways I can accomplish this:

1)
import pyarrow.parquet as pq
table = pq.read_table("s3://tpc-h-parquet/lineitem/part0.snappy.parquet")  # takes 1 sec
pandas_table = table.to_pandas()  # takes 1 sec (!)
2)
import pandas as pd
table = pd.read_parquet("s3://tpc-h-parquet/lineitem/part0.snappy.parquet")  # takes 2 sec

I suspect option 2 is really just doing option 1 under the hood anyway.

What is the fastest way for me to read a Parquet file into Pandas?
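
For reference, timings like the ones above can be reproduced with a minimal harness along these lines (same file, pandas path shown):

import time
import pandas as pd

start = time.perf_counter()
df = pd.read_parquet("s3://tpc-h-parquet/lineitem/part0.snappy.parquet")
print(f"pd.read_parquet: {time.perf_counter() - start:.2f} s")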


Solution

  • You are correct. Option 2 is just option 1 under the hood.

    What is the fastest way for me to read a Parquet file into Pandas?

    Both option 1 and option 2 are probably good enough. However, if you are trying to shave off every last bit of latency, you may need to go one layer deeper, depending on your pyarrow version. It turns out that option 1 is itself also just a proxy, in this case for the datasets API:

    import pyarrow.dataset as ds
    dataset = ds.dataset("s3://tpc-h-parquet/lineitem/part0.snappy.parquet")
    table = dataset.to_table(use_threads=True)
    df = table.to_pandas()
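
    As a side note, if you only need some of the columns or rows, the datasets API can also push that selection down into the scan, which often saves more time on S3 than the choice of scanner. A minimal sketch, assuming hypothetical TPC-H lineitem column names (not part of the question):

    import pyarrow.dataset as ds
    dataset = ds.dataset("s3://tpc-h-parquet/lineitem/part0.snappy.parquet")
    # fetch only the needed columns and rows, so less data is pulled from S3
    table = dataset.to_table(
        columns=["l_orderkey", "l_quantity"],   # hypothetical column selection
        filter=ds.field("l_quantity") > 25,     # hypothetical row filter
    )
    df = table.to_pandas()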
    

    For pyarrow versions >= 4 and < 7, you can usually get slightly better performance against S3 using the asynchronous scanner:

    import pyarrow.dataset as ds
    dataset = ds.dataset("s3://tpc-h-parquet/lineitem/part0.snappy.parquet")
    table = dataset.to_table(use_threads=True, use_async=True)
    df = table.to_pandas()
    

    In pyarrow version 7 the asynchronous scanner is the default, so you can once again simply use pd.read_parquet("s3://tpc-h-parquet/lineitem/part0.snappy.parquet").
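
    A minimal sketch of that pyarrow >= 7 path, explicitly pinning the pyarrow engine (which is the default when pyarrow is installed):

    import pandas as pd

    # with pyarrow >= 7 the asynchronous scanner is used automatically
    df = pd.read_parquet("s3://tpc-h-parquet/lineitem/part0.snappy.parquet", engine="pyarrow")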