python parquet dask pyarrow fastparquet

A comparison between fastparquet and pyarrow?


After some searching I failed to find a thorough comparison of fastparquet and pyarrow.

I found this blog post (a basic comparison of speeds) and a GitHub discussion claiming that files created with fastparquet do not work with AWS Athena (by the way, is that still the case?).

When and why would I use one over the other? What are the major advantages and disadvantages?


My specific use case is processing data with dask, writing it to S3, and then reading and analyzing it with AWS Athena.


Solution

  • In 2024 the decision should be obvious. Use pyarrow instead of fastparquet:

    In our recent parquet benchmarking and resilience testing we generally found the pyarrow engine would scale to larger datasets better than the fastparquet engine, and more test cases would complete successfully when run with pyarrow than with fastparquet.

    The pyarrow library has a larger development team maintaining it and seems to have more community buy-in going forward.
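For the use case in the question (dask writing Parquet to S3 for Athena), the recommendation above might look roughly like the sketch below. The helper names `pick_parquet_engine` and `write_for_athena` are hypothetical and only for illustration, and the snappy compression and `write_index=False` settings are assumptions rather than anything the quoted benchmark prescribes. The only library call used is dask's `DataFrame.to_parquet` with its `engine`, `compression`, and `write_index` arguments.

```python
from importlib.util import find_spec


def pick_parquet_engine(preferred=("pyarrow", "fastparquet"), is_installed=None):
    """Return the first engine name in `preferred` that can be imported.

    Hypothetical helper: pyarrow is listed first, following the advice above.
    `is_installed` can be overridden, which is mainly useful for testing.
    """
    if is_installed is None:
        def is_installed(name):
            # find_spec returns None when a top-level package is missing.
            return find_spec(name) is not None
    for name in preferred:
        if is_installed(name):
            return name
    raise ImportError(f"none of {preferred} is installed")


def write_for_athena(ddf, s3_path, engine=None):
    """Write a dask DataFrame as Parquet under `s3_path` and return the engine used.

    Assumptions: `s3_path` looks like "s3://bucket/prefix/", the s3fs package
    is installed, and AWS credentials are available from the environment.
    """
    engine = engine or pick_parquet_engine()
    ddf.to_parquet(
        s3_path,
        engine=engine,          # "pyarrow" unless only fastparquet is present
        compression="snappy",   # assumed choice; a common Parquet codec
        write_index=False,      # assumed choice; skip the index column
    )
    return engine
```

With a dask DataFrame `ddf` in hand, a call such as `write_for_athena(ddf, "s3://my-bucket/my-table/")` (bucket name made up here) would write one Parquet file per partition under that prefix. Recent dask releases may no longer accept the fastparquet engine at all, in which case passing `engine="pyarrow"` explicitly is the simpler option.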