I have a lot of small Parquet files that are read via AWS Glue into Athena. I know that small Parquet files (roughly 35 KB each, due to the way the log outputs them) are not ideal, but once they are registered in the Data Catalog, does it matter anymore?
In other words, should I go through the exercise of merging all the small Parquet files into more appropriately sized files before loading them into Athena?
You continue to pay a price for small files even after they've been registered with the Data Catalog. When you query a table backed by many small files, Athena has to open, read, and stream every one of them to answer your query. Although the total amount of data scanned may be comparable, doing it across chunkier files means far less per-file overhead for the underlying query engine (Presto).
Reference: https://docs.aws.amazon.com/athena/latest/ug/performance-tuning.html - note that it also warns S3 request throttling can bite you when you have lots of small files.
Also, in the case of Parquet, each file carries footer metadata (such as min/max statistics per row group) that the query engine can use to skip entire row groups, or to jump to the right spots within a file. I believe the effectiveness of those statistics is reduced when the data is spread across many tiny files, since the engine still has to open and read every file's footer before it can skip anything.
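As a rough illustration (the table and column names here are hypothetical), a range filter like the one below is where those statistics help: with a few large files the engine can prune most row groups by their min/max bounds, whereas with thousands of tiny files it pays the footer-reading overhead on every single one.

```sql
-- Hypothetical table/column names. With chunkier files, the engine can
-- use the min/max statistics in each Parquet row group to skip row
-- groups whose event_time range falls entirely outside the filter.
SELECT request_id, status_code
FROM access_logs
WHERE event_time
      BETWEEN timestamp '2023-01-01 00:00:00'
          AND timestamp '2023-01-02 00:00:00';
```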
It's easy enough to convert the small files into chunkier ones via a CTAS statement, so I'd recommend doing it. Anecdotally, my queries run noticeably faster against the batched files.
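A minimal sketch of such a CTAS statement (the table names, S3 location, bucketing column, and bucket count are placeholders to adapt to your data):

```sql
-- Rewrite the many small files into a handful of larger Parquet files.
-- bucket_count controls roughly how many output files get written,
-- so pick a value that yields files in the ~100 MB+ range.
CREATE TABLE logs_compacted
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/logs-compacted/',
    bucketed_by = ARRAY['request_id'],
    bucket_count = 8
) AS
SELECT *
FROM logs_small_files;
```

Bucketing is just one way to control the output file count; if your table is partitioned you can instead use `partitioned_by` and compact one partition at a time.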