amazon-web-services, amazon-s3, aws-lambda, amazon-athena, amazon-dynamodb-streams

Athena performance on too many S3 files


I am planning to store data in S3, on top of which SQL queries would later be executed. The S3 files would basically contain JSON records. I would be getting these records through DynamoDB Streams triggering AWS Lambda executions, so it's difficult to handle duplication at that layer, as AWS Lambda guarantees at-least-once delivery. To avoid having to handle duplicate records in queries, I would like to ensure that the records being inserted are unique.

As far as I know, the only way to achieve uniqueness is to use a unique S3 key. If I were to opt for this approach, I would end up creating a couple of million S3 files per day, each consisting of a single JSON record.

Would creating so many files be a concern when executing Athena queries? Are there any alternative approaches?


Solution

  • I think you would be better off handling the deduplication in Athena itself; weeding out a few duplicates is an easy job for Athena. Set up a view that groups by the unique property and uses ARBITRARY, or MAX_BY if you have something to order by to pick the latest version, for the non-unique properties, and run your queries against this view so that you don't have to worry about deduplication in each individual query (a sketch of such a view follows at the end of this answer).

    You could also run a daily or weekly deduplication job using CTAS, depending on how fresh the data has to be (you can also build more complex hybrids where pre-deduplicated historical data is unioned with on-the-fly-deduplicated recent data); a CTAS sketch is included below as well.

    When running a query, Athena lists the objects on S3, and this is not a parallelizable operation (except for partitioned tables, where it is parallelizable down to the grain of the partitioning), and S3 listings are limited to a page size of 1,000 objects. You really don't want to run Athena queries against tables (or partitions) with more than 1,000 files; a sketch of a partitioned table definition is included below too.
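
    As a minimal sketch of such a view, assuming a raw table named events_raw with a unique event_id, a payload column, and an updated_at column to order by (all of these names are assumptions, not something from the question):

        CREATE OR REPLACE VIEW events_deduplicated AS
        SELECT
          event_id,
          MAX_BY(payload, updated_at) AS payload,   -- keep the latest payload per event_id
          MAX(updated_at) AS updated_at
        FROM events_raw
        GROUP BY event_id

    Queries then run against events_deduplicated instead of events_raw; if there is nothing to order by, ARBITRARY(payload) can be used instead of MAX_BY.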
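
    A periodic deduplication job could be a CTAS statement along these lines (again just a sketch; the output table name, bucket path, and Parquet output format are assumptions):

        CREATE TABLE events_deduplicated_snapshot
        WITH (
          format = 'PARQUET',
          external_location = 's3://example-bucket/deduplicated/'   -- must point to an empty location
        ) AS
        SELECT
          event_id,
          MAX_BY(payload, updated_at) AS payload,
          MAX(updated_at) AS updated_at
        FROM events_raw
        GROUP BY event_id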
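
    To keep the number of files each query has to list small, the raw table itself can be partitioned, for example by day. A sketch with assumed names and an assumed bucket, using the OpenX JSON SerDe:

        CREATE EXTERNAL TABLE events_raw (
          event_id   string,
          payload    string,
          updated_at string   -- ISO-8601 timestamp string, so lexicographic order works
        )
        PARTITIONED BY (dt string)
        ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
        LOCATION 's3://example-bucket/events/'

    Each day's partition would then be registered with ALTER TABLE events_raw ADD PARTITION, and a query that filters on dt only lists the objects under the partitions it touches.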