amazon-web-services amazon-s3 boto boto3 aws-cli

s3 - how to get fast line count of file? wc -l is too slow


Does anyone have a quick way of getting the line count of a file hosted in S3? Preferably using the CLI or s3api, but I am open to Python/boto as well. Note: the solution must run non-interactively, i.e. in an overnight batch.

Right now I am doing this. It works, but takes around 10 minutes for a 20GB file:

 aws s3 cp s3://foo/bar - | wc -l

Solution

  • Here are two methods that might work for you...

    Amazon S3 has a new feature called S3 Select that allows you to query files stored on S3.

    You can perform a count of the number of records (lines) in a file and it can even work on GZIP files. Results may vary depending upon your file format.

    S3 Select
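As a rough sketch of the S3 Select approach in Python/boto3: the helper below asks S3 to run `SELECT COUNT(*)` server-side, so only the count comes back over the network. The function name `count_records` is made up for illustration, and the sketch assumes the object is line-oriented text that S3 Select can read as CSV without a header row, and that S3 Select is available on your account. Pass `compression="GZIP"` for gzipped objects.

```python
def count_records(s3_client, bucket, key, compression="NONE"):
    """Return the record count that S3 Select reports for one object.

    s3_client is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    The CSV/no-header input settings are assumptions about the file format.
    """
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression="SELECT COUNT(*) FROM S3Object s",
        InputSerialization={
            "CSV": {"FileHeaderInfo": "NONE"},
            "CompressionType": compression,
        },
        OutputSerialization={"CSV": {}},
    )

    # The response payload is an event stream; the result bytes arrive in
    # one or more "Records" events, mixed with "Stats" and "End" events.
    chunks = []
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])

    return int(b"".join(chunks).decode("utf-8").strip())
```

From a batch job this could be called as `count_records(boto3.client("s3"), "foo", "bar")`. Keep in mind that this counts records as S3 Select parses them, which is not always identical to what `wc -l` reports.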

    Amazon Athena is also a similar option that might be suitable. It can query files stored in Amazon S3.
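For the Athena route, a minimal non-interactive sketch might look like the following. It assumes you have already defined an Athena table over the S3 location; the function name `athena_count` and its `table`, `database`, and `output_location` parameters are placeholders for illustration, not anything from the question. Athena queries run asynchronously, so the code polls until the query finishes.

```python
import time


def athena_count(athena_client, table, database, output_location, poll_seconds=2.0):
    """Run SELECT COUNT(*) against an existing Athena table and return the count.

    athena_client is expected to be a boto3 Athena client, e.g.
    boto3.client("athena"). output_location is an s3:// URI where Athena
    may write its result files.
    """
    started = athena_client.start_query_execution(
        QueryString=f'SELECT COUNT(*) FROM "{table}"',
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)

    results = athena_client.get_query_results(QueryExecutionId=query_id)
    rows = results["ResultSet"]["Rows"]
    # For SELECT queries the first row holds the column headers,
    # so the count is in the second row.
    return int(rows[1]["Data"][0]["VarCharValue"])
```

Athena has more setup than S3 Select (a table definition and a results bucket), so it probably makes more sense when you have many files under one prefix than for a single object.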