amazon-web-services amazon-s3 apache-drill

Apache Drill unusably slow with S3 data source?


I am trying to use Apache Drill with an S3 bucket, but it is incredibly slow.

I have about 20,000 JSON files. I can get results from them locally in a few seconds, e.g.:

> select count(*) from dfs.`/path/to/my/files/*.json`;

returns after less than 2 seconds.

Running the exact same query on the exact same files in an S3 bucket does not complete even after 10 minutes:

> select count(*) from s3.`releases`;

Why is this? I thought the whole point of Drill was that it was fast on big datasets.

My S3 connection itself is OK, e.g. SHOW FILES lists my available folders in a reasonable amount of time, and there's nothing wrong with my network connection either.


Solution

  • It's not a direct answer to your question, but you should look at Athena if you want to query an S3 bucket and you have a large dataset.
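
If you want to try the Athena route, here is a rough sketch of the same count(*) query issued through boto3's Athena client. It assumes you have already defined an Athena table over the JSON files in the bucket (for example with CREATE EXTERNAL TABLE or a Glue crawler). The function name run_athena_count and its parameters (database, table, output_location, poll_seconds) are made up for illustration, and the database name and results location in the usage note are placeholders, not values from the question.

```python
import time


def run_athena_count(client, database, table, output_location, poll_seconds=1.0):
    """Run SELECT count(*) on an Athena table and return the count as an int.

    `client` is expected to be a boto3 Athena client, i.e. boto3.client("athena").
    All argument values are assumptions supplied by the caller.
    """
    started = client.start_query_execution(
        QueryString=f'SELECT count(*) FROM "{table}"',
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    rows = client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    # Row 0 is the column header; row 1 holds the single count value.
    return int(rows[1]["Data"][0]["VarCharValue"])
```

You would call it along the lines of run_athena_count(boto3.client("athena"), "my_database", "releases", "s3://my-results-bucket/athena/"), with your own database name and a results location you can write to. Whether this ends up faster than Drill for your 20,000 files is something you would need to measure.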