Search code examples
amazon-web-servicesamazon-s3apache-drill

Apache Drill: How to query all files in an S3 bucket?


I am using Apache Drill 1.10 on OSX. I have an S3 bucket with about 150 files in, and I have set up a connection between the two as per the Drill documentation.

I can see all the files in the bucket, from Drill:

jdbc:drill:zk=local> USE `s3`.`root`;
+-------+--------------------------------------+
|  ok   |               summary                |
+-------+--------------------------------------+
| true  | Default schema changed to [s3.root]  |
+-------+--------------------------------------+
1 row selected (1.123 seconds)

jdbc:drill:zk=local> SHOW files; 
<shows table of 15 files>

And I can query individual files successfully:

jdbc:drill:zk=local> select * from s3.`ocds-b5fd17-00ec7c92-54f3-4c50-8214-8b8c0cf9ff09-140281-qc54303.json`;
<returns results>

But how do I query across the whole directory? On the local filesystem I'd just supply a wildcard like dfs./path/*.json, but that doesn't seem to work with S3:

jdbc:drill:zk=local> select * from s3.`*.json`;
Error: VALIDATION ERROR: Can not create a Path from an empty string
SQL Query null

Solution

  • I figured this out, recording it here for the benefit of anyone else with the same problem.

    Create a directory in the root of the bucket, called e.g. releases, and put the files in there.

    Then you can just query all files in the directory with:

    select * from s3.`releases`;