Search code examples
jsongoogle-bigqueryapache-drillbigdata

Run SQL-like queries across large JSON datasets, in the cloud?


Is there a product that meets these requirements:

  • running SQL-like queries (preferably with a friendly UI, though can use command line if necessary)
  • across largely static JSON datasets of several million documents, without a consistent schema
  • with the JSON documents stored in the cloud, not locally.

Basically, I would like Google BigQuery, but without the requirement for a consistent pre-specified schema.

Apache Drill offers SQL-like queries across large JSON datasets with different schemas. However, I don't think it works well on cloud-hosted datasets - have set it up to talk to an S3 bucket, and it seems to need to download all the data locally, which makes it unusable with large datasets.


Solution

  • BigQuery requires a static schema, but one option is to define the tables as CSV with a separator that doesn't appear in the data, meaning you'd have a single string column. You'll be limited to the JSON_EXTRACT functions for pulling data from the JSON, but from there you can cast to other data types and so on.

    You could also consider filing a feature request for the BigQuery team to consider supporting a document type to make working with this kind of data easier in the future.