Tags: json, amazon-s3, amazon-dynamodb, distributed-computing, bigdata

How can I export data from DynamoDB to S3 and make faster ad hoc queries (instead of the current 40-minute full-table-scan Spark queries)?


I currently work on a big data team at a company, and I need to export data from DynamoDB to Amazon S3. When I export the data and use Spark to query the extracted semi-structured JSON, an ad hoc query takes 40 minutes because it does a full table scan. I read about Apache Drill and its ability to run queries in seconds on unstructured data. Should I proceed with Apache Drill, or should I flatten the JSON and store it as a Hive ORC table (10 thousand columns)? In other words, I need to run queries without doing a full table scan.
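For reference, the current query path looks roughly like the sketch below (bucket names, paths, and fields are illustrative, not the real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc-json-scan").getOrCreate()

# Read the raw DynamoDB export (semi-structured JSON) straight from S3.
# Spark has to scan and parse every JSON object for every query, which is
# why a single ad hoc query ends up taking ~40 minutes.
events = spark.read.json("s3a://my-export-bucket/dynamodb-export/")

events.createOrReplaceTempView("events")
spark.sql("""
    SELECT user_id, COUNT(*) AS purchases
    FROM events
    WHERE event_type = 'purchase'
    GROUP BY user_id
""").show()
```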


Solution

Well, if you plan to work with Apache Drill, it is a good choice when you do not want to change the format of your data. Drill will still do a table scan, which means heavy I/O against S3 if you keep the data as JSON, but it will certainly be faster than Spark for these ad hoc queries.
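For example, with a Drill cluster pointed at the bucket you can query the JSON in place. The sketch below assumes an S3 storage plugin named `s3data` and uses the PyDrill REST client; plugin name, paths, and fields are placeholders:

```python
from pydrill.client import PyDrill

# Connect to the Drill REST API (drillbit assumed to run on localhost).
drill = PyDrill(host='localhost', port=8047)

# Drill infers the JSON schema on the fly, so no table definition is needed,
# but it still scans every JSON file under the path.
result = drill.query("""
    SELECT t.user_id, t.event_type, t.amount
    FROM s3data.`dynamodb-export/2024/*.json` t
    WHERE t.event_type = 'purchase'
    LIMIT 100
""")

for row in result:
    print(row)
```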

That said, the Drill documentation recommends Parquet for faster SQL queries, because the columnar format reduces I/O. Ten thousand columns will not be a big problem; Drill can flatten the data as well.
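If you go the Parquet route, a one-off conversion job is enough. Since you already run Spark, a minimal sketch of rewriting the JSON export as Parquet (Drill's CREATE TABLE AS would achieve the same thing) could look like this; the paths and the partition column are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()

raw = spark.read.json("s3a://my-export-bucket/dynamodb-export/")

# Write a columnar copy. Partitioning by a frequently filtered column
# (here: event_date) lets later queries skip whole directories instead
# of doing a full table scan.
(raw.write
    .mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3a://my-export-bucket/parquet/events/"))
```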

I really suggest you flatten your data into ORC. That lets you compress it, and in that format you can query it very quickly with Presto or AWS Athena. The advantage of ORC or Parquet, both columnar file formats, is that reads only touch the columns you need, and the schema lives in the metastore: you define it once and you are done.
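As a sketch of that path: write ORC with Spark, declare the table once, then query it from Athena or Presto. The DDL below only lists a couple of columns to keep it short; the bucket, database, region, and columns are all placeholders:

```python
import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-orc").getOrCreate()

# Flatten/convert the JSON export to compressed ORC on S3.
raw = spark.read.json("s3a://my-export-bucket/dynamodb-export/")
raw.write.mode("overwrite").option("compression", "zlib") \
    .orc("s3a://my-export-bucket/orc/events/")

# Declare the schema once so Athena/Presto can query the ORC files directly.
athena = boto3.client("athena", region_name="us-east-1")
athena.start_query_execution(
    QueryString="""
        CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events (
            user_id string,
            event_type string,
            amount double
        )
        STORED AS ORC
        LOCATION 's3://my-export-bucket/orc/events/'
    """,
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://my-export-bucket/athena-results/"},
)
```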

The big issue with that approach is the overhead of building the schema. With 10 thousand columns, that will be a huge amount of work for you.
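One way to soften that overhead is to let Spark infer the schema once from a sample of the JSON, save it, and reuse it for the full conversion rather than hand-writing thousands of column definitions. A rough sketch, with placeholder paths:

```python
import json
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("schema-once").getOrCreate()

# Infer the schema from a small sample instead of the whole export.
sample = spark.read.json("s3a://my-export-bucket/dynamodb-export/part-00000.json")
schema_json = sample.schema.json()

# Persist it (here just to a local file) so later jobs can reuse it.
with open("events_schema.json", "w") as f:
    f.write(schema_json)

# Reuse: load the saved schema and read the full export with it,
# skipping schema inference over the entire dataset.
schema = StructType.fromJson(json.loads(schema_json))
full = spark.read.schema(schema).json("s3a://my-export-bucket/dynamodb-export/")
```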

So, make your choice. Apache Drill lets you infer the schema from your JSON, which removes the overhead of defining the schema, and it will probably be faster than Spark for queries. But it will not be as fast as converting the files to ORC or Parquet, and JSON is not as compact, so you will store more data and read more data, which means more money spent on AWS. Using ORC or Parquet will be faster, more compact, and cheaper, but building the schema will take a lot of time.