amazon-s3 parquet orc amazon-aurora amazon-athena

Easiest way to migrate data from Aurora to S3 in Apache ORC or Apache Parquet


Athena looks nice.

To use it at our scale, we need to make it cheaper and faster, which means storing our data in ORC or Parquet format.

What is the absolute easiest way to migrate an entire Aurora database to S3, transforming it into one of those formats?

DMS and Data Pipeline seem to get you there minus the transformation step...


Solution

  • The transform step can be done with Python; here is a sample: https://github.com/awslabs/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion

    See also this article on partitioning data in Athena: http://docs.aws.amazon.com/athena/latest/ug/partitions.html

    I would try DMS to create the initial data in S3 and then run the Python conversion above on it.
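
As a rough illustration of the transform step, here is a minimal sketch that converts one exported CSV file to Parquet and builds a Hive-style partitioned key of the kind the Athena partitioning article describes. It is not the linked sample, which uses Spark; this version swaps in the pyarrow library for a single-machine case. It assumes the export is header-less CSV, and the function names, column names, and paths are hypothetical.

```python
from pathlib import Path


def partition_key(table, partitions, filename):
    """Build a Hive-style key such as table/year=2017/month=03/file.parquet."""
    parts = [f"{name}={value}" for name, value in partitions.items()]
    return "/".join([table, *parts, filename])


def csv_to_parquet(csv_path, parquet_path, column_names):
    """Convert one header-less CSV file to a Snappy-compressed Parquet file.

    Assumption: the CSV has no header row, so column names are passed in.
    Returns the number of rows written.
    """
    # Imported lazily so partition_key() stays usable without pyarrow installed.
    import pyarrow.csv as pv
    import pyarrow.parquet as pq

    read_options = pv.ReadOptions(column_names=list(column_names))
    table = pv.read_csv(str(csv_path), read_options=read_options)

    # Create the local output directory, including the partition folders.
    Path(parquet_path).parent.mkdir(parents=True, exist_ok=True)
    pq.write_table(table, str(parquet_path), compression="snappy")
    return table.num_rows
```

A call might look like `csv_to_parquet("export/orders.csv", "out/" + partition_key("orders", {"year": "2017", "month": "03"}, "part-0000.parquet"), ["id", "customer_id", "total"])`, after which the resulting file can be uploaded to S3 under the same key. For tables too large for one machine, a Spark job like the linked sample is likely the better fit.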