Tags: amazon-s3, aws-glue, amazon-athena, amazon-kinesis-firehose

DynamoDB data to S3 in Kinesis Firehose output format


Kinesis Data Firehose writes files into separate partitions of the S3 bucket using a default prefix format that looks like: s3://bucket/prefix/yyyy/MM/dd/HH/file.extension

I have created event streams to dump data from DynamoDB to S3 using Firehose. There is a transformation Lambda in between which converts the DynamoDB records into TSV (tab-separated) format.
All of this was added to an existing table that already contains a huge amount of data. I need to backfill the existing DynamoDB data into the S3 bucket while keeping the format consistent with the existing Firehose output.
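For context, the transformation Lambda is roughly of this shape (a simplified sketch; the column names and the exact shape of the stream record are placeholders, not my real schema):

```python
import base64
import json

# Columns emitted to the TSV, in order; these names are placeholders.
COLUMNS = ["pk", "sk", "updated_at", "payload"]

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        # The new item arrives in DynamoDB JSON, e.g. {"pk": {"S": "abc"}, ...}
        image = payload.get("dynamodb", {}).get("NewImage", {})
        values = [next(iter(image.get(col, {}).values()), "") for col in COLUMNS]
        tsv_line = "\t".join(str(v) for v in values) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(tsv_line.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```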

Solution I tried:
Step 1: Export the table to S3 using the DynamoDB Export to S3 feature, then use a Glue crawler to create a Data Catalog table.
Step 2: Use Athena's CREATE TABLE AS SELECT (CTAS) query to imitate the transformation done by the intermediate Lambda and store the output in an S3 location.
Step 3: However, Athena CTAS applies a default compression that cannot be turned off. So I wrote a Glue job that reads from the previous table and writes to another S3 location. This job also takes care of partitioning the output by year/month/day/hour, as Firehose does, and writes the uncompressed tab-separated files. A sketch of this job is shown after these steps.
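For reference, the Glue job from Step 3 was along these lines (a sketch, assuming the crawled export lives in a ddb_export_db.ddb_export_table catalog table with an updated_at timestamp column; all of these names are placeholders):

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled DynamoDB export from the Data Catalog.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="ddb_export_db", table_name="ddb_export_table"
)
df = dyf.toDF()

# Derive year/month/day/hour partition columns from a timestamp column.
df = (
    df.withColumn("ts", F.to_timestamp("updated_at"))
      .withColumn("year", F.date_format("ts", "yyyy"))
      .withColumn("month", F.date_format("ts", "MM"))
      .withColumn("day", F.date_format("ts", "dd"))
      .withColumn("hour", F.date_format("ts", "HH"))
      .drop("ts")
)

# Write uncompressed tab-separated files. Note that Spark/Glue emits
# Hive-style partition folders (year=2021/month=02/...), which is exactly
# the problem described below.
(
    df.write.mode("overwrite")
      .partitionBy("year", "month", "day", "hour")
      .option("sep", "\t")
      .option("compression", "none")
      .csv("s3://my-backfill-bucket/hive-style-output/")
)

job.commit()
```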

However, the problem is that Glue creates Hive-style partitions, which look like s3://bucket/prefix/year=2021/month=02/day=02/, while I need to match the Firehose block-style partitions (yyyy/MM/dd/HH, without the key=value prefixes). I am looking for an approach to achieve this; I couldn't find a way to write block-style partitions with Glue. The other approach I have is to use the AWS CLI s3 mv command to move all the data into separate folders with the correct file names, which is neither clean nor efficient.


Solution

  • Leaving the solution I ended up implementing here in case it helps anyone.

    I created a Lambda and added an S3 event trigger on this bucket. The Lambda does the job of moving each file from the Hive-style partitioned S3 folder to the correctly structured block-style S3 folder.

    The Lambda uses the copy and delete operations from the boto3 S3 client to do this. It worked like a charm even though I had more than 10^6 output files split across different partitions. A sketch of the handler is below.
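A rough sketch of that handler (the key pattern assumes the year=/month=/day=/hour= layout produced by the Glue job; bucket names and prefixes are placeholders):

```python
import re
import urllib.parse

import boto3

s3 = boto3.client("s3")

# Matches Hive-style keys like prefix/year=2021/month=02/day=02/hour=13/part-00000.tsv
HIVE_KEY = re.compile(
    r"^(?P<prefix>.*?)year=(?P<year>\d{4})/month=(?P<month>\d{2})/"
    r"day=(?P<day>\d{2})/hour=(?P<hour>\d{2})/(?P<file>[^/]+)$"
)

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        match = HIVE_KEY.match(key)
        if not match:
            continue  # not a Hive-style partitioned key, leave it alone

        # Rebuild the key in the Firehose-style yyyy/MM/dd/HH layout.
        new_key = "{prefix}{year}/{month}/{day}/{hour}/{file}".format(**match.groupdict())

        # copy_object handles objects up to 5 GB; larger files would need a
        # multipart copy (e.g. boto3's managed s3.copy helper).
        s3.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        s3.delete_object(Bucket=bucket, Key=key)
```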