I have a Kinesis Data Firehose stream into which JSON records of different schemas are produced. Since this data should eventually be accessed by other tools that rely on a schema (Glue, Athena), I want to separate the records by schema into different prefixes in an S3 bucket.
I don't want to use different streams for different schemas.
So, for example, if the following JSON records were sent into the stream:
{"a": 1, "b": 2} # JSON 1
{"a": 8, "b": 5} # JSON 2
{"c": 9} # JSON 3
I would like them to eventually be stored in the S3 bucket as follows:
/mybucket/YYYY/MM/DD/HH/schema1/json1.json # JSON 1
/mybucket/YYYY/MM/DD/HH/schema1/json2.json # JSON 2
/mybucket/YYYY/MM/DD/HH/schema2/json3.json # JSON 3
I do know all the possible schemas in advance.
How should I go about that?
AWS released Kinesis Data Firehose Dynamic Partitioning in September 2021. This feature can extract one or more keys/values from each JSON record and use them to compose the S3 prefix (partition).
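Since you know all the possible schemas in advance, you can have Firehose run a JQ expression against each record and derive a partition key from which fields are present. Here is a minimal boto3 sketch; the stream name, the role and bucket ARNs, and the rule "records containing key c belong to schema2" are assumptions for illustration, so adapt them to your actual schemas:

```python
import boto3

firehose = boto3.client("firehose")

# JQ expression run by Firehose's metadata-extraction processor (JQ 1.6).
# It emits a "schema" partition key based on which keys the record contains;
# the has("c") rule is a placeholder for however you distinguish your schemas.
jq_query = '{schema: (if has("c") then "schema2" else "schema1" end)}'

firehose.create_delivery_stream(
    DeliveryStreamName="my-delivery-stream",  # placeholder name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::mybucket",
        # !{timestamp:...} reproduces the YYYY/MM/DD/HH layout from the question;
        # !{partitionKeyFromQuery:schema} refers to the key produced by the JQ query.
        "Prefix": "!{timestamp:yyyy/MM/dd/HH}/!{partitionKeyFromQuery:schema}/",
        # An error output prefix is required when dynamic partitioning is enabled.
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
        # Dynamic partitioning requires a buffer size of at least 64 MB.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "DynamicPartitioningConfiguration": {"Enabled": True},
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [
                {
                    "Type": "MetadataExtraction",
                    "ParameterList": [
                        {"ParameterName": "MetadataExtractionQuery",
                         "ParameterValue": jq_query},
                        {"ParameterName": "JsonParsingEngine",
                         "ParameterValue": "JQ-1.6"},
                    ],
                }
            ],
        },
    },
)
```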
You can take a look at this resource:
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
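With that configuration, records land under prefixes like YYYY/MM/DD/HH/schema1/ and YYYY/MM/DD/HH/schema2/. Note that Firehose generates the object names itself, so you will not get literal names like json1.json. A quick way to test it (again assuming the hypothetical stream name above) is to put the three records from the question:

```python
import json
import boto3

firehose = boto3.client("firehose")

# The three records from the question; Firehose evaluates the JQ query
# against each one and routes it to the matching schema prefix.
for record in [{"a": 1, "b": 2}, {"a": 8, "b": 5}, {"c": 9}]:
    firehose.put_record(
        DeliveryStreamName="my-delivery-stream",  # assumed name from the sketch above
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
```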