I have a Kinesis Data Firehose stream into which JSON records of different schemas are produced. Since this data should eventually be accessed by other tools that rely on a schema (Glue, Athena), I want to separate the records by schema into different prefixes in an S3 bucket.
I don't want to use different streams for different schemas.
So, for example, if the following JSON records were sent into the stream:
{"a": 1, "b": 2} # JSON 1
{"a": 8, "b": 5} # JSON 2
{"c": 9} # JSON 3
I would like them to eventually be stored in the S3 bucket as follows:
/mybucket/YYYY/MM/DD/HH/schema1/json1.json # JSON 1
/mybucket/YYYY/MM/DD/HH/schema1/json2.json # JSON 2
/mybucket/YYYY/MM/DD/HH/schema2/json3.json # JSON 3
I do know all the possible schemas in advance.
How should I go about that?
AWS released Kinesis Data Firehose Dynamic Partitioning in September 2021. This feature can extract one or more keys/values from each JSON record and use them to compose the S3 prefix (partition).
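Since you know all the possible schemas in advance, you can have Firehose run a JQ expression against each record and derive a partition key from which fields are present. Here is a minimal boto3 sketch; the stream name, the role and bucket ARNs, and the rule "records containing key c belong to schema2" are assumptions for illustration, so adapt them to your actual schemas:

```python
import boto3

firehose = boto3.client("firehose")

# JQ expression run by Firehose's metadata-extraction processor (JQ 1.6).
# It emits a "schema" partition key based on which keys the record contains;
# the has("c") rule is a placeholder for however you distinguish your schemas.
jq_query = '{schema: (if has("c") then "schema2" else "schema1" end)}'

firehose.create_delivery_stream(
    DeliveryStreamName="my-delivery-stream",  # placeholder name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::mybucket",
        # !{timestamp:...} reproduces the YYYY/MM/DD/HH layout from the question;
        # !{partitionKeyFromQuery:schema} refers to the key produced by the JQ query.
        "Prefix": "!{timestamp:yyyy/MM/dd/HH}/!{partitionKeyFromQuery:schema}/",
        # An error output prefix is required when dynamic partitioning is enabled.
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
        # Dynamic partitioning requires a buffer size of at least 64 MB.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "DynamicPartitioningConfiguration": {"Enabled": True},
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [
                {
                    "Type": "MetadataExtraction",
                    "ParameterList": [
                        {"ParameterName": "MetadataExtractionQuery",
                         "ParameterValue": jq_query},
                        {"ParameterName": "JsonParsingEngine",
                         "ParameterValue": "JQ-1.6"},
                    ],
                }
            ],
        },
    },
)
```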
You can take a look at this resource:
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
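With that configuration, records land under prefixes like YYYY/MM/DD/HH/schema1/ and YYYY/MM/DD/HH/schema2/. Note that Firehose generates the object names itself, so you will not get literal names like json1.json. A quick way to test it (again assuming the hypothetical stream name above) is to put the three records from the question:

```python
import json
import boto3

firehose = boto3.client("firehose")

# The three records from the question; Firehose evaluates the JQ query
# against each one and routes it to the matching schema prefix.
for record in [{"a": 1, "b": 2}, {"a": 8, "b": 5}, {"c": 9}]:
    firehose.put_record(
        DeliveryStreamName="my-delivery-stream",  # assumed name from the sketch above
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )
```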