amazon-web-services aws-glue amazon-kinesis-firehose

How to define an AWS Glue schema for JSON sent from the Python SDK to Firehose?


I have this setup in mind:

Python SDK sends predefined JSON -> AWS Kinesis Firehose -> convert the data to Parquet using an AWS Glue schema -> save the data to S3 (whether the conversion succeeds or not).
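For reference, the sending side of that setup can be sketched roughly as below. This is a minimal sketch, not my exact code: the payload fields `owner` and `tags`, the helper names `build_record` and `send`, and the stream name in the usage note are made up for illustration. The only SDK call assumed is boto3's Firehose `put_record`.

```python
import json


def build_record(payload: dict) -> dict:
    """Serialize one JSON object into the Record shape that put_record takes."""
    data = json.dumps(payload, separators=(",", ":"))
    return {"Data": data.encode("utf-8")}


def send(client, stream_name: str, payload: dict) -> dict:
    # `client` is expected to be boto3.client("firehose"); passing it in
    # keeps this function usable without AWS credentials during testing.
    return client.put_record(
        DeliveryStreamName=stream_name,
        Record=build_record(payload),
    )


# Hypothetical payload mixing primitives with a struct and an array.
example = {
    "name": "abc",
    "id": 1,
    "is_bla": True,
    "owner": {"name": "x", "id": 2, "is_bla": False},
    "tags": ["a", "b"],
}
```

With boto3 installed and credentials configured, this would be called as `send(boto3.client("firehose"), "my-stream", example)`, where `my-stream` is a placeholder name.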

While sending primitive types such as strings, ints, and booleans is easy, sending an array or struct isn't trivial at all. I keep getting odd error messages like:

The schema is invalid. Error parsing the schema: Error: type expected at the position 0 of 'STRUCT<name:STRING,id:BIGINT,is_bla:BOOLEAN>' but 'STRUCT' is found.

OR

The schema is invalid. Error parsing the schema: Error: type expected at the position 0 of 'ARRAY' but 'ARRAY' is found.

  1. Why am I getting these error messages?
  2. Is there proper documentation, or are there examples, for schema data types? I could only find this, which says the Column Type should match the "Single-line string pattern".

Solution

  • I'll answer my own question:

    There is some delay between saving the Glue schema and it taking effect for data sent to Firehose. The updated JSON I sent was still processed against the old schema, hence the errors.

    Also, from this and that, we have to validate some naming conventions ourselves. It's quite unfortunate that AWS doesn't do this when the schema is created.
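    A rough sketch of such a client-side check, run before saving the schema, is below. Which conventions actually apply is not spelled out above, so the rules here are assumptions: that type names in Hive-style type strings should be lowercase (for example `struct<...>` rather than `STRUCT<...>`), that column names should be lowercase with underscores, and that angle brackets must balance. The helper names `normalize_type` and `check_columns` are hypothetical.

```python
import re

# Assumption: column names limited to lowercase letters, digits, underscores.
_NAME_RE = re.compile(r"^[a-z_][a-z0-9_]*$")


def normalize_type(type_string: str) -> str:
    """Strip whitespace from a Hive-style type string and lowercase it."""
    return re.sub(r"\s+", "", type_string).lower()


def check_columns(columns: dict) -> list:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for name, type_string in columns.items():
        if not _NAME_RE.match(name):
            problems.append(f"column name {name!r} is not lowercase snake_case")
        normalized = normalize_type(type_string)
        if type_string != normalized:
            problems.append(f"type of {name!r} should be written {normalized!r}")
        if type_string.count("<") != type_string.count(">"):
            problems.append(f"type of {name!r} has unbalanced angle brackets")
    return problems
```

    Calling `check_columns({"owner": "STRUCT<name:STRING,id:BIGINT,is_bla:BOOLEAN>"})` would flag the uppercase type string and suggest the lowercase form. It only checks the conventions assumed here, not everything Glue or Firehose may reject.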