Search code examples
jsonrubyregexamazon-s3amazon-kinesis-firehose

How to read invalid JSON format amazon firehose


I've got this most horrible scenario in where i want to read the files that kinesis firehose creates on our S3.

Kinesis firehose creates files that don't have every json object on a new line, but simply a json object concatenated file.

{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}{"param1":"value1","param2":numericvalue2,"param3":"nested {bracket}"}

Now is this a scenario not supported by normal JSON.parse and i have tried working with following regex: .scan(/({((\".?\":.?)*?)})/)

But the scan only works in scenario's without nested brackets it seems.

Does anybody know an working/better/more elegant way to solve this problem?


Solution

  • The one in the initial anwser is for unquoted jsons which happens some times. this one:

    ({((\\?\".*?\\?\")*?)})

    Works for quoted jsons and unquoted jsons

    Besides this improved it a bit, to keep it simpler.. as you can have integer and normal values.. anything within string literals will be ignored due too the double capturing group.

    https://regex101.com/r/kPSc0i/1