amazon-s3, aws-glue, amazon-kinesis-firehose

AWS Glue - how to crawl a Kinesis Firehose output folder from S3


I have what I think should be a relatively simple use case for AWS Glue, yet I'm having a lot of trouble figuring out how to implement it.

I have a Kinesis Firehose delivery stream dumping streaming data into an S3 bucket. These files consist of a series of discrete web browsing events represented as JSON documents with varying structures (so, say, one document might have the field 'date' but not 'name', whereas another might have 'name' but not 'date').

I wish to run hourly ETL jobs on these files, the specifics of which are not relevant to the matter at hand.

I'm trying to run an S3 Data Catalog crawler, and the problem I'm running into is that the Kinesis output format is not, in itself, valid JSON, which is just baffling to me. Instead it's a bunch of JSON documents separated by line breaks. The crawler can automatically identify and parse JSON files, but it cannot parse this.
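To illustrate, a delivered file looks something like this (the real documents are richer; the point is one JSON document per line, with fields varying between documents, rather than a single JSON array):

```
{"date": "2018-06-01T12:00:00Z"}
{"name": "page_view"}
```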

I thought of writing a Lambda function to 'fix' the Firehose files, triggered by their creation in the bucket, but that feels like a cheap workaround for two pieces that should fit together neatly.

Another option would be to bypass the Data Catalog altogether and do the necessary transformations in the Glue script itself, but I have no idea how to get started on this.
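For what it's worth, my rough (untested) understanding is that a Glue job could read the files straight off S3 with Spark, since Spark's JSON reader expects exactly this one-document-per-line layout and fills missing fields with NULLs. The bucket path below is a placeholder:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Spark reads newline-delimited JSON natively; documents that lack a
# field simply get NULL for it in the merged schema.
events = spark.read.json("s3://my-firehose-bucket/2018/06/01/*")
events.printSchema()
```

I just don't know whether that is the idiomatic way to do it in Glue.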

Am I missing anything? Is there an easier way to parse Firehose output files or, failing that, to bypass the need for a crawler?

cheers and thanks in advance


Solution

  • I managed to fix this; basically the problem was that not every JSON document had the same underlying structure.

    I wrote a Lambda script as part of the Kinesis pipeline that forced every document into the same structure by adding NULL fields where necessary. The crawler was then able to correctly parse the resulting files and map them to a single table.
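
    A minimal sketch of that approach, with a placeholder field list (the real one would come from your event schema); the record envelope is the standard Firehose data-transformation contract, which hands records to the function base64-encoded and expects them back the same way:

    ```python
    import base64
    import json

    # Placeholder field list; substitute the fields your events can carry.
    EXPECTED_FIELDS = ["date", "name"]

    def lambda_handler(event, context):
        """Firehose data-transformation hook: give every JSON document the
        same set of fields, filling the missing ones with JSON null."""
        output = []
        for record in event["records"]:
            doc = json.loads(base64.b64decode(record["data"]))
            for field in EXPECTED_FIELDS:
                doc.setdefault(field, None)  # becomes null in the output JSON
            # Re-encode with a trailing newline so the delivered file stays
            # one document per line.
            payload = (json.dumps(doc) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload).decode("utf-8"),
            })
        return {"records": output}
    ```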