Tags: amazon-web-services, amazon-s3, amazon-kinesis, amazon-kinesis-firehose

How to replay in a stream data pushed to S3 from AWS Firehose?


There are plenty of examples of how data is stored by AWS Firehose to an S3 bucket and, in parallel, passed to some processing app (as in the pipeline diagram above).

But I can't find anything about good practices for replaying this data from the S3 bucket in case the processing app crashed and we need to supply it with historical data, which we have in S3 but which is no longer in Firehose.

I can think of replaying it with Firehose or Lambda, but:

  1. Kinesis Firehose cannot consume from an S3 bucket.
  2. A Lambda would need to deserialize the .parquet files to send the records to Firehose or a Kinesis Data Stream. This implicit deserialization confuses me, because Firehose serialized the data explicitly.

Or maybe there is some other way to put data from S3 back into a stream that I am completely missing?

EDIT: Moreover, if we run a Lambda to push records to the stream, it will probably have to run for more than 15 minutes (the Lambda timeout). Another option is to run a script on a separate EC2 instance. But these methods of extracting data from S3 look so much more complicated than storing it there with Firehose that it makes me think there should be some easier approach.


Solution

  • The problem that tripped me up was actually that I expected some more advanced serialization than just converting to JSON (Kafka supports Avro, for example).

    Regarding replaying records from the S3 bucket: this part of the solution is significantly more complicated than the part needed for archiving records. While archiving the stream works with out-of-the-box Firehose functionality, replaying it requires two Lambda functions and two streams.

    1. Lambda 1 (pushes file names to the first stream)
    2. Lambda 2 (activated for every file name in the first stream, pushes records from that file to the second stream)

    The first Lambda is triggered manually, scans through all the files in the S3 bucket, and writes their names to the first stream (a sketch follows below). The second Lambda is triggered by every event in the file-name stream, reads all the records in that file, and sends them to the final stream, from which they can be consumed by Kinesis Data Analytics or another Lambda.
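    A minimal sketch of "Lambda 1", assuming a bucket named firehose-archive and a Kinesis Data Stream named filenames-stream (both names are hypothetical placeholders):

```python
import json
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

BUCKET = "firehose-archive"            # hypothetical archive bucket
FILENAMES_STREAM = "filenames-stream"  # hypothetical first stream


def handler(event, context):
    """Scan the whole bucket and push every object key to the first stream."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            kinesis.put_record(
                StreamName=FILENAMES_STREAM,
                Data=json.dumps({"key": obj["Key"]}).encode("utf-8"),
                PartitionKey=obj["Key"],
            )
```

    If the bucket is large, this Lambda can also be pointed at a single prefix (e.g. one day's partition) per invocation to stay within the 15-minute limit.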

    This solution assumes that multiple files are generated per day and that every file contains multiple records.
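    And a sketch of "Lambda 2", assuming the archived files are .parquet and that pyarrow is packaged with the function (e.g. via a Lambda layer); again, bucket and stream names are hypothetical:

```python
import base64
import io
import json
import boto3
import pyarrow.parquet as pq

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

BUCKET = "firehose-archive"      # hypothetical archive bucket
REPLAY_STREAM = "replay-stream"  # hypothetical second (final) stream


def handler(event, context):
    """For each file name received from the first stream, read the parquet
    file from S3 and push its rows as JSON records to the replay stream."""
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        key = payload["key"]

        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        table = pq.read_table(io.BytesIO(body))
        rows = table.to_pylist()

        # Batch rows to limit API calls: put_records accepts up to 500
        # records per call.
        for i in range(0, len(rows), 500):
            kinesis.put_records(
                StreamName=REPLAY_STREAM,
                Records=[
                    {
                        "Data": json.dumps(row, default=str).encode("utf-8"),
                        "PartitionKey": key,
                    }
                    for row in rows[i:i + 500]
                ],
            )
```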

    This is similar to this solution, except that in my case the destination is Kinesis instead of DynamoDB as in the article.