Tags: amazon-web-services, amazon-s3, aws-lambda, amazon-kinesis-firehose

Push Firehose messages to an S3 bucket with minimal PUTs


We have an AWS Kinesis stream that ingests around 15 small binary messages per second. As a last resort data recovery strategy, we'd like to dump all messages received in an S3 bucket with 1-2 weeks TTL.

We could use a Lambda function to dump every Kinesis message to a new file in S3. But many small PUTs are expensive, especially since this data will rarely be accessed (and only manually if so).

Alternatively, AWS Firehose would aggregate messages for us and push them to S3 as a single S3 object. But as I understand it - please do correct me - Firehose simply concatenates records, so this doesn't work when messages are binary and logically separate (unlike lines in a log file).

My current thought is to attach a Lambda function to Firehose: Firehose aggregates records over X minutes, the function zips/tars them up (one file per record), and the result is sent to S3 as a single archive.

Is this appropriate? If so, how do we aggregate records in Lambda? Since we're processing many records into one, I'm unsure what result/status codes to pass back to Firehose. (The AWS ecosystem is very new to me, so I think I might've missed the obvious solution.)
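For reference, the archiving step I have in mind looks roughly like this - a minimal sketch that packs each binary record into its own file inside one in-memory tar archive, so logically separate messages stay separate (file names are arbitrary placeholders):

```python
import io
import tarfile

def records_to_tar(records):
    """Pack each binary record into its own file inside one in-memory
    gzip-compressed tar archive, so logically separate binary messages
    stay separate instead of being concatenated."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for i, payload in enumerate(records):
            # One file per record; the naming scheme is just an example.
            info = tarfile.TarInfo(name=f"record-{i:06d}.bin")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()
```

The resulting bytes would then be PUT to S3 as a single object.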


Solution

  • If you can accept 1 week TTL, you can increase the data retention period of the stream and not bother with any other storage mechanism.

    If you're willing to pay for 86,400 PUTs per day ($0.43), you can have the stream trigger your Lambda function. Note that you may actually be called more often, as there's a maximum size for each event, and each shard is invoked separately.
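A stream-triggered handler might look like the sketch below: each invocation receives a batch of records from one shard, and the batch is length-prefixed (so binary records stay separable) and written as a single S3 object. The bucket name and key layout are hypothetical, and boto3 is imported lazily inside the handler so the pure helpers are illustrative on their own:

```python
import base64
import datetime
import struct

def frame_records(payloads):
    """Length-prefix each record (4-byte big-endian size) so binary
    messages can be split apart again on recovery."""
    return b"".join(struct.pack(">I", len(p)) + p for p in payloads)

def batch_key(shard_id, now):
    """Hypothetical S3 key layout: one object per invocation, prefixed
    by date and shard so recovery can list by day."""
    return f"kinesis-backup/{now:%Y/%m/%d}/{shard_id}-{now:%H%M%S%f}"

def handler(event, context):
    # Kinesis delivers record data base64-encoded in the Lambda event.
    payloads = [base64.b64decode(r["kinesis"]["data"]) for r in event["Records"]]
    shard_id = event["Records"][0]["eventID"].split(":")[0]
    import boto3  # imported lazily so the helpers above run offline
    boto3.client("s3").put_object(
        Bucket="my-recovery-bucket",  # hypothetical bucket name
        Key=batch_key(shard_id, datetime.datetime.now(datetime.timezone.utc)),
        Body=frame_records(payloads),
    )
```

A lifecycle rule on the bucket prefix would handle the 1-2 week TTL.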

    If you want more control, I recommend invoking your Lambda function from a CloudWatch Scheduled Event. I believe that the minimum interval is one minute. However, if you do this you'll also need to preserve the shard offsets (for example, in DynamoDB) and be prepared for resharding.
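A scheduled poller along those lines might be sketched as follows. The checkpoint table and stream name are hypothetical; the key point is resuming each shard from its stored sequence number, or from `TRIM_HORIZON` for a shard with no checkpoint yet (which is also what you'd hit after resharding creates new shards):

```python
def iterator_args(stream, shard_id, checkpoint):
    """Build get_shard_iterator kwargs: resume after the last stored
    sequence number, or start from the oldest record for a new shard."""
    args = {"StreamName": stream, "ShardId": shard_id}
    if checkpoint:
        args["ShardIteratorType"] = "AFTER_SEQUENCE_NUMBER"
        args["StartingSequenceNumber"] = checkpoint
    else:
        args["ShardIteratorType"] = "TRIM_HORIZON"
    return args

def handler(event, context):
    import boto3  # imported lazily so the helper above runs offline
    kinesis = boto3.client("kinesis")
    table = boto3.resource("dynamodb").Table("shard-checkpoints")  # hypothetical table
    stream = "my-stream"  # hypothetical stream name
    for shard in kinesis.list_shards(StreamName=stream)["Shards"]:
        shard_id = shard["ShardId"]
        item = table.get_item(Key={"shard_id": shard_id}).get("Item")
        checkpoint = item["sequence_number"] if item else None
        it = kinesis.get_shard_iterator(
            **iterator_args(stream, shard_id, checkpoint))["ShardIterator"]
        resp = kinesis.get_records(ShardIterator=it)
        if resp["Records"]:
            # ... write the batch to S3 as a single object ...
            table.put_item(Item={
                "shard_id": shard_id,
                "sequence_number": resp["Records"][-1]["SequenceNumber"],
            })
```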

    As you noted, Firehose doesn't support a many-to-one transformation. However, you could use Lambda to take the input records, Base64-encode them, and append a newline (note that you'd Base64-encode twice: once as the record transformation, and once to prepare the result for Firehose).
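    That double encoding can be sketched as a Firehose data-transformation Lambda. Firehose already hands each record's data to the function base64-encoded, so that encoded text plus a newline becomes the transformed payload, which is then base64-encoded again for the response envelope; `result: "Ok"` per record tells Firehose the transformation succeeded:

    ```python
    import base64

    def handler(event, context):
        output = []
        for record in event["records"]:
            # record["data"] is already the base64 text of the raw binary
            # payload; keep it as-is and append a newline as the delimiter.
            line = record["data"].encode() + b"\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",  # one status per record, same recordId back
                # Second encoding: the response envelope must be base64 too.
                "data": base64.b64encode(line).decode(),
            })
        return {"records": output}
    ```

    On recovery, each line of the delivered S3 object is one Base64 string that decodes back to the original binary message.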