amazon-web-services spark-streaming amazon-kinesis amazon-kinesis-kpl resiliency

Kinesis Producer callback functions - guaranteed delivery?

Streaming to Kinesis billions of messages a day.

We're looking for an implementation that would allow us to deliver messages to Kinesis with exactly-once guarantee.

Our producer framework requires a streaming sink to be idempotent for exactly-once delivery guarantee, which Kinesis is not. So we're getting at-least once deliveries currently. (duplicates are possible and we do see them, when a streaming micro-batch has to restart for whatever reason on the producer side)

We started looking at Kinesis Producer Library (KPL) callback functions. Basically we would be tracking state of what messages were delivered and what not in DynamoDB based on a key that's present in each message. And if we know that a message was already sent, we will skip it for delivery re-attempt. Then it seems exactly-once is possible.. with two concerns:

1) The only question we have - how likely it is we would lose a invocation of the callback function (e.g. network glitch etc), or the callback function itself has failed (e.g. we ran into a DynamoDB limit/ outage etc) -- is this documented somewhere? I know the chances are not high, but we want to design a system that would be resilient to some expected things like these.

2) Timing. Let's say if for whatever reason Kinesis invoked a callback function with a delay (5-15 milliseconds would be enough to break some assumptions in the above callback functions that persists delivery state in DynamoDB). And while we haven't received a confirmation on the delivery, our streaming producer framework has attempted redelivery that it thinks wasn't yet delivery. Any workarounds for this potential issue?

ps. We know that one way to workaround, is to make dedups on an application side (receiver from that kinesis stream), but that's outside of our project and we have a hard requirement to get exactly-once into that Kinesis stream.

Solution

For #1, any path you go down you'll find yourself in edge cases that could lead you to loss of data, or duplicate calls. Even using a two phased commit protocol doesn't work here if the consumer isn't participating in that protocol.

For #2, Kinesis is ordered, so if you do get duplicates you should be able to reliably assume they will be on the same shard, and thus not processed while another reader is still processing (assuming one reader per shard). Just make sure you are using a strongly consistent read when calling DynamoDB.