Tags: c#, azure-eventhub, event-driven

Azure Event Hubs - Checkpoint per event?


Given that I have 3 consumers for a single hub, all in the same consumer group, let's say that client A got event 1, client B got event 2 and client C got event 3.
In a twist of fate, client C, which got the last event (event 3), was actually the first to complete its job and call UpdateCheckpointAsync.

Does that mean that, even if clients A and B fail to complete their jobs and do not update the checkpoint, their events are no longer available?

Edit 1:
I set up the following experiment:

  • Created an Azure Event Hub with 10 partitions
  • Created a single consumer group
  • Filled it with 10k messages
  • Created 2 containers (on AKS) that consume those events (using the same consumer group) and log them to Azure Application Insights

Expectation:
Run

traces
| where message == "Event received"
| summarize count() by bin(timestamp,1s), cloud_RoleInstance
| render timechart 

and see the logs from both pods intertwine
(that is, both pods posting logs at the same time, indicating that both are receiving and processing events)

but instead I see no intertwining,
which leads me to believe that one machine is sitting idle while the other does all the work!

Edit 2:
Here are three 10k-item runs, spaced by a short interval: [image]


Solution

  • There's some potentially confusing terminology being used, so I'm going to make my best guess at the scenario. Please correct me if I'm misinterpreting:

    • You're using 3 instances of EventProcessorClient configured to read from the same Event Hub and belonging to the same consumer group.

    • The phrase "3 consumers" was meant to indicate those processors; no EventHubConsumerClient instances are involved in the scenario.

    • Client A, Client B, and Client C refer to those EventProcessorClient instances, not to some external client that the processor delegates to.

    Assuming that my interpretation is correct, then the important thing to note is that each partition of the Event Hub will be owned by one, and only one, event processor.
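To make the "one owner per partition" rule concrete, here is a minimal Python model (not the Event Hubs SDK). The round-robin assignment and the helper names `assign_partitions` and `partitions_per_processor` are hypothetical; the real EventProcessorClient claims partitions dynamically through ownership records in its checkpoint store, but it aims for a similarly even split among the active processors.

```python
def assign_partitions(partition_ids, processor_ids):
    """Hypothetical round-robin assignment: every partition gets exactly one owner."""
    ownership = {}
    for index, partition_id in enumerate(partition_ids):
        ownership[partition_id] = processor_ids[index % len(processor_ids)]
    return ownership


def partitions_per_processor(ownership, processor_ids):
    """Count how many partitions each processor owns."""
    counts = {processor_id: 0 for processor_id in processor_ids}
    for owner in ownership.values():
        counts[owner] += 1
    return counts


partitions = [str(i) for i in range(10)]  # e.g. the 10-partition hub from Edit 1

two = assign_partitions(partitions, ["A", "B"])
print(partitions_per_processor(two, ["A", "B"]))  # {'A': 5, 'B': 5}

three = assign_partitions(partitions, ["A", "B", "C"])
print(partitions_per_processor(three, ["A", "B", "C"]))  # {'A': 4, 'B': 3, 'C': 3}
```

In this model, the 10 partitions and 2 processors from Edit 1 give each processor 5 partitions, and no partition ever has two owners.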

    During normal operation, if Client A reads event 1, Client B reads event 2, and Client C reads event 3, then each of those events came from a different partition. Checkpoints are also scoped to a partition, so A, B, and C are not overwriting one another; each is working against a checkpoint unique to its partition.
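The checkpoint scoping can be modeled the same way. This Python sketch uses a hypothetical in-memory store; `InMemoryCheckpointStore`, the hub name `my-hub`, the partition IDs `"0"` to `"2"` and the sequence number `41` are all made up. It only illustrates that a checkpoint is keyed by partition, so C's write cannot move A's or B's position.

```python
class InMemoryCheckpointStore:
    """Hypothetical stand-in for a real checkpoint store such as Azure Blob Storage."""

    def __init__(self):
        # One entry per (event hub, consumer group, partition), never one per event.
        self.checkpoints = {}

    def update_checkpoint(self, event_hub, consumer_group, partition_id, sequence_number):
        # Overwrites only this partition's checkpoint; other partitions are untouched.
        self.checkpoints[(event_hub, consumer_group, partition_id)] = sequence_number

    def get_checkpoint(self, event_hub, consumer_group, partition_id):
        # None means no checkpoint has been written for this partition yet.
        return self.checkpoints.get((event_hub, consumer_group, partition_id))


store = InMemoryCheckpointStore()

# Client C finishes event 3 first and checkpoints it on the partition it owns ("2").
# Sequence numbers are tracked per partition; 41 is an arbitrary example value.
store.update_checkpoint("my-hub", "$Default", "2", 41)

# Clients A and B (partitions "0" and "1") never checkpointed; C's write left
# their entries untouched.
for partition_id in ["0", "1", "2"]:
    print(partition_id, store.get_checkpoint("my-hub", "$Default", partition_id))
# 0 None
# 1 None
# 2 41
```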

    There are a couple of caveats, however:

    • There can be short periods of overlap where multiple processors are emitting events for the same partition. This happens when the number of processors scales up or down and partition ownership transitions. During this window (typically 30 seconds or less) it is possible to overwrite a checkpoint with an earlier location, but the rollback would be limited to the number of events that your application processes in that period.

    • If your event handler for a processor does not perform exception handling and throws, the processor will not rewind to the checkpoint; it will read the next event in sequence when the partition processing task restarts. (This is intended to prevent a poison event from blocking forward progress.)

      I'd highly recommend checking out the docs for processor event handlers, if you have not already done so.
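Since an exception thrown out of the handler means the processor moves on rather than rewinding, one common pattern is to catch failures inside the handler itself. The Python sketch below simulates that; the dead-letter list and the names `make_safe_handler`, `run_partition` and `process` are hypothetical. In the C# SDK the equivalent would be a try/catch inside the method attached to `ProcessEventAsync`. Advancing the checkpoint after a failed event is a design choice made here because the event has been parked rather than silently dropped.

```python
import logging


def make_safe_handler(process, dead_letters, checkpoints, partition_id):
    """Wrap business logic so no exception escapes to the (simulated) processor."""

    def handle(sequence_number, body):
        try:
            process(body)
        except Exception as error:
            # Park the failed event so it can be inspected or retried later.
            logging.warning("event %s failed: %s", sequence_number, error)
            dead_letters.append((partition_id, sequence_number, body))
        # The event was either processed or parked, so advance the checkpoint.
        checkpoints[partition_id] = sequence_number

    return handle


def run_partition(events, handler):
    """Simulated partition loop: it only reads forward and never rewinds."""
    for sequence_number, body in enumerate(events):
        handler(sequence_number, body)


def process(body):
    # Hypothetical business logic that fails on a "poison" event.
    if body == "poison":
        raise ValueError("cannot parse event")


dead_letters, checkpoints = [], {}
handler = make_safe_handler(process, dead_letters, checkpoints, partition_id="0")
run_partition(["ok-1", "poison", "ok-2"], handler)
print(dead_letters)  # [('0', 1, 'poison')]
print(checkpoints)   # {'0': 2}
```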