Given that I have 3 consumers for a single hub, all using the same consumer group, let's say that client A got event 1, client B got event 2, and client C got event 3.
In a twist of fate, client C, who got its event last (event 3), was actually the first to complete its job and call UpdateCheckpointAsync.
Does that mean that even if Client A and B fail to complete their job, and do not update the checkpoint, their events are no longer available?
Edit 1:
I set up the following experiment:
Expectation:
Run
traces
| where message == "Event received"
| summarize count() by bin(timestamp,1s), cloud_RoleInstance
| render timechart
and see something like
(please notice how both pods are posting logs at the same time, indicating that they are both receiving and processing events)
but instead I am seeing this:
which leads me to believe that one machine is sitting idle while the other is doing all the work!
There's some potentially confusing terminology being used, so I'm going to make my best guess at the scenario. Please correct me if I'm misinterpreting:
You're using 3 instances of EventProcessorClient configured to read from the same Event Hub and belonging to the same consumer group.
The phrase "3 consumers" was meant to indicate those processors; there are no EventHubConsumerClient instances involved in the scenario.
Client A, Client B, and Client C refer to those EventProcessorClient instances; they are not external clients that the processor is delegating to.
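For reference, here's a minimal sketch of what I'm assuming each of those three instances looks like (the connection-string placeholders and the "checkpoints" container name are illustrative, not taken from your setup):

```csharp
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Consumer;
using Azure.Storage.Blobs;

// Each pod/instance constructs its own EventProcessorClient, but all of them
// point at the same Event Hub, the same consumer group, and the same
// checkpoint store.
var checkpointStore = new BlobContainerClient(
    "<storage-connection-string>",
    "checkpoints");

var processor = new EventProcessorClient(
    checkpointStore,
    EventHubConsumerClient.DefaultConsumerGroupName,  // or your custom consumer group
    "<event-hub-connection-string>",
    "<event-hub-name>");
```

Because all three instances share the same blob container, they use it both to record checkpoints and to coordinate which instance owns which partition.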
Assuming that my interpretation is correct, the important thing to note is that each partition of the Event Hub will be owned by one, and only one, event processor.
During normal operation, if Client A reads event 1, Client B reads event 2, and Client C reads event 3, then each of those events came from a different partition. Checkpoints are also scoped to a partition, so A, B, and C are not overwriting one another; each is working against a checkpoint unique to that partition.
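To make that concrete (the handler body and logging below are just a sketch, not your code), both the source partition and the checkpoint you write are exposed per partition in the ProcessEventAsync handler:

```csharp
processor.ProcessEventAsync += async args =>
{
    // Guard against ticks with no event (only happens when MaximumWaitTime is set).
    if (!args.HasEvent)
    {
        return;
    }

    // The partition this event was read from; a given partition is dispatched
    // only to the single processor that currently owns it.
    string partitionId = args.Partition.PartitionId;

    Console.WriteLine(
        $"Event received: partition {partitionId}, sequence {args.Data.SequenceNumber}");

    // Writes a checkpoint for this partition only; checkpoints for partitions
    // owned by the other processors are untouched.
    await args.UpdateCheckpointAsync(args.CancellationToken);
};
```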
There are a couple of caveats, however:
There can be short periods of overlap where multiple processors are emitting events for the same partition. This happens when the number of processors scales up/down and partition ownership transitions. During this window (typically 30 seconds or less) it is possible to overwrite a checkpoint with an earlier location - but the rollback would be limited to the number of events that your application processes in that period.
If your event handler for a processor does not perform exception handling and throws, the processor will not rewind to the checkpoint; it will read the next event in sequence when the partition processing task restarts. (This is intended to avoid a poison event blocking forward progress; see the sketch at the end of this answer.)
I'd highly recommend checking out the docs for processor event handlers, if you have not already done so.
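To illustrate the second caveat, here's a rough sketch of defensive handling inside the event handler, continuing the construction sketch above. HandleEventAsync stands in for your application logic, and what you do in the catch block (retry, dead-letter, log and move on) is an application decision:

```csharp
processor.ProcessEventAsync += async args =>
{
    try
    {
        await HandleEventAsync(args.Data);  // hypothetical application logic
        await args.UpdateCheckpointAsync(args.CancellationToken);
    }
    catch (Exception ex)
    {
        // If this exception escaped the handler, the processor would not rewind
        // to the last checkpoint; processing resumes with the next event, so the
        // failed event would effectively be skipped. Handle it here so you can
        // decide how to retry or record it.
        Console.WriteLine(
            $"Failed on partition {args.Partition.PartitionId}: {ex.Message}");
    }
};

processor.ProcessErrorAsync += args =>
{
    // Surfaces errors from the processor's own infrastructure (load balancing,
    // connectivity), not exceptions thrown by your ProcessEventAsync code.
    Console.WriteLine(
        $"Error for partition '{args.PartitionId}': {args.Exception.Message}");
    return Task.CompletedTask;
};
```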