I have a Storm setup that picks up messages from a Kafka topic, processes them, and persists them. I want to understand how Storm guarantees message processing in such a scenario.
Consider the following scenario: I have configured multiple supervisors and workers for a Storm cluster. The KafkaSpout reads messages from the topic and passes each one on to a bolt. The bolt acks upon completion, and the spout moves forward to the next message.
I have 2 supervisors running, each with 3 workers. From what I understand, every worker on each supervisor is capable of processing a message.
So, at any given time, 6 messages are being processed in parallel in the Storm cluster. What if the second message fails, either due to a worker shutdown or a supervisor shutdown, while ZooKeeper already points to the 7th message for the consumer group? In such a scenario, how will the second message get processed?
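For reference, this is roughly how my topology is wired, assuming the older storm-kafka spout that tracks consumer offsets in ZooKeeper. The topic name, ZooKeeper connect string, and the PersistBolt class are placeholders:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;

public class PersistTopology {
    public static void main(String[] args) throws Exception {
        // Spout config: offsets for the consumer group are tracked in ZooKeeper
        SpoutConfig spoutConfig = new SpoutConfig(
                new ZkHosts("zookeeper:2181"), "my-topic", "/kafka-offsets", "my-consumer-group");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
        // PersistBolt (placeholder) processes and persists each message, then acks it
        builder.setBolt("persist-bolt", new PersistBolt(), 6)
               .shuffleGrouping("kafka-spout");

        Config conf = new Config();
        conf.setNumWorkers(6); // 2 supervisors x 3 workers
        StormSubmitter.submitTopology("kafka-persist-topology", conf, builder.createTopology());
    }
}
```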
I guess there is some misunderstanding. The following claims seem to be wrong:
=> A spout does not wait for acks; it fetches tuples over and over again at maximum speed, regardless of the processing speed of the bolts, as long as new messages are available in Kafka. (Or did you limit the number of tuples in flight via max.spout.pending?) Thus, many messages are processed in parallel (even if only #executors are given to a UDF, many other messages are buffered in internal Storm queues).
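If you want to bound how many tuples are in flight at once, you can do it in the topology config. A minimal sketch, assuming the standard topology settings (the values 500 and 30 are just examples):

```java
import org.apache.storm.Config;

public class TopologyConfigExample {
    public static Config buildConfig() {
        Config conf = new Config();
        // Maximum number of un-acked tuples each spout task may have in flight;
        // once this cap is reached, Storm stops calling nextTuple() on the spout
        // until some of the pending tuples are acked or failed.
        conf.setMaxSpoutPending(500);
        // A tuple that is neither acked nor failed within this timeout is
        // considered failed and gets replayed by the spout.
        conf.setMessageTimeoutSecs(30);
        return conf;
    }
}
```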
As far as I know (but I am not 100% sure), KafkaSpout "orders" the incoming acks and only moves the offset forward if all consecutive acks are available, i.e., message 7 is not acked to Kafka if the Storm ack of message 6 is not there yet. Thus, KafkaSpout can re-emit message 6 if it fails. Recall that Storm does not give any ordering guarantees.
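To show where those acks and fails come from on the bolt side, here is a sketch of a persisting bolt (PersistBolt and persist() are placeholder names). Calling collector.fail() is what causes the spout to replay the message; if the bolt crashes or a worker dies, the tuple simply times out and is replayed as well:

```java
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

import java.util.Map;

public class PersistBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        try {
            persist(tuple.getString(0)); // hypothetical persistence call
            collector.ack(tuple);        // ack -> spout can eventually commit the offset
        } catch (Exception e) {
            collector.fail(tuple);       // fail -> KafkaSpout re-emits this message
        }
    }

    private void persist(String message) {
        // write the message to the datastore; throw on error so the tuple is failed and replayed
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt: no output stream declared
    }
}
```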