Search code examples
apache-storm

Storm message timeout works improperly


I found the following when I used the debug mode. TOPOLOGY_MESSAGE_TIMEOUT_SECS is set to 90s. Spout should not get more than 90s from sending to getting a fail message, but why is 135s in the log?

2019-05-14 16:53:12.037 o.a.s.s.CheckpointSpout Thread-13-$checkpointspout-executor[1 1] [DEBUG] Current state CheckPointState{txid=7, state=COMMITTING}, emitting txid 7, action COMMIT
2019-05-14 16:55:27.097 o.a.s.s.CheckpointSpout Thread-13-$checkpointspout-executor[1 1] [DEBUG] Got fail with msgid 7
2019-05-14 16:55:27.097 o.a.s.s.CheckpointSpout Thread-13-$checkpointspout-executor[1 1] [DEBUG] Checkpoint failed, will trigger recovery

Solution

  • The message timeout is not a hard limit. The messages may take up to 2x the timeout to actually time out. This is due to a performance optimization, where instead of timing out tuples every second, we have two buckets. When a tuple is created, it gets put into bucket 1. Once the timeout has passed, we rotate all bucket 1 tuples into bucket 2, and fail all bucket 2 tuples. This lets us guarantee that a tuple gets at least the full message timeout to complete, while also being cheap to compute.