Search code examples
apache-stormdistributed-computing

Where does Apache Storm store tuples before a node is available to process it?


I am reading up on Apache Storm to evaluate if it is suited for our real time processing needs.

One thing that I couldn't figure out until now is — Where does it store the tuples during the time when next node is not available for processing it. For e.g. Let's say spout A is producing at the speed of 1000 tuples per second, but the next level of bolts(that process spout A output) can only collectively consume at a rate of 500 tuples per second. What happens to the other tuples ? Does it have a disk-based buffer(or something else) to account for this ?


Solution

  • Storm used internal in-memory message queues. Thus, if a bolt cannot keep up processing, the messages are buffered there.

    Before Storm 1.0.0 those queues may grow out-of-bound (ie, you get an out-of-memory exception and your worker dies). To protect from data loss, you need to make sure that the spout can re-read the data (see https://storm.apache.org/releases/1.0.0/Guaranteeing-message-processing.html)

    You could use "max.spout.pending" parameter, to limit the tuples in-flight to tackle this problem though.

    As of Storm 1.0.0, backpressure is supported (see https://storm.apache.org/2016/04/12/storm100-released.html). This allows bolt to notify its upstream producers to "slow down" if a queues grows too large (and speed up again in a queues get empty). In your spout-bolt-example, the spout would slow down to emit messaged in this case.