I am trying to learn the parallelism and scalability features offered by Storm and read the following article http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html. I am confused that whether Storm supports data or task parallelism. What I could understand ( I may be wrong) is that Storm supports task parallelism (since the degree of parallelism is restricted by the number of tasks in the topology). If this is the case then how can it be used for large scale parallel data processing which requires data parallelism.
Any help would be greatly appreciated. Thanks :)
Storm does not follow text book terminology. In fact, Storm does support data, task, and pipelined parallelism.
If you have an operator and assign a parallelism larger than one (parallelism_hint
) you get as many threads as specified by the parameter, each executing the same code on different data, ie, you get data parallelism. You can further assign parameter number_of_tasks
(which must be >= parallelism_hint
) to split the input data into number_of_task
partitions/substreams (ie, more partitions than executors). Thus, some executor threads need to process multiple partitions/substreams (called tasks in Storm). This does not increase the parallelism (maybe concurrency). However, it allows to change the number of executor at runtime.
As you have multiple spouts and bolts in your topology and all those spouts and bolt are executed in different thread and even different machines, you have task parallelism here (not to confuse with Storm's usage of the term task!). As there are produce/consumer relationships between spouts/bolts you also get pipeline parallelism hers, which is a special form of task parallelism. Another form of task parallelism in Storm is the ability to run multiple topology at the same time.