
Parallelism behaviour of stream processing engines


I have been learning Storm and Samza in order to understand how stream processing engines work, and I realized that both are standalone applications: to process an event, I first need to add it to a queue that the stream processing engine is also connected to. That means I add the event to a queue (itself a standalone application, say Kafka), Storm picks the event up from the queue, and a worker process handles it. If I have multiple bolts, each bolt may be executed by a different worker process. (This is one of the things I don't really understand; I have seen a company use more than 20 bolts in production, with each event transferred between bolts along a certain path.)
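For concreteness, here is a minimal sketch of what wiring such a topology looks like with Storm's Java API (Storm 2.x package names; `EventSpout`, `ProcessBolt`, and the parallelism numbers are hypothetical placeholders, with the spout standing in for something like a KafkaSpout reading from the queue):

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class EventTopology {

    // Stand-in for a real source such as a KafkaSpout reading from the queue.
    public static class EventSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map<String, Object> conf, TopologyContext context,
                         SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100); // throttle the demo source
            collector.emit(new Values("event-" + System.nanoTime()));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("event"));
        }
    }

    // Stand-in for real processing; each executor runs inside some worker process.
    public static class ProcessBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String event = tuple.getStringByField("event");
            collector.emit(new Values(event.toUpperCase())); // placeholder work
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("processedEvent"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("eventSpout", new EventSpout(), 2);   // 2 executors
        builder.setBolt("processBolt", new ProcessBolt(), 4)   // 4 executors
               .shuffleGrouping("eventSpout");                 // tuples routed between executors

        Config conf = new Config();
        conf.setNumWorkers(3); // worker JVMs, possibly on different nodes
        StormSubmitter.submitTopology("event-topology", conf, builder.createTopology());
    }
}
```

The parallelism hints and `setNumWorkers` are what spread the spout and bolt executors across worker processes, which is why a tuple may cross process (and node) boundaries on its way between bolts.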

However, I don't really understand why I would need such complex systems. The process involves too many IO operations (my program -> queue -> Storm -> bolts), and it makes the whole thing much harder to control and debug.

Instead, if I'm collecting the data from web servers, why not just use the same nodes for event processing? The work would already be distributed across the nodes by the load balancers I use for the web servers. I could create executors on the same JVM instances and send the events from the web server to an executor asynchronously, without any extra IO requests. I could also watch the executors from within the web servers and make sure an executor has actually processed each event (an at-least-once or exactly-once processing guarantee). Managing my application this way would be a lot easier, and since little IO is required, it would be faster than sending the data to another node over the network (which is also not reliable) and processing it there.
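Something like this minimal sketch is what I have in mind (the class and method names are just illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class InProcessEventProcessor {

    // Bounded pool sharing the web server's JVM; events never leave the node.
    private final ExecutorService executor = Executors.newFixedThreadPool(8);

    // Called from the request-handling path; returns immediately.
    // The caller can inspect the Future (and resubmit on failure) to
    // approximate an at-least-once guarantee.
    public Future<?> submitEvent(String event) {
        return executor.submit(() -> process(event));
    }

    private void process(String event) {
        // Placeholder for the actual event-processing logic.
        System.out.println("processed " + event);
    }

    public void shutdown() {
        executor.shutdown();
    }
}
```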

Most probably I'm missing something here, because I know that many companies actively use Storm, and many people I know recommend Storm or other stream processing engines for real-time event processing, but I just don't understand it.


Solution

  • My understanding is that the goal of using a framework like Storm is to offload the heavy processing (whether CPU-bound, I/O-bound, or both) from the application/web servers and keep them responsive.

    Consider that each application server may have to serve a large number of concurrent requests, not all of them related to stream processing. If the app server is already processing a significant load of events, it could become a bottleneck for lighter requests, as the server's resources (CPU, memory, disk contention, etc.) will already be tied up by the heavier processing requests.

    If the actual load you need to handle isn't that heavy, or if it can simply be absorbed by adding app server instances, then of course it doesn't make sense to complicate your architecture/topology, which could in fact slow the entire thing down. It really depends on your performance and load requirements, as well as on how much (virtual) hardware you can throw at the problem. As usual, benchmarking against your expected load will help you decide which way to go.