Tags: apache-spark, real-time, apache-storm

Spark Streaming: recovering from a stoppage


I'm looking for a way to stream log data from files into our database. I've been reading about how Spark Streaming and Storm handle real-time processing, but I don't know how to deal with data that went unprocessed because of a stoppage.

I mean, say the system is running and data is being processed in real time, then it suddenly stops and restarts after 10 minutes. Is there a way to process this pending data without affecting the real-time stream?

Thanks


Solution

  • For example, with Storm you need to read from a reliable data source, one that holds the incoming messages and allows the consumer to continue from the point where it stopped. An example of such a data source is Kafka.

    In the case of Kafka, the live stream will not stop just because your consumers (Storm, Spark, or whatever you are using) stop. Kafka will continue to receive messages and will keep supplying them to the clients that subscribe to a particular stream.

    The key to fault tolerance lies in the system you choose to distribute your live stream, not in the tools you choose to process it. Your processing tools can always recover from the point where they stopped and continue processing, as long as the messaging system allows it.

    Another message broker that can handle consumer failure is RabbitMQ.
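To make the Kafka point concrete, here is a minimal toy sketch (plain Python, not the real Kafka API; all names like `Log` and `Consumer` are made up for illustration) of the mechanism involved: the log retains every message, the consumer periodically commits its offset to a durable store, and a restarted consumer resumes from the last committed offset while producers keep appending.

```python
# Toy simulation of Kafka-style offset-based recovery.
# NOT the Kafka client API: Log, Consumer, and the offset dict are
# illustrative stand-ins for a topic partition, a consumer, and the
# broker-side committed offsets.

class Log:
    """Stands in for a topic partition: producers append, nothing is deleted."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def read_from(self, offset):
        return self.messages[offset:]


class Consumer:
    """Processes messages and commits its offset so a restart can resume."""
    def __init__(self, log, committed_offsets, group):
        self.log = log
        self.committed = committed_offsets  # durable store, survives restarts
        self.group = group

    def poll_and_process(self, handler, crash_after=None):
        offset = self.committed.get(self.group, 0)
        for i, msg in enumerate(self.log.read_from(offset)):
            if crash_after is not None and i >= crash_after:
                return  # simulated stoppage; committed offset is left as-is
            handler(msg)
            self.committed[self.group] = offset + i + 1  # commit after processing


log = Log()
for n in range(5):
    log.append(f"event-{n}")

offsets = {}  # in real Kafka this lives on the broker, per consumer group
seen = []

c = Consumer(log, offsets, "etl")
c.poll_and_process(seen.append, crash_after=2)  # consumer dies after 2 messages

log.append("event-5")  # the live stream keeps flowing while the consumer is down

c2 = Consumer(log, offsets, "etl")  # restart "10 minutes later"
c2.poll_and_process(seen.append)    # resumes at the committed offset
```

After the restart, `seen` contains all six events exactly once: the two processed before the crash and the four that were pending, including the one that arrived during the outage. This is why the recovery story belongs to the messaging system rather than to Spark or Storm.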
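RabbitMQ achieves a similar guarantee through consumer acknowledgements: a message handed to a consumer stays pending until acked, and if the consumer's connection drops, the broker requeues everything unacknowledged. The sketch below is again a toy simulation in plain Python, not the real RabbitMQ (pika) API; `Broker`, `deliver`, and `on_consumer_disconnect` are invented names for illustration.

```python
# Toy simulation of RabbitMQ-style ack/redelivery.
# NOT the pika API: Broker and its methods are illustrative stand-ins
# for a queue, basic_ack, and the broker's reaction to a dropped channel.

from collections import deque

class Broker:
    def __init__(self):
        self.queue = deque()
        self.unacked = {}  # delivery_tag -> message, pending confirmation
        self.next_tag = 0

    def publish(self, msg):
        self.queue.append(msg)

    def deliver(self):
        """Hand the next message to a consumer; it stays pending until acked."""
        if not self.queue:
            return None
        msg = self.queue.popleft()
        self.next_tag += 1
        self.unacked[self.next_tag] = msg
        return self.next_tag, msg

    def ack(self, tag):
        del self.unacked[tag]  # processing confirmed, broker can forget it

    def on_consumer_disconnect(self):
        """Requeue everything the dead consumer never acknowledged."""
        for msg in self.unacked.values():
            self.queue.appendleft(msg)
        self.unacked.clear()


b = Broker()
b.publish("line-1")
b.publish("line-2")

tag, msg = b.deliver()       # consumer receives "line-1" ...
b.on_consumer_disconnect()   # ... and crashes before acking; broker requeues it

processed = []
while (d := b.deliver()) is not None:  # restarted consumer drains the queue
    tag, msg = d
    processed.append(msg)
    b.ack(tag)
```

The crashed delivery is not lost: the requeued message is redelivered to the restarted consumer, so both lines end up processed. Note the flip side of this model: a message processed but not yet acked will be redelivered, so consumers should be idempotent.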