scala · apache-spark · checkpointing

Is checkpointing necessary in Spark Streaming?


I have noticed that Spark Streaming examples also include code for checkpointing. My question is: how important is that checkpointing? If it's there for fault tolerance, how often do faults actually occur in such streaming applications?


Solution

  • It all depends on your use case. Suppose you are running a streaming job that just reads data from Kafka and counts the number of records. What would you do if your application crashed after a year or so?

    • If you don't have a backup/checkpoint, you will have to recompute the entire year's worth of data before you can resume counting.
    • If you have a backup/checkpoint, you can simply read the checkpoint data and resume almost instantly.

    On the other hand, if all your streaming application does is Read-Messages-From-Kafka >>> Transform >>> Insert-into-a-Database, you need not worry much about crashes. Even if the application crashes, you can simply restart it without any loss of data.

    Note: Checkpointing is a process that stores the current state of a Spark application so it can be recovered after a failure.

    As for how often faults occur: you can almost never predict an outage. In practice,

    • there might be a power outage
    • there is regular maintenance/upgrading of the cluster

    Hope this helps.
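To make the stateful example above concrete, here is a minimal sketch of enabling checkpointing in a Spark Streaming (DStream) job. The checkpoint directory, app name, and socket source are placeholder assumptions; in production the directory should point at a fault-tolerant store such as HDFS or S3. The key API is `StreamingContext.getOrCreate`: on the first run it invokes the factory function, and on restart it rebuilds the context (including running counts) from the checkpoint data instead.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedWordCount {
  // Hypothetical checkpoint location -- use HDFS/S3 in production.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  // Factory used only when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedWordCount")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    // Placeholder source; a Kafka direct stream would work the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    // updateStateByKey keeps a running count across batches -- exactly
    // the kind of state that would be lost without a checkpoint.
    val counts = lines
      .flatMap(_.split(" "))
      .map((_, 1L))
      .updateStateByKey[Long]((values, state) =>
        Some(values.sum + state.getOrElse(0L)))

    counts.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if present, otherwise create fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

If this job crashes and is restarted, `getOrCreate` restores the word counts from `checkpointDir` rather than recomputing them from scratch, which is the difference the answer describes.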