I come from a batch processing background, but recently started working exclusively with streaming. I found out that in Apache Spark Structured Streaming, for example, we can use `Trigger.AvailableNow()`, which processes all available data and then stops when there's no more. Its advantages over batch seem to be:
- It checkpoints, so you won't get into the situation you face in batch where a job ran for 5 out of its 6 hours, failed, and now you have to restart it and wait another 6 hours. You can simply restart from the checkpoint and continue where you left off.
- `Trigger.AvailableNow()` also breaks the input data into microbatches, so I guess there's a much smaller chance that an OutOfMemoryException will occur.

Am I right?
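Here's roughly the pattern I mean (a minimal sketch; the paths, format, and app name are made up, and it assumes Spark 3.3+, where `Trigger.AvailableNow()` was introduced):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("availableNowDemo").getOrCreate()

// Streaming file sources need an explicit schema; borrowing it from a batch read here
val schema = spark.read.parquet("/data/landing").schema

val input = spark.readStream
  .format("parquet")
  .schema(schema)
  .load("/data/landing")

// Process everything that is available right now, in microbatches, then stop.
// Progress is recorded in the checkpoint, so a failed run resumes where it left off.
val query = input.writeStream
  .format("parquet")
  .option("path", "/data/output")
  .option("checkpointLocation", "/checkpoints/availableNowDemo")
  .trigger(Trigger.AvailableNow())
  .start()

query.awaitTermination()
```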
So, with all this in mind, I'd like someone more experienced to explain to me: what is the point of batch processing? Can't we do everything with streaming, with more advantages? I guess I'm missing certain technical details.
Maybe streaming has less connector support for different sources, but can't that be easily resolved?
Asked some colleagues, got these examples:
The Postgres connector, for example, doesn't support `sparkSession.readStream`, only `sparkSession.read`.
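To illustrate what they meant (a sketch; the connection details are placeholders, and the Postgres JDBC driver is assumed to be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbcDemo").getOrCreate()

// Batch read from Postgres via JDBC works (connection details are placeholders)
val batchDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")
  .option("dbtable", "public.orders")
  .option("user", "postgres")
  .option("password", "secret")
  .load()

// The streaming equivalent is rejected, because the jdbc source only implements
// the batch read path; Spark fails with something like
// "Data source jdbc does not support streamed reading":
// val streamDf = spark.readStream
//   .format("jdbc")
//   .option("url", "jdbc:postgresql://localhost:5432/mydb")
//   .option("dbtable", "public.orders")
//   .load()
```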
You can't do window functions properly with microbatches. For example, if a window function needs to be applied to the whole dataset, you can't do that with streaming, or if you can, it is extremely hard. Although if your version of Spark supports MERGE INTO, it can be done in some way.
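And for the window function case, this is how I understand it (column names and paths are made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val spark = SparkSession.builder().appName("windowDemo").getOrCreate()

val batchDf = spark.read.parquet("/data/events") // path and columns are made up

// Batch: ranking each user's events across the WHOLE dataset is fine,
// because every row is present when the window is computed
val ranked = batchDf.withColumn(
  "rank",
  row_number().over(Window.partitionBy(col("userId")).orderBy(col("ts").desc))
)

// Streaming: the same expression fails at analysis time with something like
// "Non-time-based windows are not supported on streaming DataFrames/Datasets",
// since a microbatch never sees the full partition at once.
```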
I'm sure there are other things as well; I hope somebody posts a more comprehensive list.