Tags: apache-spark, bigdata, apache-flink, flink-streaming, spark-structured-streaming

S3-based streaming solution using Apache Spark or Flink


We have batch pipelines writing files (mostly CSV) into an S3 bucket. Some of these pipelines write every minute and some every 5 minutes. Currently, we have a batch application that runs every hour and processes these files.

The business wants data to be available every 5 minutes. Instead of running batch jobs every 5 minutes, we decided to use Apache Spark Structured Streaming and process the data in real time. My question is: how easy or difficult is it to productionise this solution?

My only worry is that if the checkpoint location gets corrupted, deleting the checkpoint directory will re-process data going back a full year. Has anyone productionised a solution on S3 with Spark Structured Streaming, or do you think Flink is better suited for this use case?
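
For reference, this is roughly the kind of streaming job we have in mind; the bucket names, paths, schema, and trigger interval below are placeholders rather than our actual setup:

```python
# Sketch of the planned Structured Streaming job (placeholder names throughout).
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DoubleType, TimestampType)

spark = SparkSession.builder.appName("s3-csv-streaming").getOrCreate()

# The streaming file source requires an explicit schema.
schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("account_id", StringType()),
    StructField("amount", DoubleType()),
])

raw = (spark.readStream
       .format("csv")
       .schema(schema)
       .option("header", "true")
       .option("maxFilesPerTrigger", 100)   # cap how many new files each micro-batch picks up
       .load("s3a://my-ingest-bucket/incoming/"))

query = (raw.writeStream
         .format("parquet")
         .option("path", "s3a://my-processed-bucket/output/")
         .option("checkpointLocation", "s3a://my-processed-bucket/checkpoints/ingest/")
         .trigger(processingTime="5 minutes")
         .start())

query.awaitTermination()
```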

If you think there is a better architecture/pattern for this problem, kindly point me in the right direction.

PS: We already considered putting these files into Kafka but ruled it out because of bandwidth constraints and the large size of the files.


Solution

  • I found a way to do this, though it is not the most effective one. Since we had already productionised a Kafka-based solution before, we could push an event into Kafka using S3 event notifications and a Lambda function. The event contains only metadata such as the file location and size (a sketch of such a handler is at the end of this answer).

    This makes the Spark program a bit more challenging, because each file is read and processed inside an executor, which effectively does not use Spark's distributed processing. Alternatively, read the file in the executor and bring the data back to the driver so Spark's distributed processing can be used. Either way, the Spark app has to be planned a lot more carefully in terms of memory, because input file sizes vary a lot (see the second sketch below for one possible way to wire the consumer side).

    https://databricks.com/blog/2019/05/10/how-tilting-point-does-streaming-ingestion-into-delta-lake.html
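
A rough sketch of the Lambda side, assuming S3 ObjectCreated notifications trigger the function and the kafka-python client is packaged with it; the broker address and topic name are placeholders:

```python
# Hypothetical Lambda handler: S3 "ObjectCreated" notifications trigger it,
# and it forwards only file metadata (no file contents) to Kafka.
import json
from kafka import KafkaProducer  # kafka-python, assumed to be bundled with the Lambda

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker-1:9092"],           # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handler(event, context):
    # One S3 notification can carry several records.
    for record in event["Records"]:
        meta = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
            "size": record["s3"]["object"]["size"],
            "event_time": record["eventTime"],
        }
        producer.send("s3-file-events", value=meta)       # placeholder topic name
    producer.flush()
```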
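
And one possible way to wire the Spark side: stream the metadata events from Kafka, then in each micro-batch collect the (small) list of file paths on the driver and re-read those files with a normal batch read, so the actual CSV parsing stays distributed across the cluster. Broker, topic, schema, and output paths are again placeholders:

```python
# Hypothetical consumer side: Spark streams the metadata events from Kafka,
# then each micro-batch re-reads the referenced files as a normal batch read.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("s3-metadata-driven-ingest").getOrCreate()

meta_schema = StructType([
    StructField("bucket", StringType()),
    StructField("key", StringType()),
    StructField("size", LongType()),
    StructField("event_time", StringType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka-broker-1:9092")   # placeholder
          .option("subscribe", "s3-file-events")                      # placeholder topic
          .load()
          .select(from_json(col("value").cast("string"), meta_schema).alias("m"))
          .select("m.*"))

def process_batch(batch_df, batch_id):
    # The metadata rows are tiny, so collecting them on the driver is cheap;
    # the heavy CSV parsing is then fanned out again by spark.read.
    paths = [f"s3a://{r['bucket']}/{r['key']}" for r in batch_df.collect()]
    if not paths:
        return
    files = spark.read.option("header", "true").csv(paths)
    files.write.mode("append").parquet("s3a://my-processed-bucket/output/")

query = (events.writeStream
         .foreachBatch(process_batch)
         .option("checkpointLocation", "s3a://my-processed-bucket/checkpoints/metadata-ingest/")
         .start())

query.awaitTermination()
```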