streaming, spark-streaming, streamsets

Connecting Spark Streaming to StreamSets input


I was wondering whether it is possible to provide input to Spark Streaming from StreamSets. I noticed that Spark Streaming is not listed as a supported destination among the StreamSets connectors (https://streamsets.com/connectors/).

I am exploring whether there are other ways to connect them for a sample POC.


Solution

  • The best way to process data coming from StreamSets Data Collector (SDC) in Apache Spark Streaming is to write the data out to a Kafka topic and read it from there. This decouples Spark Streaming from SDC, so each can proceed at its own rate of processing.

    SDC microbatches are defined by record count, while Spark Streaming microbatches are defined by time. This means that an SDC batch may not (and probably will not) correspond to a Spark Streaming batch; most likely, a Spark Streaming batch will contain data from several SDC batches. SDC "commits" each batch once it has been sent to the destination, so writing directly to Spark Streaming would require each SDC batch to correspond to exactly one Spark Streaming batch to avoid data loss. It is also possible that Spark Streaming re-processes already committed batches after processing or node failures. SDC cannot re-process committed batches, so to recover from a situation like this you would have to write to something like Kafka, which lets you re-read the batches. A direct connector that writes from SDC to Spark Streaming would therefore be complex and would likely have data loss issues.

    In short, your best option would be SDC -> Kafka -> Spark Streaming.
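For the Spark side of SDC -> Kafka -> Spark Streaming, here is a minimal PySpark sketch. It uses the Structured Streaming Kafka source rather than the older DStream-based Spark Streaming API, since that is the Kafka integration exposed to Python in recent Spark releases. The helper names (`kafka_source_options`, `read_sdc_records`), the broker address, and the topic name are illustrative assumptions, not anything from the question. It also assumes the Spark job is launched with the `spark-sql-kafka-0-10` package available.

```python
from typing import Dict


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Build options for Spark's Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # On the first run, start from the oldest retained offset so that
        # records SDC wrote before the Spark job started are not skipped.
        "startingOffsets": "earliest",
    }


def read_sdc_records(spark, bootstrap_servers: str, topic: str):
    """Return a streaming DataFrame over the topic SDC writes to.

    `spark` is an existing pyspark.sql.SparkSession (assumed to be created
    by the caller).
    """
    raw = (
        spark.readStream
        .format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )
    # Kafka delivers the value as bytes; cast it to a string for parsing,
    # and keep partition/offset so reprocessed records can be identified.
    return raw.selectExpr("CAST(value AS STRING) AS value", "partition", "offset")
```

With a running broker, something like `read_sdc_records(spark, "localhost:9092", "sdc-output").writeStream.format("console").option("checkpointLocation", "/tmp/sdc-poc-checkpoint").start()` would print each microbatch. The checkpoint location is where Spark tracks the Kafka offsets it has read, so a restarted job can pick up from Kafka instead of asking SDC to resend anything.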
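The batch mismatch described in the answer can be illustrated with a small, self-contained simulation. This is a toy model and does not use SDC or Spark: it only groups the same records once by count and once by arrival time, then reports which count-based batches end up in each time-based batch. All names and numbers here are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Tuple[float, int]  # (arrival time in seconds, record id)


def count_batches(records: List[Record], batch_size: int) -> Dict[int, int]:
    """SDC-style batching: every `batch_size` records form one batch."""
    return {rec_id: i // batch_size for i, (_, rec_id) in enumerate(records)}


def time_batches(records: List[Record], interval_s: float) -> Dict[int, int]:
    """Spark-Streaming-style batching: one batch per time interval."""
    return {rec_id: int(t // interval_s) for t, rec_id in records}


def count_batches_per_time_batch(
    records: List[Record], batch_size: int, interval_s: float
) -> Dict[int, Set[int]]:
    """Map each time-based batch to the count-based batches it overlaps."""
    by_count = count_batches(records, batch_size)
    by_time = time_batches(records, interval_s)
    overlap: Dict[int, Set[int]] = defaultdict(set)
    for _, rec_id in records:
        overlap[by_time[rec_id]].add(by_count[rec_id])
    return dict(overlap)


# Ten records, one every 0.25 s; count batches of 3 vs. 1-second intervals.
records = [(i * 0.25, i) for i in range(10)]
result = count_batches_per_time_batch(records, batch_size=3, interval_s=1.0)
print(result)
```

For this input the result is `{0: {0, 1}, 1: {1, 2}, 2: {2, 3}}`: every one-second batch contains records from two count-based batches, and count-based batches 1 and 2 are each split across two time-based batches. A commit made per count-based batch therefore never lines up with a time-based batch boundary, which is the gap a replayable log such as Kafka covers.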