Tags: scala, databricks, spark-structured-streaming, amazon-kinesis, aws-databricks

Calling Trigger once in Databricks to process Kinesis Stream


I am looking for a way to trigger my Databricks notebook once to process a Kinesis stream, using the following pattern:

    import org.apache.spark.sql.streaming.Trigger

    // Load your streaming DataFrame
    val sdf = spark.readStream.format("json").schema(my_schema).load("/in/path")

    // Perform transformations and then write…
    sdf.writeStream.trigger(Trigger.Once).format("delta").start("/out/path")

It looks like this isn't possible with AWS Kinesis, and that's what the Databricks documentation suggests as well. My question is: what else can we do to achieve this?


Solution

  • As you mentioned in the question, Trigger.Once isn't supported for Kinesis.

    But you can achieve what you need by adding Kinesis Data Firehose into the picture. Firehose will write the data from Kinesis into an S3 bucket (you can select the format you need, such as Parquet or ORC, or just leave it as JSON). You can then point the streaming job at that bucket and use Trigger.Once for it, as it's a normal file-based streaming source (for efficiency it's better to use Auto Loader, which is available on Databricks; see the sketch below). Also, to keep costs under control, you can set up a retention policy on your S3 destination to remove or archive files after some period of time, such as a week or a month.
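
    As a minimal sketch of that second step: assuming Firehose delivers JSON files to a hypothetical bucket s3://my-firehose-bucket/prefix/ (the schema, checkpoint, and output paths below are placeholders as well), the Auto Loader job with Trigger.Once could look like this:

    import org.apache.spark.sql.streaming.Trigger

    // Read the files that Kinesis Data Firehose delivered to S3.
    // Auto Loader (format "cloudFiles") discovers new files incrementally.
    val sdf = spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json") // match the format chosen in Firehose
      .schema(my_schema)
      .load("s3://my-firehose-bucket/prefix/")

    // Trigger.Once processes everything that arrived since the last run and
    // then stops; the checkpoint records which files were already consumed.
    sdf.writeStream
      .trigger(Trigger.Once)
      .option("checkpointLocation", "/checkpoints/kinesis-batch")
      .format("delta")
      .start("/out/path")

    Because the checkpoint tracks processed files, each scheduled run picks up only the data that Firehose delivered since the previous run, which is exactly the run-once-and-stop behavior the question asks for.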