Tags: apache-spark, apache-kafka, databricks, spark-structured-streaming

Trigger.Once Spark Structured Streaming with KAFKA offsets and writing to KAFKA continues


  • When using Spark Structured Streaming with Trigger.Once to process KAFKA input, and KAFKA is simultaneously being written to while the Trigger.Once invocation is running:

    • will the Trigger.Once invocation see the newer KAFKA records written during the current invocation?
    • or will those records not be seen until the next Trigger.Once invocation?

Solution

  • From the manuals: it processes all new data available as a single micro-batch. See the excerpt below.

    Configuring incremental batch processing: Apache Spark provides the .trigger(once=True) option to process all new data from the source directory as a single micro-batch. This trigger once pattern ignores all settings that control streaming input size, which can lead to massive spill or out-of-memory errors.

    Databricks supports trigger(availableNow=True) in Databricks Runtime 10.2 and above for Delta Lake and Auto Loader sources. This functionality combines the batch processing approach of trigger once with the ability to configure batch size, resulting in multiple parallelized batches that give greater control for right-sizing batches and the resultant files.
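The two trigger modes described above can be sketched as a minimal PySpark job that reads from one KAFKA topic and writes to another. This is an illustrative sketch, not code from the question: the topic names, broker address, and checkpoint path are placeholder assumptions.

```python
# Sketch: Trigger.Once-style incremental batch processing of a Kafka source.
# Broker address, topic names, and checkpoint path are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-trigger-once").getOrCreate()

# The offsets to process are fixed when the micro-batch starts, so records
# written to the topic while this invocation runs are handled per the
# semantics discussed above.
source = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "input-topic")                 # placeholder topic
    .load()
)

query = (
    source.selectExpr("CAST(key AS STRING) AS key",
                      "CAST(value AS STRING) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "output-topic")                    # placeholder topic
    .option("checkpointLocation", "/tmp/chk/kafka-once")
    .trigger(once=True)            # one micro-batch of all new data
    # .trigger(availableNow=True)  # newer alternative: same "process what is
    #                              # available, then stop" semantics, but it
    #                              # respects batch-size settings such as
    #                              # maxOffsetsPerTrigger
    .start()
)

query.awaitTermination()
```

Note that trigger(availableNow=True) is documented for Delta Lake and Auto Loader sources on Databricks Runtime 10.2+; for plain KAFKA sources its availability depends on the Spark/runtime version in use.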