When using Spark Structured Streaming with Trigger.Once to process Kafka input: if Kafka is being written to simultaneously while a Trigger.Once invocation is running, will that invocation see the newer Kafka records written during the current invocation?
From the manuals: it processes all available data. See below.
Configuring incremental batch processing

Apache Spark provides the .trigger(once=True) option to process all new data from the source directory as a single micro-batch. This trigger-once pattern ignores all settings that control streaming input size, which can lead to massive spill or out-of-memory errors.
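As a minimal sketch of that trigger-once pattern against a Kafka source (broker address, topic name, and paths are placeholders; assumes the Kafka connector and Delta Lake are available in the environment):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-trigger-once").getOrCreate()

# Read from Kafka; the offsets to process are resolved against the
# query's checkpoint when the micro-batch is planned.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
      .option("subscribe", "events")                     # placeholder topic
      .load())

# trigger(once=True) plans one micro-batch covering all new data,
# then stops; input-size settings such as maxOffsetsPerTrigger are ignored.
query = (df.writeStream
         .format("delta")
         .option("checkpointLocation", "/chk/events")    # placeholder path
         .trigger(once=True)
         .start("/tables/events"))                       # placeholder path

query.awaitTermination()
```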
Databricks supports trigger(availableNow=True) in Databricks Runtime 10.2 and above for Delta Lake and Auto Loader sources. This functionality combines the batch processing approach of trigger once with the ability to configure batch size, resulting in multiple parallelized batches that give greater control for right-sizing batches and the resultant files.
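A sketch of the same batch-style run using availableNow, per the quoted docs, against a Delta Lake source (paths and the rate-limit value are placeholders; assumes Databricks Runtime 10.2+):

```python
# Same consume-everything-then-stop behavior, but availableNow honors
# rate limits, so the backlog is split into size-bounded micro-batches.
df = (spark.readStream
      .format("delta")
      .option("maxBytesPerTrigger", "1g")  # rate limit honored by availableNow
      .load("/tables/raw_events"))         # placeholder source table

query = (df.writeStream
         .format("delta")
         .option("checkpointLocation", "/chk/silver")    # placeholder path
         .trigger(availableNow=True)
         .start("/tables/silver_events"))                # placeholder path

query.awaitTermination()
```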