apache-spark, spark-structured-streaming

How does the default (unspecified) trigger determine the size of micro-batches in Structured Streaming?


When a query in Spark Structured Streaming is started without an explicit trigger setting,

import org.apache.spark.sql.streaming.Trigger

// Default trigger (runs micro-batch as soon as it can)
df.writeStream
  .format("console")
  //.trigger(???) // <--- Trigger intentionally omitted ----
  .start()

As of Spark 2.4.3 (Aug 2019), the Structured Streaming Programming Guide - Triggers says:

If no trigger setting is explicitly specified, then by default, the query will be executed in micro-batch mode, where micro-batches will be generated as soon as the previous micro-batch has completed processing.

QUESTION: On what basis does the default trigger determine the size of the micro-batches?

Let's say the input source is Kafka and the job was interrupted for a day because of an outage. When the same Spark job is restarted, it will consume messages from where it left off. Does that mean the first micro-batch will be a gigantic batch with one day's worth of messages that accumulated in the Kafka topic while the job was stopped? Assuming the job takes 10 hours to process that big batch, will the next micro-batch then contain 10 hours' worth of messages? And so on for X iterations, until the backlog is caught up and the micro-batches shrink back to a small size?


Solution

  • On what basis does the default trigger determine the size of the micro-batches?

    It does not. Every trigger (however long the interval) simply requests input datasets from all the sources, and whatever they return is processed downstream by the operators. The sources know what to give because they keep track of what has been consumed (processed) so far.

    It is as if you asked about a batch structured query and the size of the data this single "trigger" requests to process (by the way, there is a one-time Trigger.Once() trigger for exactly that kind of one-shot run).
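    For comparison, here is a minimal sketch of such a one-shot run, reusing the df streaming DataFrame from the snippet in the question (the console sink is just an example):

    import org.apache.spark.sql.streaming.Trigger

    // One-shot run: a single trigger fires, the sources hand over whatever
    // data is available, that one micro-batch is processed, and the query stops.
    df.writeStream
      .format("console")
      .trigger(Trigger.Once())
      .start()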

    Does that mean the first micro-batch will be a gigantic batch with one day's worth of messages that accumulated in the Kafka topic while the job was stopped?

    Almost (and this really has little, if anything, to do with Spark Structured Streaming itself).

    The number of records the underlying Kafka consumer fetches in a single poll is configured by max.poll.records and perhaps by some other configuration properties (see Increase the number of messages read by a Kafka consumer in a single poll).

    Since Spark Structured Streaming uses the Kafka data source, which is simply a wrapper around the Kafka Consumer API, whatever happens in a single micro-batch is equivalent to a single Consumer.poll call.

    You can configure the underlying Kafka consumer using options with the kafka. prefix (e.g. kafka.bootstrap.servers); these are passed to the Kafka consumers on the driver and executors, as in the sketch below.
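    A minimal sketch, assuming spark is an active SparkSession and an illustrative topic name; whether a given kafka.-prefixed consumer property actually takes effect depends on the Spark and Kafka client versions:

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") // forwarded to the Kafka consumer
      .option("subscribe", "events")                                  // hypothetical topic name
      .option("kafka.max.poll.records", "500")                        // illustrative consumer property
      .load()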