I'm analyzing the backpressure feature in Spark Structured Streaming. Does anyone know the details? Is it possible to tune the rate at which incoming records are processed from code? Thanks
If you mean dynamically changing the size of each internal batch in Structured Streaming, then no. There are no receiver-based sources in Structured Streaming, so that mechanism isn't necessary. From another point of view, Structured Streaming cannot do real backpressure, because, for example, Spark cannot tell other applications to slow down the speed at which they push data into Kafka.
Generally, Structured Streaming will try to process data as fast as possible by default. However, each source has options that let you cap the processing rate, such as `maxFilesPerTrigger` in the File source and `maxOffsetsPerTrigger` in the Kafka source. Read the following links for more details:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
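For illustration, here is a minimal sketch of setting `maxOffsetsPerTrigger` on a Kafka source; the bootstrap server, topic name, and checkpoint path are placeholders, not values from your setup:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("RateLimitedKafkaStream")
  .getOrCreate()

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092") // placeholder broker
  .option("subscribe", "my-topic")                 // placeholder topic
  // Cap each micro-batch at 10,000 offsets in total, spread
  // proportionally across the topic's partitions.
  .option("maxOffsetsPerTrigger", 10000)
  .load()

// The File source has an analogous knob, maxFilesPerTrigger,
// which limits how many new files are read per micro-batch:
// spark.readStream.format("json")
//   .option("maxFilesPerTrigger", 100)
//   .load("/path/to/input")

val query = df.writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoint") // placeholder path
  .start()

query.awaitTermination()
```

Note that this only throttles how much data each trigger pulls in; it doesn't signal upstream producers to slow down, for the reason explained above.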