apache-spark, spark-structured-streaming

How to set number of partitions for structured streaming?


val jsonDF = spark.readStream.format("json").schema(schema).load("source")
val result = jsonDF.groupBy("origin").sum("value")
val query = result.writeStream.outputMode("complete").format("console").start()

The number of shuffle partitions is always 200.

The documentation doesn't say much about this: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
It only mentions that the partition count can be set for the rate source, via the 'numPartitions' option.

I have tried to set it on the readStream in the following ways:

val jsonDF = spark.readStream.format("json").schema(schema).option("spark.sql.shuffle.partitions", 6).load("source")
val jsonDF = spark.readStream.format("json").schema(schema).option("numPartitions", 6).load("source")

Neither has any effect: the partition count stays at 200, and because of this even a simple query is painfully slow.


Solution

  • The config has to be set at the Spark session level, not at the query level.
    The documentation was unclear about this.

    spark.conf.set("spark.sql.shuffle.partitions",6)
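Putting this together, the question's example could look like the sketch below. It sets `spark.sql.shuffle.partitions` on the session before the streaming query starts; the app name, source path, and `schema` definition are placeholders standing in for the question's own values.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StringType, LongType}

// Set the shuffle partition count on the session *before* starting the query.
// This is a session-level SQL config, so .option(...) on the reader has no effect.
val spark = SparkSession.builder()
  .appName("streaming-example")                    // placeholder app name
  .config("spark.sql.shuffle.partitions", "6")     // applies to all shuffles, incl. groupBy
  .getOrCreate()

// Placeholder schema matching the question's groupBy/sum columns.
val schema = new StructType()
  .add("origin", StringType)
  .add("value", LongType)

val jsonDF = spark.readStream.format("json").schema(schema).load("source")
val result = jsonDF.groupBy("origin").sum("value")

// The aggregation now shuffles into 6 partitions instead of the default 200.
val query = result.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```

One caveat worth knowing: for stateful streaming queries that restart from a checkpoint, Spark reuses the shuffle partition count recorded in the checkpoint, so this setting should be in place before the query's first run.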