val jsonDF = spark.readStream.format("json").schema(schema).load("source")
val result = jsonDF.groupBy("origin").sum("value")
val query = result.writeStream.outputMode("complete").format("console").start()
The number of shuffle partitions is always 200.
The documentation doesn't say much about this: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
It only mentions that the partition count can be set for the rate source, via the 'numPartitions' option.
I have tried to set it on the readStream in the following ways:
spark.readStream.format("json").schema(schema).option("spark.sql.shuffle.partitions", 6).load("source")
spark.readStream.format("json").schema(schema).option("numPartitions", 6).load("source")
Neither has any effect: the partition count stays at 200, and because of this even a simple query is painfully slow.
I had to set the config at the Spark session level, not at the query level; the documentation was unclear about this.
spark.conf.set("spark.sql.shuffle.partitions",6)
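Putting it together, a minimal sketch of the working setup (assuming the same `schema` value and a `source` directory of JSON files as in the question; the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-agg").getOrCreate()

// Set on the session, before starting the query; reader-level
// .option(...) does not control shuffle parallelism.
spark.conf.set("spark.sql.shuffle.partitions", 6)

val jsonDF = spark.readStream.format("json").schema(schema).load("source")
val result = jsonDF.groupBy("origin").sum("value")

val query = result.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```

One caveat: for a stateful streaming query that has already run, the shuffle partition count is recorded in the checkpoint, so changing this config only takes effect when starting with a fresh checkpoint location.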