I am using the Spark RateStreamSource to generate a large volume of rows per second for a performance test.
To verify that I actually get the throughput I want, I set the rowPerSecond option to a high number, 100000:
df = (
    spark.readStream.format("rate")
    .option("rowPerSecond", 100000)
    .option("numPartitions", 100)
    .load()
)
However, when I run my PySpark script locally, row generation is very slow: only a couple of rows per second.
I printed out the result, and as you can see from the log extract below, the row count is only 142 after about a minute:
Row content: Row(timestamp=datetime.datetime(2021, 12, 6, 23, 36, 15, 16000), value=142)
So my question is: why is the rate source generating rows so slowly, and how do I get the throughput I configured?
You have a typo in the option name: it should be rowsPerSecond, not rowPerSecond. Spark silently ignores unrecognized data source options, so the rate source falls back to its default of 1 row per second, which matches the slow generation you are seeing.
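A minimal corrected version of your snippet is sketched below (assuming an existing SparkSession named spark; the console sink and trigger settings are just illustrative choices for verifying throughput locally):

```python
# Rate source with the correctly spelled option names.
# "rowsPerSecond" (plural) controls throughput; the misspelled
# "rowPerSecond" is silently ignored and the default of 1 row/s is used.
df = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 100000)   # note the "s"
    .option("numPartitions", 100)
    .load()
)

# Illustrative sanity check: print batch row counts to the console
# so you can confirm roughly 100000 rows arrive per second.
query = (
    df.writeStream.format("console")
    .option("numRows", 5)
    .start()
)
query.awaitTermination()
```

If ramp-up still makes the first batches look slow, the rate source also accepts a rampUpTime option that controls how long it takes to reach the configured rate.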