apache-spark pyspark apache-spark-sql performance-testing spark-structured-streaming

Spark streaming rate source generates rows too slowly


I am using Spark's RateStreamSource to generate a large volume of data per second for a performance test.

To check that I actually get the amount of concurrency I want, I have set the rowPerSecond option to a high number (100000):

    df = (
        spark.readStream.format("rate")
        .option("rowPerSecond", 100000)
        .option("numPartitions", 100)
        .load()
    )

However, when I run my PySpark script locally, row generation is very slow (less than 1 row per second).

I printed the result. As the log extract below shows, the row count is only 142 after about a minute:

Row content: Row(timestamp=datetime.datetime(2021, 12, 6, 23, 36, 15, 16000), value=142)

So my question is:

  • Why is the rate source not working as I expected? Does it have anything to do with running it locally?
  • How can I increase the concurrency of my Spark job when running locally?

Solution

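  • You have a typo in the option name: it should be rowsPerSecond, not rowPerSecond. The misspelled option is not recognized, so the source falls back to its default rate of 1 row per second.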
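
As a minimal sketch of the fix, the snippet below uses the corrected rowsPerSecond option and adds a small guard against misspelled option names. The helpers unknown_rate_options, build_rate_stream and run_demo are hypothetical names introduced here for illustration, not part of Spark. The set of valid names is taken from the documented rate source options, and the case-insensitive comparison assumes Spark treats option keys case-insensitively. The local[*] master and the console sink are also assumptions about how the test is run locally.

```python
# Option names documented for Spark's built-in "rate" source.
RATE_SOURCE_OPTIONS = {"rowsPerSecond", "rampUpTime", "numPartitions"}


def unknown_rate_options(options):
    """Return option names the rate source would not recognize (hypothetical helper)."""
    # Assumption: Spark matches option keys case-insensitively, so compare in lower case.
    known = {name.lower() for name in RATE_SOURCE_OPTIONS}
    return sorted(name for name in options if name.lower() not in known)


def build_rate_stream(spark, options):
    """Create a rate-source streaming DataFrame, failing fast on misspelled options."""
    unknown = unknown_rate_options(options)
    if unknown:
        raise ValueError(f"Unknown rate source option(s): {unknown}")
    return spark.readStream.format("rate").options(**options).load()


def run_demo():
    """Run the corrected stream locally for about 30 seconds (requires pyspark)."""
    from pyspark.sql import SparkSession

    # Assumption: local[*] lets the local job use all available cores.
    spark = SparkSession.builder.master("local[*]").appName("rate-test").getOrCreate()
    df = build_rate_stream(spark, {"rowsPerSecond": 100000, "numPartitions": 100})
    query = df.writeStream.format("console").start()
    query.awaitTermination(30)  # timeout in seconds
    query.stop()
    spark.stop()
```

Calling run_demo() starts the stream with the corrected option. Passing the original {"rowPerSecond": 100000, ...} to build_rate_stream raises a ValueError instead of quietly running at the default rate.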