I want to write a Spark job that produces millions of random numbers as output. The job does not need an input dataset, but I would still like to use the parallelism of a cluster.
I understand that Spark runs on RDDs, which are datasets by definition. I am just wondering whether there is a way to force many executors to run a specific function without an input RDD, or by creating a mock RDD.
import scala.util.Random

sc.parallelize(Seq(1000, 1000, 1000))
  .repartition(3)                                            // spread the three counts over three tasks
  .flatMap(count => (1 to count).map(_ => Random.nextInt())) // each element yields `count` random ints
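For reference, here is a minimal self-contained sketch of the same pattern as a full job (the partition count, per-partition count, output path, and app name are all hypothetical placeholders, not taken from the question). It creates one dummy element per partition and lets each task generate its own batch with an independent Random, so the tasks do not all contend on the scala.util.Random singleton:

import scala.util.Random

import org.apache.spark.sql.SparkSession

object RandomNumbers {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RandomNumbers").getOrCreate()
    val sc = spark.sparkContext

    val numPartitions = 3            // hypothetical: one task per executor core
    val countPerPartition = 1000000  // hypothetical: 3 million numbers in total

    // One count per partition; each task produces its own batch of random ints.
    val randoms = sc
      .parallelize(Seq.fill(numPartitions)(countPerPartition), numPartitions)
      .mapPartitions { counts =>
        val rng = new Random()       // independent generator per task
        counts.flatMap(count => Iterator.fill(count)(rng.nextInt()))
      }

    randoms.saveAsTextFile("hdfs:///tmp/random-ints")  // hypothetical output path
    spark.stop()
  }
}

The same idea works with sc.range(0, numPartitions) plus mapPartitions if you prefer not to build the Seq on the driver; either way the RDD exists only to give Spark something to split into tasks.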