apache-spark, distributed-computing

Spark job with no input dataset


I want to write a Spark job that produces millions of random numbers as output. The job does not need an input dataset, but I would still like the parallelism of a cluster.

I understand that Spark runs on RDDs, which are datasets by definition. Is there a way to force many executors to run a specific function without having an RDD, or by creating a mock RDD?


Solution

    import scala.util.Random

    // Three partitions; each task emits `count` random ints.
    sc.parallelize(Seq(1000, 1000, 1000), numSlices = 3)
      .flatMap(count => Seq.fill(count)(Random.nextInt()))
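
To scale this to millions of values, the same idea works with sc.range, which builds the seed RDD lazily and spreads it across partitions, so nothing large is ever materialized on the driver. A minimal sketch, assuming a live SparkContext named sc (the sizes here are illustrative):

    import scala.util.Random

    // Ten million seeds across 100 partitions; each element is
    // mapped to one random int on whichever executor runs the task.
    val randoms = sc.range(0L, 10000000L, step = 1, numSlices = 100)
      .map(_ => Random.nextInt())

Because Random.nextInt() runs inside the task, the numbers are generated on the executors in parallel rather than shipped from the driver.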