
How to use Spark RDD to make batch submit?


I have an RDD with many items; simplified, it looks like:

[0,1,2,3,4,5,6,7,8,9]

I want to submit those items to a batch API (`API.post(a[])`), but the API limits the maximum batch size (e.g. 3). So for the best performance, I need to transform the RDD into arrays of at most that size:

[[0,1,2], [3,4,5], [6,7,8], [9]]

I use the Spark Java API to push the data:

rdd.foreach(a -> API.post(a));

My question is how to transform it?


Solution

  • To be clear, there is no single RDD iterator — there is one iterator per partition. To access them, use foreachPartition, and then batch each partition's iterator with plain old Java iterator operations. Here is a solution using the Spark Java API http://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/api/java/JavaRDD.html#foreachPartition-org.apache.spark.api.java.function.VoidFunction- and Guava's Iterators.partition:

    rdd.foreachPartition(it -> 
      Iterators.partition(it, batchSize)
               .forEachRemaining(API::post));
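
    If Guava is not on the classpath, the same batching can be done with a small helper over a plain java.util.Iterator. This is a minimal sketch (BatchIterator and batches are hypothetical names, not part of Spark or Guava), shown eagerly for clarity — Guava's Iterators.partition does the equivalent lazily:

    ```java
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    public class BatchIterator {

        // Drain an iterator into consecutive lists of at most batchSize
        // elements; the last batch may be smaller.
        public static <T> List<List<T>> batches(Iterator<T> it, int batchSize) {
            List<List<T>> result = new ArrayList<>();
            while (it.hasNext()) {
                List<T> batch = new ArrayList<>(batchSize);
                for (int i = 0; i < batchSize && it.hasNext(); i++) {
                    batch.add(it.next());
                }
                result.add(batch);
            }
            return result;
        }

        public static void main(String[] args) {
            List<Integer> data = List.of(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
            // [0..9] with batchSize 3 -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
            System.out.println(batches(data.iterator(), 3));
        }
    }
    ```

    Inside foreachPartition you would call such a helper on the partition's iterator and post each resulting list, instead of relying on Guava.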