java, apache-spark, rdd

Where exactly does the raw Java code execute in Spark?


I'm new to Spark, and I'm aware that Spark usually serializes the functions and sends them to all the executors, which work on the blocks of data available in HDFS. But if I have the following code,

Random random = new Random(); //statement A
int randomValue = random.nextInt(); //statement B
JavaPairRDD<String, Integer> pairRDD = mapRandom.mapToPair(s -> {
    return new Tuple2<>(s, randomValue);
});

Where exactly do statements A and B run? They definitely don't execute in each executor, because each executor runs its own JVM (and would therefore compute its own value), yet I found that each value (s) is mapped to the exact same randomValue.

Does this mean statements A and B run in the driver, and the resulting value gets copied to all executors?


Solution

  • The RDD is distributed (hence the name), but it is defined in the driver code, so as a rule of thumb any code that does not interact with the RDD contents executes on the driver, while any code that works on the RDD contents runs on the executors.

    As you noticed, the driver computes your randomValue once and sends it to all executors, because the lambda you pass to mapToPair closes over that already computed value.

    Moving (just) the random.nextInt() invocation inside the lambda would make it execute for each element in your distributed collection. In that case the Random instance itself (not a precomputed value) is serialised and sent over the wire; see the first sketch below.

    Moving the creation of the Random instance itself inside the lambda would make the closure smaller (no external state to capture), but it would create a new Random instance for every element in the distributed collection, which is clearly suboptimal.

    To have a single Random instance per executor, you can make it a static member (or a lazily initialised singleton) that gets created once per JVM/executor. To get a different random value per (say) partition, you should probably use mapPartitionsToPair or a similar per-partition operation, generate a new value with nextInt() once per partition and use that value for all the elements in that partition; see the second sketch below.

    Have a look at the higher-level DataFrame/SQL APIs as well, as you may find it a lot easier to implement what you want without even worrying about where your code executes; see the last sketch below.
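
    To make the first points concrete, here is a minimal sketch (assuming Spark 2.x, a local JavaSparkContext and made-up data, since the question does not show how mapRandom is built) contrasting the question's variant, where nextInt() runs once on the driver, with the variant where the call is moved inside the lambda:

    import java.util.Arrays;
    import java.util.Random;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class WhereDoesItRun {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local[2]", "where-does-it-run");
            JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "c", "d"));

            Random random = new Random();          // runs on the driver
            int randomValue = random.nextInt();    // runs on the driver, once

            // Variant 1 (the question's code): only the already computed int is
            // captured by the closure, so every element gets the same value.
            JavaPairRDD<String, Integer> sameValueForAll =
                    words.mapToPair(s -> new Tuple2<>(s, randomValue));

            // Variant 2: nextInt() moved inside the lambda. Now the Random
            // instance itself is captured and serialised (java.util.Random is
            // Serializable), and a value is generated per element on the
            // executors. Note that every task deserialises its own copy of the
            // same seeded Random, so different partitions may repeat sequences.
            JavaPairRDD<String, Integer> valuePerElement =
                    words.mapToPair(s -> new Tuple2<>(s, random.nextInt()));

            System.out.println(sameValueForAll.collect());
            System.out.println(valuePerElement.collect());
            sc.stop();
        }
    }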
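
    For a single Random per executor combined with one value per partition, a sketch along these lines (again assuming the Spark 2.x Java API, where the function passed to mapPartitionsToPair returns an Iterator; the class and method names here are made up) could look like:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;

    import scala.Tuple2;

    public final class PerPartitionRandom {
        // One Random per executor: the static field is initialised once per JVM
        // (i.e. once per executor) when the class is loaded there, and is then
        // shared by all tasks running in that JVM.
        private static final Random EXECUTOR_RANDOM = new Random();

        public static JavaPairRDD<String, Integer> tagWithPartitionValue(JavaRDD<String> words) {
            return words.mapPartitionsToPair(it -> {
                // Draw one value per partition and reuse it for every element.
                int partitionValue = EXECUTOR_RANDOM.nextInt();
                List<Tuple2<String, Integer>> out = new ArrayList<>();
                while (it.hasNext()) {
                    out.add(new Tuple2<>(it.next(), partitionValue));
                }
                return out.iterator();
            });
        }
    }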
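
    And if you go through the DataFrame/SQL APIs, the built-in rand() function already does the per-row generation on the executors for you; a minimal, hypothetical example:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    import static org.apache.spark.sql.functions.rand;

    public class DataFrameRandom {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .master("local[2]")
                    .appName("dataframe-random")
                    .getOrCreate();

            // rand() is evaluated on the executors, one value per row; you no
            // longer have to think about where the generation happens.
            Dataset<Row> withRandom = spark.range(0, 10)
                    .withColumn("randomValue", rand());

            withRandom.show();
            spark.stop();
        }
    }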