
How to force Spark to perform reduction locally


I'm looking for a way to force Spark to perform the reduction locally across all the tasks executed by a worker's cores before sending anything to the driver. It seems my driver node and the network bandwidth are overloaded because the task results are large (~400 MB).

val arg0 = sc.broadcast(fs.read(0, 4))
val arg1 = sc.broadcast(fs.read(1, 4))
val arg2 = fs.read(5, 4) 
val index = sc.parallelize(0L until 10000L)
val mapres = index.map{ x => function(arg0.value, arg1.value, x, arg2) }
val output = mapres.reduce(Util.bitor)

The driver distributes 1 partition per processor core, so 8 partitions per worker.


Solution

  • There is nothing to force, because reduce already applies the reduction locally within each partition. Only the final merge of the per-partition results is performed on the driver. Not to mention that 400 MB shouldn't be a problem in any sensible configuration.

    Still, if you want to perform more of the merging on the workers, you can use treeReduce, although with only 8 partitions there is almost nothing to gain.
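    To see why there is nothing to force, here is a minimal sketch in plain Scala (no Spark, so it runs anywhere) of the semantics reduce already has: each partition is folded locally, and only one small value per partition reaches the driver for the final merge. The bitor helper stands in for Util.bitor from the question, and the 8-way split mirrors the 8 partitions per worker; both are assumptions for illustration.

    ```scala
    object ReduceSketch {
      // Stand-in for Util.bitor from the question (hypothetical helper).
      def bitor(a: Long, b: Long): Long = a | b

      def main(args: Array[String]): Unit = {
        val data = (0L until 10000L).toVector

        // Simulate 8 partitions, as in the question's setup.
        val partitions = data.grouped(data.size / 8 + 1).toVector

        // Worker-side: each partition is reduced locally to a single Long.
        val partials = partitions.map(_.reduce(bitor))

        // Driver-side: only the 8 tiny partial results are merged.
        val result = partials.reduce(bitor)

        // Same answer as reducing the whole dataset at once.
        assert(result == data.reduce(bitor))
        println(result)
      }
    }
    ```

    treeReduce changes only the last step: the partial results are combined in extra worker-side rounds instead of all at once on the driver, which matters when there are many partitions, not 8.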