Search code examples
apache-sparkamazon-emr

Spark Performance issue while adding more worker nodes


I am being new on Spark. I am facing performance issue when the number of worker nodes are increased. So to investigate that, I have tried some sample code on spark-shell.

I have created a Amazon AWS EMR with 2 worker nodes (m3.xlarge). I have used the following code on spark-shell on the master node.

var df = sqlContext.range(0,6000000000L).withColumn("col1",rand(10)).withColumn("col2",rand(20))
df.selectExpr("id","col1","col2","if(id%2=0,1,0) as key").groupBy("key").agg(avg("col1"),avg("col2")).show()

This code executed without any issues and took around 8 mins. But when I have added 2 more worker nodes (m3.xlarge) and executed the same code using spark-shell on master node, the time increased to 10 mins.

Here is the issue, I think the time should be decreased, not by half, but I should decrease. I have no idea why on increasing worker node same spark job is taking more time. Any idea why this is happening? Am I missing any thing?


Solution

  • @z-star, yes an algorithm might b slow when distributed. I found the solution by using Spark Dynamic Allocation. This enable spark to use only required executors. While the static allocation runs a job on all executors, which was increasing the execution time with more nodes.