Tags: python, optimization, apache-spark, left-join, query-optimization

How to accelerate leftOuterJoin in Spark?


I run a job in Spark, and the leftOuterJoin has become the bottleneck for the whole job, so I need to optimize it.
It is a leftOuterJoin between two datasets of about 2 million records each, and it is taking about 8 minutes to compute:

leftOuterJoin at :26
Submitted 2015/07/28 04:38:16, duration 8.3 min, tasks 7/7
152.7 MB / 50.5 MB / 278.5 MB
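
For context, a minimal sketch of the kind of join involved, assuming both inputs are keyed pair RDDs built from text files (the file names and the split-based key extraction are placeholders, not the actual job):

    val left  = sc.textFile("left.txt").map(line => (line.split(",")(0), line))
    val right = sc.textFile("right.txt").map(line => (line.split(",")(0), line))

    // Without co-partitioning, Spark shuffles both sides to compute the join.
    val joined = left.leftOuterJoin(right)   // RDD[(String, (String, Option[String]))]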


Solution

  • Have you used partitionBy and persist on your RDDs?

    To improve performance, use partitionBy and persist on the left RDD (the left side of the left outer join). Note that partitionBy is only defined on pair (key, value) RDDs and takes a Partitioner rather than a bare partition count, so the input loaded with sc.textFile has to be mapped to key/value pairs first; a fuller sketch follows after this answer.

    Sample code:

    // keyedLeft stands for the left input after it has been mapped to (key, value) pairs
    val leftRDD = keyedLeft.partitionBy(new org.apache.spark.HashPartitioner(numPartitions)).persist()

    numPartitions: depends on your cluster hardware, typically a small multiple of the total number of cores (for example, on a 4-core machine, numPartitions = 8).
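
Putting the suggestion together, here is a hedged sketch that co-partitions both sides with the same partitioner before the join, so leftOuterJoin does not need an extra shuffle; the file names, the split-based keying, and the partition count of 8 are assumptions, not values from the original job:

    import org.apache.spark.HashPartitioner

    val partitioner = new HashPartitioner(8)   // e.g. 2x the cores on a 4-core machine

    // Key both sides, co-partition them with the same partitioner,
    // and persist the left side so its partitioning is reused across stages.
    val left = sc.textFile("left.txt")
      .map(line => (line.split(",")(0), line))
      .partitionBy(partitioner)
      .persist()

    val right = sc.textFile("right.txt")
      .map(line => (line.split(",")(0), line))
      .partitionBy(partitioner)

    // Both RDDs already share the partitioner, so the join itself adds no extra shuffle.
    val joined = left.leftOuterJoin(right)

The shuffle still happens once when each side is first partitioned; the gain is that the join reuses that partitioning instead of shuffling the data again, and persist() avoids recomputing the partitioned left side if it is used more than once.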