Search code examples
apache-sparkjoin

Avoid Broadcast nested loop join


When I join two dataframes using left join like this:

df1.join(broadcast(df2), $"id" === $"id1" || $"id2" === $"id3", "left")

Without $"id2" === $"id3", it executes very quickly but when both conditions are present, it becomes BroadcastNestedLoopJoin and becomes very very slow.

Any ideas how I can improve this?


Solution

  • BroadcastNestedLoopJoin means nested for-loops to join your data-frames. It will always give a degraded performance.

    Can you try below solution:

    val resultPart1 = df1.join(broadcast(df2), $"id" === $"id1", "left")
    val resultPart2 = df1.join(broadcast(df2),  $"id2" === $"id3", "left")
    
    val resultDF = resultPart1.unionByName(resultPart2)
    

    Union causes zero shufflings of data across executors.Hence produces faster results