When I join two dataframes using left join like this:
df1.join(broadcast(df2), $"id" === $"id1" || $"id2" === $"id3", "left")
Without $"id2" === $"id3"
, it executes very quickly but when both conditions are present, it becomes BroadcastNestedLoopJoin
and becomes very very slow.
Any ideas how I can improve this?
BroadcastNestedLoopJoin means nested for-loops to join your data-frames. It will always give a degraded performance.
Can you try below solution:
val resultPart1 = df1.join(broadcast(df2), $"id" === $"id1", "left")
val resultPart2 = df1.join(broadcast(df2), $"id2" === $"id3", "left")
val resultDF = resultPart1.unionByName(resultPart2)
Union causes zero shufflings of data across executors.Hence produces faster results