Search code examples
apache-sparkapache-spark-sql

Meaning of Exchange in Spark Stage


Can anyone explain me the meaning of exchange in my spark stages in spark DAG. Most of my stages either starts or end in exchange.

1). WholeStageCodeGen -> Exchange 2). Exchange -> WholeStageCodeGen -> SortAggregate -> Exchange


Solution

  • Whole-stage code generation is a technique inspired by modern compilers to collapse the entire query into a single function.

    Prior to whole-stage code generation, each physical plan is a class with the code defining the execution. With whole-stage code generation, all the physical plan nodes in a plan tree work together to generate Java code in a single function for execution. This Java code is then turned into JVM bytecode using Janino, a fast Java compiler. Then JVM JIT kicks in to optimize the bytecode further and eventually compiles them into machine instructions.

    For example

    == Physical Plan ==
    *Project [id#27, token#28, token#6]
    +- *SortMergeJoin [id#27], [id#5], Inner
       :- *Sort [id#27 ASC NULLS FIRST], false, 0
       :  +- Exchange hashpartitioning(id#27, 200)
    

    Where ever you see *, it means that wholestagecodegen has generated hand-written code before the aggregation. Exchange means the Shuffle Exchange between jobs. Exchange does not have whole-stage code generation because it sends data across the network.