apache-spark, optimization, code-generation, bytecode, jit

How does Spark convert bytecode to machine code instructions at runtime?


After reading some articles about Whole-Stage Code Generation, I understand that Spark does bytecode optimizations to convert a query plan to an optimized execution plan.

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-whole-stage-codegen.html

My next question: even after these bytecode-related optimizations, converting those bytecode instructions to machine code instructions could still be a bottleneck, because that conversion is done by the JIT alone while the process is running, and the JIT needs enough runs of the code before it performs this optimization.

So does Spark do anything related to dynamic/runtime conversion of optimized bytecode (which is the outcome of whole-stage codegen) to machine code, or does it rely on the JIT to convert those bytecode instructions to machine code instructions? If it relies on the JIT, then there are certain uncertainties involved.


Solution

  • spark does bytecode optimizations to convert a query plan to an optimized execution plan.

    Spark SQL does not do bytecode optimizations.

    Spark SQL simply uses the CollapseCodegenStages physical preparation rule and eventually converts a query plan into single-method Java source code (which Janino compiles into bytecode).
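To make "single-method Java source code" more concrete, here is a minimal hand-written sketch of the idea. It is not actual Spark output: the class and method names are hypothetical, and the "query" (keep even values, multiply by 10, sum) is just an assumption chosen for illustration. It contrasts a chain of per-operator iterators with the same logic collapsed into one loop in one method, which is roughly the shape that whole-stage codegen aims for.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical illustration only: none of these names come from Spark.
class FusedPipelineSketch {

    // Operator-at-a-time: every operator is its own iterator, so each row
    // passes through several hasNext()/next() calls.
    static long sumWithOperators(List<Integer> rows) {
        Iterator<Integer> filtered = filter(rows.iterator(), x -> x % 2 == 0);
        Iterator<Integer> projected = map(filtered, x -> x * 10);
        long sum = 0;
        while (projected.hasNext()) {
            sum += projected.next();
        }
        return sum;
    }

    // Fused: the same filter, projection and sum written as one loop
    // inside a single method.
    static long sumFused(List<Integer> rows) {
        long sum = 0;
        for (int x : rows) {
            if (x % 2 != 0) continue; // filter: keep even values only
            sum += x * 10;            // project (x * 10) and aggregate
        }
        return sum;
    }

    // Iterator that yields only the elements accepted by the predicate.
    static <T> Iterator<T> filter(Iterator<T> in, Predicate<T> p) {
        return new Iterator<T>() {
            private T pending;
            private boolean ready;

            public boolean hasNext() {
                while (!ready && in.hasNext()) {
                    T candidate = in.next();
                    if (p.test(candidate)) {
                        pending = candidate;
                        ready = true;
                    }
                }
                return ready;
            }

            public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                ready = false;
                return pending;
            }
        };
    }

    // Iterator that applies a function to every element.
    static <T, R> Iterator<R> map(Iterator<T> in, Function<T, R> f) {
        return new Iterator<R>() {
            public boolean hasNext() { return in.hasNext(); }
            public R next() { return f.apply(in.next()); }
        };
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println("operators: " + sumWithOperators(rows));
        System.out.println("fused:     " + sumFused(rows));
    }
}
```

Both methods compute the same result. The fused version is simply one method body, so whether it ends up as machine code is decided by the JVM, like for any other Java method.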

    So does spark do anything related to dynamic/runtime conversion of optimized bytecode

    No. Spark relies on the JVM's JIT to turn the generated bytecode into machine code.


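    Speaking of JIT, WholeStageCodegenExec checks whether whole-stage codegen generates "too long generated codes", that is, whether the compiled method's bytecode size exceeds the spark.sql.codegen.hugeMethodLimit internal Spark SQL property (8000 by default, which is the value of HugeMethodLimit in the OpenJDK JVM settings). By default, HotSpot does not JIT-compile methods whose bytecode is larger than HugeMethodLimit, so such a method would only ever run in the interpreter.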

    The maximum bytecode size of a single compiled Java function generated by whole-stage codegen. When the compiled function exceeds this threshold, the whole-stage codegen is deactivated for this subtree of the current query plan. The default value is 8000 and this is a limit in the OpenJDK JVM implementation.
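The rule described in that property text can be sketched as a simple threshold check. This is a simplified illustration, not Spark's implementation: the class, enum and method names are hypothetical, and the only values taken from the text above are the 8000 default and the "exceeds this threshold" comparison.

```java
// Hypothetical sketch of the decision described above; not Spark source code.
class HugeMethodCheckSketch {

    // Default quoted in the answer for spark.sql.codegen.hugeMethodLimit.
    static final int DEFAULT_HUGE_METHOD_LIMIT = 8000;

    enum ExecutionPath { WHOLE_STAGE_CODEGEN, FALLBACK_WITHOUT_CODEGEN }

    // Whole-stage codegen is kept unless the largest compiled method
    // exceeds the limit (strictly greater than, per "exceeds").
    static ExecutionPath choosePath(int maxMethodBytecodeSize, int hugeMethodLimit) {
        return maxMethodBytecodeSize > hugeMethodLimit
                ? ExecutionPath.FALLBACK_WITHOUT_CODEGEN
                : ExecutionPath.WHOLE_STAGE_CODEGEN;
    }

    public static void main(String[] args) {
        System.out.println(choosePath(1200, DEFAULT_HUGE_METHOD_LIMIT));
        System.out.println(choosePath(9500, DEFAULT_HUGE_METHOD_LIMIT));
    }
}
```

The point of the check is to avoid running a generated method that is so large the JIT would skip it, in which case the non-codegen path for that subtree is likely the better choice.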


    Not that many physical operators support CodegenSupport, so reviewing their doConsume and doProduce methods should reveal whether, if at all, the JIT might not kick in.
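Besides reading the operator code, JIT activity can also be observed from inside a JVM. The sketch below is plain Java and not Spark-specific: it uses the standard java.lang.management.CompilationMXBean to read the JVM's accumulated compilation time before and after running a hot method many times. The class name, the hotLoop workload and the iteration counts are assumptions made up for the illustration. For per-method detail, HotSpot's -XX:+PrintCompilation flag is the more direct tool.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical sketch: observes JVM-wide JIT compilation time around a hot loop.
class JitObservationSketch {

    // A small deterministic workload, called often enough to become "hot".
    static long hotLoop(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += (i % 7) * 3L;
        }
        return acc;
    }

    // Returns the accumulated JIT compilation time in milliseconds,
    // or -1 if this JVM has no compiler or does not expose the metric.
    static long totalJitMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return -1L;
        }
        return jit.getTotalCompilationTime();
    }

    public static void main(String[] args) {
        long before = totalJitMillis();
        long sink = 0;
        for (int run = 0; run < 20_000; run++) {
            sink += hotLoop(1_000); // repeated calls give the JIT a reason to compile
        }
        long after = totalJitMillis();
        System.out.println("result (keeps the loop from being discarded): " + sink);
        System.out.println("JIT time before: " + before + " ms, after: " + after + " ms");
    }
}
```

The reported time covers the whole JVM, so it only indicates that compilation happened during the run, not which method was compiled.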