python, hadoop, apache-spark, hadoop-yarn, pyspark

Apache Spark Python to Scala translation


If I understand correctly, Apache YARN receives the Application Master and Node Manager as JAR files, and they are executed as Java processes on the nodes of the YARN cluster. When I write a Spark program in Python, is it somehow compiled into a JAR? If not, how is Spark able to execute Python logic on the YARN cluster nodes?


Solution

  • The PySpark driver program uses Py4J (http://py4j.sourceforge.net/) to launch a JVM and create a SparkContext. Spark RDD operations written in Python are mapped to operations on PythonRDD.
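
    A quick way to see this from the Python side is the sketch below (note that `_gateway`, `_jvm` and `_jrdd` are internal PySpark attributes, so the exact details may differ between versions):

    ```python
    from pyspark.sql import SparkSession

    # Creating a SparkSession/SparkContext from Python launches a JVM and talks
    # to it through a Py4J gateway behind the scenes.
    spark = (
        SparkSession.builder
        .appName("pyspark-internals-demo")
        .master("local[*]")  # on a real cluster this would be "yarn"
        .getOrCreate()
    )
    sc = spark.sparkContext

    print(type(sc._gateway))  # the Py4J gateway object connecting Python to the JVM
    print(type(sc._jvm))      # a Py4J view used to access classes inside that JVM

    # An RDD transformation written with a Python lambda is backed on the JVM
    # side by a PythonRDD once it is built.
    rdd = sc.parallelize(range(10)).map(lambda x: x * x)
    print(rdd._jrdd.rdd().getClass().getName())  # org.apache.spark.api.python.PythonRDD
    print(rdd.collect())
    ```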

    On the remote workers, PythonRDD launches sub-processes which run Python. The data and code are passed from the remote worker's JVM to its Python sub-processes over pipes.
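
    The toy example below illustrates that mechanism with nothing but the standard library. It is not Spark's actual worker protocol, just a sketch of pickled code and data being piped to a Python child process and the result being piped back:

    ```python
    import pickle
    import subprocess
    import sys
    from operator import neg  # an importable, picklable function stands in for user code

    # What the child process runs: read a pickled (function, data) pair from
    # stdin, apply the function to every element, write the pickled result to stdout.
    child_source = (
        "import pickle, sys\n"
        "func, data = pickle.load(sys.stdin.buffer)\n"
        "pickle.dump([func(x) for x in data], sys.stdout.buffer)\n"
    )

    child = subprocess.Popen(
        [sys.executable, "-c", child_source],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )

    out, _ = child.communicate(pickle.dumps((neg, list(range(5)))))
    print(pickle.loads(out))  # [0, -1, -2, -3, -4]
    ```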

    Therefore, your YARN nodes need to have Python installed for this to work.
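
    If the interpreter is not the default one on the nodes, you can point Spark at a specific Python. A minimal sketch (the interpreter paths are hypothetical; PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON and the spark.pyspark.python setting are the usual knobs):

    ```python
    import os
    from pyspark.sql import SparkSession

    # Hypothetical paths; use whatever interpreter is actually installed on your nodes.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"         # Python used by executors
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"  # Python used by the driver

    spark = (
        SparkSession.builder
        .master("yarn")
        .config("spark.pyspark.python", "/usr/bin/python3")  # config equivalent (Spark 2.1+)
        .getOrCreate()
    )
    ```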

    The Python code is not compiled into a JAR; instead it is distributed around the cluster by Spark. To make this possible, user functions written in Python are pickled using the following code: https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py
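
    To see why ordinary pickling is not enough, compare the standard pickler with cloudpickle on a closure of the kind you would pass to rdd.map(). This sketch uses the copy of cloudpickle bundled with PySpark (the standalone cloudpickle package behaves the same way):

    ```python
    import pickle
    from pyspark import cloudpickle  # the vendored module linked above

    multiplier = 3
    func = lambda x: x * multiplier  # a closure, like the functions passed to rdd.map()

    # The standard pickler serializes functions by reference, so a lambda defined
    # in the driver script cannot be reconstructed on a worker that never saw it.
    try:
        pickle.dumps(func)
    except Exception as exc:
        print("plain pickle failed:", exc)

    # cloudpickle serializes the function body plus the variables it captures,
    # which is what lets Spark ship user code to Python workers on other nodes.
    blob = cloudpickle.dumps(func)
    restored = pickle.loads(blob)  # cloudpickle output is ordinary pickle data
    print(restored(10))  # 30
    ```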

    Source: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals