python, hadoop, apache-spark, hadoop-yarn, pyspark

Apache Spark Python to Scala translation


If I understand correctly, Apache YARN receives the Application Master and Node Manager as JAR files, and they are executed as Java processes on the nodes of the YARN cluster. When I write a Spark program in Python, is it somehow compiled into a JAR? If not, how is Spark able to execute Python logic on the YARN cluster nodes?


Solution

  • The PySpark driver program uses Py4J (http://py4j.sourceforge.net/) to launch a JVM and create a SparkContext. Spark RDD operations written in Python are mapped to operations on PythonRDD.
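
    A quick way to see this from the Python side is the sketch below (note that `_gateway`, `_jvm` and `_jrdd` are internal PySpark attributes, so the exact details may differ between versions):

    ```python
    from pyspark.sql import SparkSession

    # Creating a SparkSession/SparkContext from Python launches a JVM and talks
    # to it through a Py4J gateway behind the scenes.
    spark = (
        SparkSession.builder
        .appName("pyspark-internals-demo")
        .master("local[*]")  # on a real cluster this would be "yarn"
        .getOrCreate()
    )
    sc = spark.sparkContext

    print(type(sc._gateway))  # the Py4J gateway object connecting Python to the JVM
    print(type(sc._jvm))      # a Py4J view used to access classes inside that JVM

    # An RDD transformation written with a Python lambda is backed on the JVM
    # side by a PythonRDD once it is built.
    rdd = sc.parallelize(range(10)).map(lambda x: x * x)
    print(rdd._jrdd.rdd().getClass().getName())  # org.apache.spark.api.python.PythonRDD
    print(rdd.collect())
    ```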

    On the remote workers, PythonRDD launches sub-processes which run Python. The data and code are passed from the remote worker's JVM to its Python sub-processes over pipes.
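
    The toy example below illustrates that mechanism with nothing but the standard library. It is not Spark's actual worker protocol, just a sketch of pickled code and data being piped to a Python child process and the result being piped back:

    ```python
    import pickle
    import subprocess
    import sys
    from operator import neg  # an importable, picklable function stands in for user code

    # What the child process runs: read a pickled (function, data) pair from
    # stdin, apply the function to every element, write the pickled result to stdout.
    child_source = (
        "import pickle, sys\n"
        "func, data = pickle.load(sys.stdin.buffer)\n"
        "pickle.dump([func(x) for x in data], sys.stdout.buffer)\n"
    )

    child = subprocess.Popen(
        [sys.executable, "-c", child_source],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )

    out, _ = child.communicate(pickle.dumps((neg, list(range(5)))))
    print(pickle.loads(out))  # [0, -1, -2, -3, -4]
    ```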

    Therefore, your YARN nodes need to have Python installed for this to work.
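
    If the interpreter is not the default one on the nodes, you can point Spark at a specific Python. A minimal sketch (the interpreter paths are hypothetical; PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON and the spark.pyspark.python setting are the usual knobs):

    ```python
    import os
    from pyspark.sql import SparkSession

    # Hypothetical paths; use whatever interpreter is actually installed on your nodes.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"         # Python used by executors
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"  # Python used by the driver

    spark = (
        SparkSession.builder
        .master("yarn")
        .config("spark.pyspark.python", "/usr/bin/python3")  # config equivalent (Spark 2.1+)
        .getOrCreate()
    )
    ```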

    The Python code is not compiled into a JAR; instead it is distributed around the cluster by Spark. To make this possible, user functions written in Python are pickled using the following code: https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py
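
    To see why ordinary pickling is not enough, compare the standard pickler with cloudpickle on a closure of the kind you would pass to rdd.map(). This sketch uses the copy of cloudpickle bundled with PySpark (the standalone cloudpickle package behaves the same way):

    ```python
    import pickle
    from pyspark import cloudpickle  # the vendored module linked above

    multiplier = 3
    func = lambda x: x * multiplier  # a closure, like the functions passed to rdd.map()

    # The standard pickler serializes functions by reference, so a lambda defined
    # in the driver script cannot be reconstructed on a worker that never saw it.
    try:
        pickle.dumps(func)
    except Exception as exc:
        print("plain pickle failed:", exc)

    # cloudpickle serializes the function body plus the variables it captures,
    # which is what lets Spark ship user code to Python workers on other nodes.
    blob = cloudpickle.dumps(func)
    restored = pickle.loads(blob)  # cloudpickle output is ordinary pickle data
    print(restored(10))  # 30
    ```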

    Source: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals