Tags: python, apache-spark, pyspark, rodeo

PySpark has a worker-driver version conflict when run in Rodeo


The following simple PySpark script works fine when run from the terminal:

import pyspark

# Create a SparkContext and distribute a small list as an RDD
sc = pyspark.SparkContext()
foo = sc.parallelize([1, 2])

# Print each element; the function executes on the worker processes
foo.foreach(print)

But when run in Rodeo, it produces an error, the most important line of which reads:

Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions

The full error output can be found at this link: http://pastebin.com/raw/unGuGLhq
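
For what it's worth, the mismatch is visible from inside a session: the driver runs on whichever interpreter launched the script, while PySpark starts its workers with the interpreter named by the PYSPARK_PYTHON environment variable, falling back to plain python when it is unset. A minimal sketch of that check (assuming it is run from the same environment Rodeo sees):

import os
import sys

# Driver side: the interpreter that launched this script (3.5 here)
print("driver:", sys.version_info[:2])

# Worker side: PySpark launches workers with the interpreter named by
# PYSPARK_PYTHON; when unset it falls back to plain `python`, which on
# many Linux systems still resolves to Python 2.7, hence the conflict
print("workers:", os.environ.get("PYSPARK_PYTHON", "python"))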

My $SPARK_HOME/conf/spark-env.sh file contains the following lines:

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3

The problem persists despite this, and putting the same lines in ~/.bashrc doesn't solve it either.

Rodeo version: 1.3.0

Spark version: 1.6.1

Platform: Linux


Solution

  • This issue is related to one described here: link

    Rodeo, as a desktop app, has a hard time picking up shell environment variables. The trick is to put the variables we'd normally declare in spark-env.sh into Rodeo's .rodeoprofile instead, setting them via the os module. Specifically, in this case, adding the following lines to .rodeoprofile helped:

    os.environ["PYSPARK_PYTHON"]="python3"
    os.environ["PYSPARK_DRIVER_PYTHON"]="python3"
    

    (though the second one is redundant; I added it just for consistency, as the driver was already using 3.5)
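
    With those lines in place, a quick sanity check (a minimal sketch, assuming a fresh Rodeo session) is to compare the driver's version, which PySpark exposes as sc.pythonVer, with the version a worker reports from a trivial job:

    import pyspark

    sc = pyspark.SparkContext()

    # Driver-side Python version, e.g. '3.5'
    print(sc.pythonVer)

    # Worker-side Python version, sampled by running a one-element job
    print(sc.parallelize([0]).map(
        lambda _: __import__("sys").version_info[:2]).first())

    If both report 3.5, the worker/driver conflict is gone.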