The following simple script works fine in pyspark when it is run from the terminal:
import pyspark
sc = pyspark.SparkContext()
foo = sc.parallelize([1,2])
foo.foreach(print)
But when run in Rodeo, it produces an error, the most important line of which says:
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions
And the full error output can be found at this link: http://pastebin.com/raw/unGuGLhq
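To see which two interpreters the message is talking about, a check along these lines can be run from inside Rodeo. It is only a rough sketch: `interpreter_version` is a helper name made up for this example, and it assumes that Spark falls back to plain `python` for the workers when PYSPARK_PYTHON is not visible to the process.

```python
import os
import subprocess
import sys


def interpreter_version(executable):
    """Return 'major.minor' for the given interpreter, or None if it cannot be run."""
    try:
        out = subprocess.check_output(
            [executable, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"]
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.decode().strip()


# Version of the interpreter running this code (the driver side).
driver_version = "%d.%d" % sys.version_info[:2]

# Assumption: with PYSPARK_PYTHON unset, the workers are started with plain "python".
worker_exec = os.environ.get("PYSPARK_PYTHON", "python")
worker_version = interpreter_version(worker_exec)

print("driver:", driver_version)
print("worker executable:", worker_exec, "->", worker_version)
```

If the two versions printed differ, the environment variable is not reaching the process that creates the SparkContext.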
My $SPARK_HOME/conf/spark-env.sh file contains the following lines:
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
The problem persists despite that, and putting the same lines in ~/.bashrc doesn't solve it either.
Rodeo version: 1.3.0
Spark version: 1.6.1
Platform: Linux
This issue is related to one described here: link
Rodeo, as a desktop app, has a hard time working with shell environment variables. The trick is to put the variables we'd normally declare in spark-env.sh in Rodeo's .rodeoprofile instead, using the os module to set them. Specifically, in this case adding the following lines to .rodeoprofile helped:
os.environ["PYSPARK_PYTHON"]="python3"
os.environ["PYSPARK_DRIVER_PYTHON"]="python3"
(though the second one is redundant; I added it just for consistency, as the driver used 3.5 anyway)
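For completeness, here is a slightly fuller sketch of what those lines can look like as a standalone snippet. It assumes that .rodeoprofile is executed as ordinary Python before the SparkContext is created and that `os` may not be imported yet. The helper name `apply_pyspark_env` is made up for this example, and the interpreter name `python3` is the one from my setup, so adjust it if yours differs.

```python
import os
import sys

# Values taken from the lines above; "python3" must be resolvable on the PATH.
PYSPARK_ENV = {
    "PYSPARK_PYTHON": "python3",
    "PYSPARK_DRIVER_PYTHON": "python3",
}


def apply_pyspark_env(env=PYSPARK_ENV):
    """Set the variables in this process so that code run afterwards sees them."""
    os.environ.update(env)
    # Read the values back so it is clear what is actually in the environment.
    return {name: os.environ[name] for name in env}


applied = apply_pyspark_env()
print("driver Python:", "%d.%d" % sys.version_info[:2])
print("worker interpreter:", applied["PYSPARK_PYTHON"])
```

The point is simply that the variables are set inside the Python process itself, so it does not matter whether the desktop app inherited anything from the shell.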