Tags: python-3.x, apache-spark, pyspark, python-3.5, apache-zeppelin

Using pyspark in Zeppelin with python3 on Spark 2.1.0


I am trying to run pyspark in Zeppelin with python3 (3.5) against Spark 2.1.0. I have the pyspark shell up and running with python3, but switching over to Zeppelin, connected to the same local cluster, gives:

Exception: Python in worker has different version 3.5 than that in driver 2.7, PySpark cannot run with different minor versions
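
One quick way to confirm which interpreter the driver side is running is to print it from inside a notebook paragraph. This is a minimal sketch using only the standard library; `describe_interpreter` is a made-up helper name, and it reports only the Python process the code runs in (the driver), not the workers.

```python
import sys


def describe_interpreter():
    """Return the path and major.minor version of the Python running this code."""
    return {
        "executable": sys.executable,
        "version": "%d.%d" % (sys.version_info[0], sys.version_info[1]),
    }


info = describe_interpreter()
print("driver python: %s (%s)" % (info["executable"], info["version"]))
```

Run in a pyspark paragraph, a reported version of 2.7 would match the "driver 2.7" half of the exception above.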

I have modified the default spark-env.sh as follows (unmodified lines omitted for brevity):

SPARK_LOCAL_IP=127.0.0.1
SPARK_MASTER_HOST="localhost"
SPARK_MASTER_WEBUI_PORT=8080
SPARK_MASTER_PORT=7077
SPARK_DAEMON_JAVA_OPTS="-Djava.net.preferIPv4Stack=true"
export PYSPARK_PYTHON=/Library/Frameworks/Python.framework/Versions/3.5/bin/python3
export PYSPARK_DRIVER_PYTHON=/Library/Frameworks/Python.framework/Versions/3.5/bin/ipython

Starting things up with ./bin/pyspark, all is good in the shell.

The Zeppelin config has been modified in zeppelin-site.xml only to move the UI port from 8080 to 8666. zeppelin-env.sh has been modified as follows (showing only mods/additions):

export MASTER=spark://127.0.0.1:7077
export SPARK_APP_NAME=my_zeppelin-mf
export PYSPARK_PYTHON=/Library/Frameworks/Python.framework/Versions/3.5/bin/python3
export PYSPARK_DRIVER_PYTHON=/Library/Frameworks/Python.framework/Versions/3.5/bin/ipython
export PYTHONPATH=/Library/Frameworks/Python.framework/Versions/3.5/bin/python3

I've tried using Anaconda, but python 3.6 is currently creating issues with Spark. I've also tried a bunch of combinations of the above config settings without success.

There is a setting referenced in the configs, zeppelin.pyspark.python, which defaults to python, but it is unclear from the docs how or where to change it to python3. To help rule out OSX specifics, I was able to replicate this failure on Linux Mint 18.1 as well.
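
Since that setting defaults to the bare name python, it may help to see what that name resolves to on the machine. The sketch below assumes Python 3.3+ (for `shutil.which`); `interpreter_version` is a hypothetical helper, and the PATH seen by the Zeppelin process may differ from the PATH of the shell this is run from.

```python
import shutil
import subprocess


def interpreter_version(command):
    """Return the major.minor version reported by `command`, or None if it cannot be run."""
    path = shutil.which(command)
    if path is None:
        return None
    try:
        out = subprocess.check_output(
            [path, "-c", "import sys; print('%d.%d' % sys.version_info[:2])"]
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.decode().strip()


# Compare what the default name and the desired name resolve to.
for name in ("python", "python3"):
    print(name, "->", shutil.which(name), interpreter_version(name))
```

If python resolves to a 2.7 interpreter while python3 resolves to 3.5, that would be consistent with the driver/worker mismatch in the exception.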

  • Running local on OSX 10.11.6
  • Spark is 2.1.0-bin-hadoop2.7
  • Zeppelin 0.7.0-bin-all

So I've been rifling through the Zeppelin docs and the Internet trying to find the proper config setting to get Zeppelin to run as a 3.5 driver. Hopefully I'm missing something obvious, but I cannot seem to track this one down. I'm hoping someone has done this successfully and can help identify my error.

Thank you.


Solution

  • Naturally, something worked right after posting this...

    In the Zeppelin config at ./conf/interpreter.json, for one of my notebooks I found the config:

     "properties": {
        ...
        "zeppelin.pyspark.python": "python",
        ... 
     }
    

    Changing this to:

     "properties": {
        ...
        "zeppelin.pyspark.python": "python3",
        ... 
     }
    

    (combined with the same settings as above) has had the desired effect of getting the notebook to work with python 3.5. However, this seems a bit clunky/hacky, and I suspect there is a more elegant way to do this. So I won't call this a solution/answer, but more of a workaround.
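
The manual edit above could also be scripted, roughly as follows. This is a sketch, not a tested procedure for any particular Zeppelin release: it assumes only what is shown above (a "properties" object holding "zeppelin.pyspark.python" as a plain string somewhere inside ./conf/interpreter.json), and `set_pyspark_python` and `patch_file` are made-up helper names.

```python
import json


def set_pyspark_python(node, value="python3"):
    """Set "zeppelin.pyspark.python" in every "properties" dict found under node.

    Returns the number of entries that were changed.
    """
    key = "zeppelin.pyspark.python"
    changed = 0
    if isinstance(node, dict):
        props = node.get("properties")
        # Only touch the plain-string form shown in the answer above.
        if isinstance(props, dict) and isinstance(props.get(key), str):
            if props[key] != value:
                props[key] = value
                changed += 1
        for child in node.values():
            changed += set_pyspark_python(child, value)
    elif isinstance(node, list):
        for child in node:
            changed += set_pyspark_python(child, value)
    return changed


def patch_file(path="conf/interpreter.json", value="python3"):
    """Rewrite the JSON file in place if any entry needed changing."""
    with open(path) as fh:
        config = json.load(fh)
    changed = set_pyspark_python(config, value)
    if changed:
        with open(path, "w") as fh:
            json.dump(config, fh, indent=2)
    return changed
```

Calling `patch_file()` from the Zeppelin install directory would apply the change. It is probably wise to back up the file first and to do this while Zeppelin is stopped, since a restart is likely needed for the new value to take effect.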