How can I use Python > 2.6.6 with Spark on BigInsights on cloud Enterprise clusters?

The version of Python that ships with BigInsights is currently 2.6.6. How can I use a different version of Python with my Spark jobs running on YARN?

Note that users of BigInsights on cloud do not have root access.


  • Install Anaconda

    The script below installs Anaconda Python on a BigInsights on cloud 4.2 Enterprise cluster. Note that these instructions do NOT work for Basic clusters, because on a Basic cluster you can only log in to a shell node and not to any of the other nodes.

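    SSH into the mastermanager node, then run the following (changing the values for your environment):

    export BI_USER=snowch
    export BI_PASS=changeme
    # hostname of the mastermanager node; placeholder value, replace with your own
    export BI_HOST=your-mastermanager-hostname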

    Next, run the following. The script attempts to be as idempotent as possible, so it shouldn't matter if you run it multiple times:

    # abort if the script encounters an error or an undeclared variable
    set -eu
    # ANACONDA_URL is a placeholder name: export it with the URL of the
    # Anaconda2 Linux installer script you want before running this script
    ANACONDA_URL=${ANACONDA_URL:?set ANACONDA_URL to the Anaconda2 installer URL}
    ANACONDA_INSTALLER=$(basename "$ANACONDA_URL")
    # Look up the cluster name and host names through the REST API on port 9443
    CLUSTER_NAME=$(curl -s -k -u "$BI_USER:$BI_PASS" -X GET https://${BI_HOST}:9443/api/v1/clusters | python -c 'import sys, json; print(json.load(sys.stdin)["items"][0]["Clusters"]["cluster_name"]);')
    echo "Cluster Name: $CLUSTER_NAME"
    CLUSTER_HOSTS=$(curl -s -k -u "$BI_USER:$BI_PASS" -X GET https://${BI_HOST}:9443/api/v1/clusters/${CLUSTER_NAME}/hosts | python -c 'import sys, json; items = json.load(sys.stdin)["items"]; hosts = [ item["Hosts"]["host_name"] for item in items ]; print(" ".join(hosts));')
    echo "Cluster Hosts: $CLUSTER_HOSTS"
    wget -c "$ANACONDA_URL"
    # Install anaconda if it isn't already installed
    [[ -d anaconda2 ]] || bash "$ANACONDA_INSTALLER" -b
    # You can install your pip modules using something like this:
    # ${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/bin/pip install yourlibrary
    # Install anaconda on all of the other cluster nodes
    for CLUSTER_HOST in $CLUSTER_HOSTS; do
       if [[ "$CLUSTER_HOST" != "$BI_HOST" ]]; then
          echo "*** Processing $CLUSTER_HOST ***"
          ssh $BI_USER@$CLUSTER_HOST "wget -q -c $ANACONDA_URL"
          ssh $BI_USER@$CLUSTER_HOST "[[ -d anaconda2 ]] || bash $ANACONDA_INSTALLER -b"
          # You can install your pip modules on each node using something like this:
          # ssh $BI_USER@$CLUSTER_HOST "${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/bin/pip install yourlibrary"
          # Set the PYSPARK_PYTHON path on all of the nodes
          ssh $BI_USER@$CLUSTER_HOST "grep '^export PYSPARK_PYTHON=' ~/.bash_profile || echo export PYSPARK_PYTHON=${HOME}/anaconda2/bin/python2.7 >> ~/.bash_profile"
          ssh $BI_USER@$CLUSTER_HOST "sed -i -e 's;^export PYSPARK_PYTHON=.*$;export PYSPARK_PYTHON=${HOME}/anaconda2/bin/python2.7;g' ~/.bash_profile"
          ssh $BI_USER@$CLUSTER_HOST "cat ~/.bash_profile"
       fi
    done
    echo 'Finished installing'
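
    The grep and sed pair near the end of the script is what makes the PYSPARK_PYTHON setting safe to re-run: append the export line if it is missing, otherwise rewrite it in place. Here is a minimal sketch of that pattern as a standalone function, assuming GNU sed (as found on Linux nodes). The name set_export is hypothetical, and the demo writes to a scratch file rather than the real ~/.bash_profile:

```shell
#!/usr/bin/env bash
# Hypothetical helper: set_export FILE NAME VALUE
# Ensures FILE ends up with one up-to-date "export NAME=VALUE" line.
set_export() {
    local file=$1 name=$2 value=$3
    touch "$file"
    if grep -q "^export ${name}=" "$file"; then
        # Line already present: rewrite it in place.
        # ';' is the sed delimiter because the value contains '/'.
        sed -i -e "s;^export ${name}=.*\$;export ${name}=${value};" "$file"
    else
        # Line missing: append it.
        echo "export ${name}=${value}" >> "$file"
    fi
}

# Demo against a scratch file; calling twice must not duplicate the line.
profile=$(mktemp)
set_export "$profile" PYSPARK_PYTHON /home/biadmin/anaconda2/bin/python2.7
set_export "$profile" PYSPARK_PYTHON /home/biadmin/anaconda2/bin/python2.7
cat "$profile"
```

    A value containing ';' or '&' would need escaping before being handed to sed, which this sketch does not attempt.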

    Running a pyspark job

    To run a pyspark job with Anaconda Python, set the following variables before submitting the job:

    export SPARK_HOME=/usr/iop/current/spark-client
    export HADOOP_CONF_DIR=/usr/iop/current/hadoop-client/conf
    # set these to the folders where you installed anaconda
    export PYSPARK_PYTHON=/home/biadmin/anaconda2/bin/python2.7
    export PYSPARK_DRIVER_PYTHON=/home/biadmin/anaconda2/bin/python2.7
    spark-submit --master yarn --deploy-mode client ...
    # NOTE: --deploy-mode cluster does not seem to use the PYSPARK_PYTHON setting
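
    For the cluster-mode caveat noted above, one possible workaround (not verified on BigInsights) is to pass the interpreter path through Spark's own configuration instead of the shell environment. Spark documents spark.yarn.appMasterEnv.[Name] for environment variables of the YARN application master (which hosts the driver in cluster mode) and spark.executorEnv.[Name] for executors. The helper name pyspark_python_conf is hypothetical, and this sketch only prints the command instead of running it:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the --conf flags that hand PYSPARK_PYTHON
# to the YARN application master and to the executors.
pyspark_python_conf() {
    local python_path=$1
    printf -- '--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=%s ' "$python_path"
    printf -- '--conf spark.executorEnv.PYSPARK_PYTHON=%s\n' "$python_path"
}

# Assumes Anaconda was installed under /home/biadmin on every node.
flags=$(pyspark_python_conf /home/biadmin/anaconda2/bin/python2.7)
echo "spark-submit --master yarn --deploy-mode cluster $flags ..."
```

    The path must exist on every node, which is what the install script takes care of.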

    Zeppelin (optional)

    If you are using Zeppelin (as per these instructions for BigInsights on cloud), set the following variables in Zeppelin's environment configuration:

    # set these to the folders where you installed anaconda
    export PYSPARK_PYTHON=/home/biadmin/anaconda2/bin/python2.7
    export PYSPARK_DRIVER_PYTHON=/home/biadmin/anaconda2/bin/python2.7
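
    Before restarting Zeppelin or submitting a job, it can help to confirm that both variables actually point at an interpreter on the node you are logged in to. This is a rough sanity check only, with a hypothetical function name check_python_var; it says nothing about the other nodes in the cluster:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the named variable holds the path
# of an existing, executable file.
check_python_var() {
    local name=$1
    local path=${!name:-}   # indirect lookup; empty if the variable is unset
    if [[ -z "$path" ]]; then
        echo "$name is not set"
        return 1
    elif [[ ! -f "$path" || ! -x "$path" ]]; then
        echo "$name=$path is not an executable file"
        return 1
    fi
    echo "$name=$path OK"
}

# Report on both variables without aborting on the first failure.
for var in PYSPARK_PYTHON PYSPARK_DRIVER_PYTHON; do
    check_python_var "$var" || true
done
```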