azure · apache-spark · azure-hdinsight

How to change the Python version from 2.7 to 3.5 in HDInsight Spark


I have an Azure HDInsight cluster running Spark and have verified that it is currently using its default Python version, 2.7. However, according to the Microsoft docs it should be possible to change it to 3.5, since both Python 2.7 and 3.5 are supported Python versions for Spark 2.4 (HDI 4.0).

I tried adding the following values to Advanced spark2-env in the Ambari GUI:

export PYSPARK_PYTHON=/usr/bin/anaconda/envs/py35/bin/python
export MY_TEST_ENV=this_is_a_test_if_this_gets_executed

but when running code in Spark I can see that it still uses Python 2.7!

Moreover, it seems that whatever I enter in Advanced spark2-env does not even get executed! When I try this in my Spark job:

import sys
import os
from pyspark import SparkContext, SparkConf
 
if __name__ == "__main__":
    
    # create Spark context with necessary configuration
    conf = SparkConf().setAppName("WordCount").set("spark.hadoop.validateOutputSpecs", "false")
    sc = SparkContext(conf=conf)
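    # print the Python-related environment variables as the driver process sees them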
    print(os.environ["PYSPARK_PYTHON"])
    print(os.environ["PYSPARK_DRIVER_PYTHON"])
    print(os.environ["MY_TEST_ENV"])

I get the following output:

/usr/bin/anaconda/bin/python
/usr/bin/anaconda/envs/py35/bin/python
Traceback (most recent call last):
  File "wordcount.py", line 12, in <module>
    print(os.environ["MY_TEST_ENV"])
  File "/usr/bin/anaconda/envs/py35/lib/python3.5/os.py", line 725, in __getitem__
    raise KeyError(key) from None
KeyError: 'MY_TEST_ENV'
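
To see which interpreter is actually in use (the environment variables only show what was requested, not what runs), a quick check like the following can be added to the job. The worker_python helper is just a throwaway name for this test:

import sys
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# interpreter running the driver process itself
print("driver:  ", sys.executable, sys.version_info[:3])

# interpreter running on an executor; the import inside the
# function is resolved on the worker side
def worker_python(_):
    import sys
    return (sys.executable, sys.version)

print("executor:", sc.range(1).map(worker_python).first())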

Solution

  • I found the solution.

    In the config section Custom spark2-defaults I had to override the entry for spark.yarn.appMasterEnv.PYSPARK_PYTHON, setting it to /usr/bin/anaconda/envs/py35/bin/python3.
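
    For reference, written as plain Spark properties the change amounts to the first line below. The spark.executorEnv variant is an assumption for setups where executors still resolve a different interpreter; I did not need it here:

    spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/anaconda/envs/py35/bin/python3
    # assumption: only needed if executors still pick up another interpreter
    spark.executorEnv.PYSPARK_PYTHON=/usr/bin/anaconda/envs/py35/bin/python3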