python-3.x, amazon-web-services, apache-spark, pyspark, amazon-emr

AWS EMR: can't change the default PySpark Python in the bootstrap script


I am using AWS EMR and trying to change the bootstrap script so that the default Python in PySpark is Python 3. I am following this tutorial.

This changes the /usr/lib/spark/conf/spark-env.sh file, but it does not change the Python version PySpark uses; my jobs still run with Python 2.7. It only works when I SSH into the machine and explicitly run:

$ source /usr/lib/spark/conf/spark-env.sh

When I add this line to the bootstrap script, the bootstrap fails with an error saying the file was not found:

/bin/bash: /usr/lib/spark/conf/spark-env.sh: No such file or directory

I assume the file does not exist yet at this stage. How can I set PySpark's Python to Python 3 from the bootstrap script?


Solution

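  • Instead of editing spark-env.sh in the bootstrap script, add the following JSON to the cluster's software configuration (create EMR -> step 1: software and steps -> edit software configuration -> enter configuration). It sets PYSPARK_PYTHON through the spark-env classification: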

    [
      {
        "Classification": "spark-env",
        "Configurations": [
          {
            "Classification": "export",
            "Properties": {
              "PYSPARK_PYTHON": "/usr/bin/python3"
            }
          }
        ]
      }
    ]
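
If you create clusters from a script rather than the console, the same configuration can be generated as a file. Below is a minimal sketch, not a definitive implementation: the helper names `build_spark_env_config` and `write_config` and the file name `spark-python3.json` are made up for illustration, and `/usr/bin/python3` is assumed to be where Python 3 lives on the cluster nodes.

```python
import json


def build_spark_env_config(python_path="/usr/bin/python3"):
    """Return the EMR configuration list that exports PYSPARK_PYTHON.

    python_path is an assumption: adjust it if Python 3 is installed
    somewhere else on your cluster nodes.
    """
    return [
        {
            "Classification": "spark-env",
            "Configurations": [
                {
                    "Classification": "export",
                    "Properties": {"PYSPARK_PYTHON": python_path},
                }
            ],
        }
    ]


def write_config(path="spark-python3.json", python_path="/usr/bin/python3"):
    """Write the configuration to a JSON file (hypothetical file name)."""
    config = build_spark_env_config(python_path)
    with open(path, "w") as fh:
        json.dump(config, fh, indent=2)
    return config
```

Calling `write_config()` produces a file that should be usable with the `--configurations file://spark-python3.json` option of `aws emr create-cluster`. Once the cluster is up, running `import sys; print(sys.version)` in a pyspark shell is a quick way to check which interpreter is actually in use.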