Tags: python, amazon-web-services, pyspark, amazon-emr

Cluster terminates but works locally


I am trying to deploy a Spark job (using the PySpark ML libraries) on AWS EMR. I want to create a simple cluster with a single instance, to understand how EMR works.

I create the cluster through the console with the following configuration:

spark-submit --deploy-mode cluster s3://bucket/key/file.py

My step fails with a bunch of error logs that I struggle to understand, apart from this one:

  File "PowerProdPredictionEmr.py", line 261
    df = df.select("Perimetre", *target_exprs, *window_exprs, "rn")

SyntaxError: invalid syntax

I don't understand this, since the same code works locally on my machine.

Here is the code:

...
window_exprs = [df.power_prod[i] for i in range(w*sample_week)]
df = df.select("Perimetre", *target_exprs, *window_exprs, "rn")
...

Any ideas? I can add other log files if necessary.


Solution

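  • As @user10938362 pointed out, even though EMR supports Python up to version 3.6, Python 2.x is the default version installed on the instances. A call with more than one * unpacking, or with a positional argument after a * unpacking, is only valid syntax from Python 3.5 onwards, which is why line 261 raises a SyntaxError on the cluster while it runs fine locally (presumably on Python 3.5 or later).

    To make Python 3 the default for PySpark, add the following configuration in "Edit Software / Enter configuration".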

    [
      {
         "Classification": "spark-env",
         "Configurations": [
           {
             "Classification": "export",
             "Properties": {
                "PYSPARK_PYTHON": "/usr/bin/python3"
              }
           }
        ]
      }
    ]
    

    PySpark then runs on Python 3 instead of Python 2, which should resolve the syntax error.
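
To confirm which interpreter the step actually runs on, a small check at the top of the job can help. The sketch below is only an illustration: the helper names (python_major_minor, require_python3, build_select_args) are hypothetical and not part of the original script. The last helper shows one way to build the select arguments with a single * unpacking, a form that Python 2 also accepts, in case changing the cluster configuration is not an option.

```python
import sys


def python_major_minor(version_info=sys.version_info):
    """Return the interpreter version as a 'major.minor' string."""
    return "%d.%d" % (version_info[0], version_info[1])


def require_python3(version_info=sys.version_info):
    """Fail fast with a readable message if the job runs under Python 2."""
    if version_info[0] < 3:
        raise RuntimeError(
            "This job needs Python 3 but is running on Python %s"
            % python_major_minor(version_info)
        )
    return python_major_minor(version_info)


def build_select_args(target_exprs, window_exprs):
    """Concatenate the select arguments into one list.

    Passing the result with a single * (df.select(*args)) avoids the
    multiple-unpacking syntax that Python 2 rejects.
    """
    return ["Perimetre"] + list(target_exprs) + list(window_exprs) + ["rn"]


if __name__ == "__main__":
    print("Running on Python " + require_python3())
    # In the job itself (assuming df, target_exprs and window_exprs exist):
    # df = df.select(*build_select_args(target_exprs, window_exprs))
```

If the step still reports Python 2 after the configuration change, the version printed in the step's stdout log is a quick way to tell whether the spark-env setting was picked up.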