python, amazon-web-services, apache-spark, pyspark, amazon-emr

Facing an error while trying to create a transient cluster on AWS EMR to run a Python script


I am new to AWS and am trying to create a transient cluster on AWS EMR to run a Python script. I just want to run the Python script, which will process the file, and have the cluster auto-terminate on completion. I have also created a key pair and specified it.

Command below:

    aws emr create-cluster --name "test1-cluster" --release-label emr-5.5.0 --name pyspark_analysis --ec2-attributes KeyName=k-key-pair --applications Name=Hadoop Name=Hive Name=Spark --instance-groups --use-default-roles --instance-type m5-xlarge --instance-count 2 --region us-east-1 --log-uri s3://k-test-bucket-input/logs/ --steps Type=SPARK, Name="pyspark_analysis", ActionOnFailure=CONTINUE, Args=[-deploy-mode,cluster, -master,yarn, -conf,spark.yarn.submit.waitAppCompletion=true, -executor-memory,1g, s3://k-test-bucket-input/word_count.py, s3://k-test-bucket-input/input/a.csv, s3://k-test-bucket-input/output/ ] --auto-terminate

Error message:

    zsh: bad pattern: Args=[

What I tried:

I checked the args for stray spaces and accidentally introduced characters, but nothing looks off. My syntax is surely wrong somewhere, but I am not sure what I am missing.

What the command is expected to do:

It is expected to execute word_count.py by reading the input file a.csv and generating the output in b.csv.
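
For reference, since the script itself is not shown in the question, here is a minimal sketch of what such a word_count.py could look like, assuming it takes the input file and output directory as positional arguments (matching the two S3 paths passed after the script in the step Args):

    import sys
    from pyspark.sql import SparkSession

    # Hypothetical sketch -- the real word_count.py is not shown in the question.
    if __name__ == "__main__":
        input_path, output_path = sys.argv[1], sys.argv[2]
        spark = SparkSession.builder.appName("word_count").getOrCreate()
        # Read the input as plain text, one row per line of a.csv
        lines = spark.read.text(input_path)
        words = lines.rdd.flatMap(lambda row: row.value.split())
        counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
        # Write the (word, count) pairs as CSV files into the output directory
        counts.toDF(["word", "count"]).write.csv(output_path, header=True)
        spark.stop()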


Solution

  • I think the issue is with the use of spaces in --steps. I reformatted the command so it's a bit easier to see where the spaces are (or the lack of them). While at it, I fixed a few smaller problems that would bite next: spark-submit options take two dashes (--deploy-mode, --master, --conf, --executor-memory), the instance type is spelled m5.xlarge, --name was given twice, and a bare --instance-groups conflicts with --instance-type/--instance-count:

    aws emr create-cluster \
        --name "test1-cluster" \
        --release-label emr-5.5.0 \
        --ec2-attributes KeyName=k-key-pair \
        --applications Name=Hadoop Name=Hive Name=Spark \
        --use-default-roles \
        --instance-type m5.xlarge --instance-count 2 \
        --region us-east-1 --log-uri s3://k-test-bucket-input/logs/ \
        --steps Type=SPARK,Name="pyspark_analysis",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,--executor-memory,1g,s3://k-test-bucket-input/word_count.py,s3://k-test-bucket-input/input/a.csv,s3://k-test-bucket-input/output/] \
        --auto-terminate
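
    For context on the error itself: [ starts a glob pattern in zsh, and because the --steps value was unquoted and contained spaces, the shell split it into several words, one of which began a bracket expression that never closes. zsh rejects such a word before aws even runs, which you can reproduce with something as small as:

        % echo Args=[
        zsh: bad pattern: Args=[

    Even with the spaces removed, zsh may still try to glob the bracketed part and complain (typically "no matches found"), so quoting the whole --steps value is a safe variant:

        --steps 'Type=SPARK,Name=pyspark_analysis,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,--executor-memory,1g,s3://k-test-bucket-input/word_count.py,s3://k-test-bucket-input/input/a.csv,s3://k-test-bucket-input/output/]' \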