I'm trying to create an EMR Spark cluster with a single custom step. The cluster is created successfully; however, the step is not defined correctly.
I tried to launch the same cluster via the web console and got the same result. Although I specify the JAR location, when I save the step the JAR location is set to command-runner.jar and the provided JAR path is added to the Arguments list.
CLI Command:
aws emr create-cluster --name 'emr-test' \
--applications Name=Spark \
--release-label emr-5.11.1 \
--auto-terminate \
--instance-type m3.xlarge \
--instance-count 1 \
--ec2-attributes SubnetId=subnet-000000 \
--steps '[{
"Type": "SPARK",
"Name": "spark-program",
"Args": ["--class","--init-keyspaces"],
"Jar": "s3://mybucket/snapshots/0.1.0-SNAPSHOT/2.11/my-spark-assembly-0.1.0-SNAPSHOT.jar",
"ActionOnFailure": "TERMINATE_CLUSTER"
}]' \
--use-default-roles \
--log-uri 's3://mybucket/logs' \
--tags Name='spark-program' Environment='test'
When I check the Steps tab in the console, this is what I see:
JAR location: command-runner.jar
Main class: None
Arguments: spark-submit --class --init-keyspaces
Action on failure: Terminate cluster
This is what I expected:
JAR location: s3://mybucket/snapshots/0.1.0-SNAPSHOT/2.11/my-spark-assembly-0.1.0-SNAPSHOT.jar
Main class: com.myspark.data.customer.jobs.MyJob
Arguments: spark-submit --class --init-keyspaces
Action on failure: Terminate cluster
I've confirmed the S3 bucket and JAR are in the correct location. I'm getting the same result when using standard syntax as well.
I found that my expectation was incorrect. When a new job is created via the CLI with only JAR arguments, a Custom JAR job is created. If Spark arguments (e.g. --conf) are also passed to the CLI, a Spark job is created.
These two job types look different in the web console. For example, the JAR location is set to command-runner.jar for Spark jobs, whereas for a Custom JAR job it is set to the S3 path of the JAR.
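To illustrate the Spark case, here is a minimal sketch of a step definition in which the application JAR is passed as an argument to spark-submit rather than through the "Jar" field. The class name and S3 path are taken from the question. Treating --init-keyspaces as an application argument, the --deploy-mode cluster setting, and the file name spark-step.json are my assumptions, so adjust them to your job.

```shell
# Write the step definition to a file (spark-step.json is a hypothetical name).
# For a Spark-type step the console shows command-runner.jar as the JAR location,
# and everything in "Args" is handed to spark-submit.
cat > spark-step.json <<'EOF'
[
  {
    "Type": "SPARK",
    "Name": "spark-program",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "Args": [
      "--deploy-mode", "cluster",
      "--class", "com.myspark.data.customer.jobs.MyJob",
      "s3://mybucket/snapshots/0.1.0-SNAPSHOT/2.11/my-spark-assembly-0.1.0-SNAPSHOT.jar",
      "--init-keyspaces"
    ]
  }
]
EOF

# Then reference the file from create-cluster, keeping the other options
# from the command in the question:
#   --steps file://spark-step.json
```

With a step like this, seeing command-runner.jar as the JAR location and the S3 path in the Arguments list is the expected outcome, not an error.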
AWS documentation on adding a Spark step: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html