I'm submitting a Spark job to EMR via the AWS CLI; the EMR step and the Spark configuration are provided as separate JSON files. For some reason the name of my main class gets passed to my Spark jar as an unnecessary command-line argument, resulting in a failed job.
The AWS CLI command:
aws emr create-cluster \
    --name "Spark-Cluster" \
    --release-label emr-5.5.0 \
    --instance-groups \
        InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
        InstanceGroupType=CORE,InstanceCount=20,InstanceType=m3.xlarge \
    --applications Name=Spark \
    --use-default-roles \
    --configurations file://conf.json \
    --steps file://steps.json \
    --log-uri s3://blah/logs
The JSON file describing my EMR step (steps.json):
[
  {
    "Name": "RunEMRJob",
    "Jar": "s3://blah/blah.jar",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "Type": "CUSTOM_JAR",
    "MainClass": "blah.blah.MainClass",
    "Args": [
      "--arg1",
      "these",
      "--arg2",
      "get",
      "--arg3",
      "passed",
      "--arg4",
      "to",
      "--arg5",
      "spark",
      "--arg6",
      "main",
      "--arg7",
      "class"
    ]
  }
]
The argument parser in my main class throws an error (and prints the parameters provided):
Exception in thread "main" java.lang.IllegalArgumentException: One or more parameters are invalid or missing:
blah.blah.MainClass --arg1 these --arg2 get --arg3 passed --arg4 to --arg5 spark --arg6 main --arg7 class
So for some reason the main class I define in steps.json leaks into the command-line arguments I provide separately.
What's up?
I misunderstood how EMR steps work. There were two options for resolving this:
I could use Type = "CUSTOM_JAR" with Jar = "command-runner.jar" and add a normal spark-submit call to Args.
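A minimal sketch of what that steps.json could look like, reusing the jar, main class, and arguments from above (the cluster deploy mode here is my assumption, not something from my original command; any other spark-submit flags would go in the same place):

[
  {
    "Name": "RunEMRJob",
    "Type": "CUSTOM_JAR",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "Jar": "command-runner.jar",
    "Args": [
      "spark-submit",
      "--deploy-mode", "cluster",
      "--class", "blah.blah.MainClass",
      "s3://blah/blah.jar",
      "--arg1", "these",
      "--arg2", "get",
      "--arg3", "passed",
      "--arg4", "to",
      "--arg5", "spark",
      "--arg6", "main",
      "--arg7", "class"
    ]
  }
]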
Using Type = "Spark" simply adds the "spark-submit" call as the first argument, one still needs to provide a master, jar location, main class etc...
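Again a sketch under the same assumptions as above; EMR prepends spark-submit itself here, so it does not appear in Args:

[
  {
    "Name": "RunEMRJob",
    "Type": "Spark",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "Args": [
      "--deploy-mode", "cluster",
      "--class", "blah.blah.MainClass",
      "s3://blah/blah.jar",
      "--arg1", "these",
      "--arg2", "get",
      "--arg3", "passed",
      "--arg4", "to",
      "--arg5", "spark",
      "--arg6", "main",
      "--arg7", "class"
    ]
  }
]

With either form, everything after the application jar path is passed straight through to the main class, which is what I wanted in the first place.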