I have a fat jar, written in Scala, packaged by sbt. I need to use it in a Spark cluster in AWS EMR.
It functions fine if I manually spin up the cluster, copy the jar to the master and run a spark-submit job using a command like this...
spark-submit --class org.company.platform.package.SparkSubmit --name platform ./platform-assembly-0.1.0.jar arg0 arg1 arg2
But... if I try to add it as a step to the EMR cluster, it fails. The log to stderr looks like this...
Exception in thread "main" java.lang.ClassNotFoundException: package.SparkSubmit
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:278)
at org.apache.hadoop.util.RunJar.run(RunJar.java:214)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
The relevant settings in my build.sbt look like this...
lazy val root = (project in file(".")).
  settings(
    name := "platform",
    version := "0.1.0",
    scalaVersion := "2.10.5",
    organization := "org.company",
    mainClass in Compile := Some("package/SparkSubmit")
  )
The corresponding file with my MainClass looks like...
package org.company.platform.package

object SparkSubmit {
  def main(args: Array[String]): Unit = {
    // do stuff
  }
}
In EMR Console... in the "Add Step" dialogue... next to the "Arguments" box, it says...
"These are passed to the main function in the JAR. If the JAR does not specify a main class in its manifest file you can specify another class name as the first argument."
I'd think that because I DO specify a main class in build.sbt, I'd be fine... but the step fails without logging anything about the failure. If I try to specify the main class as the first arg, it logs the failure I posted above.
I think it's probably a formatting problem, but I can't sort out how to fix it, and no examples turn up. I've tried submitting the following as args in the "Add Step" dialog...
arg0 arg1 arg2
package.SparkSubmit arg0 arg1 arg2
package/SparkSubmit arg0 arg1 arg2
org.company.platform.package.SparkSubmit arg0 arg1 arg2
A few others too, but nothing works.
Version info: EMR 4.3, Spark 1.6, Scala 2.10, sbt 0.13.9
Any ideas what dumb mistake I'm making that's not letting EMR/Spark find my main class?
Thanks.
EDIT - I got this to "work" by making problems 1-6 below go away, but then the cluster just sat there saying it was "running" the first step and never finished. It turned out I had mistakenly set the step type to "Custom JAR" instead of "Spark application". After switching that, I think only the fix for Problem 1 was still relevant, and that alone may have fixed my problem. I had to back out the fixes for problems 2, 3 and 5 below to get it working with "Spark application" steps, and I suspect I can back out the rest of them as well. END EDIT
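If you're adding the step from a script instead of the console, here's a rough sketch of what the equivalent "Spark application" step looks like via the AWS CLI. The cluster id and S3 path are made-up placeholders; only the class name and jar name come from my setup above.
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name=platform,ActionOnFailure=CONTINUE,Args=[--class,org.company.platform.package.SparkSubmit,s3://my-bucket/platform-assembly-0.1.0.jar,arg0,arg1,arg2]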
I spent a long time getting this to work. I'll post the errors and fixes sequentially in case it's useful for someone else down the road.
Problem 1
No matter what I passed in as the first arg to try to point at the main class, I got the same error. The problem was in my build.sbt. I (wrongly) thought that organization and name in root were enough to provide the package prefix.
I changed mainClass in build.sbt to match my declared package at the top of the file with my SparkSubmit object in it...
mainClass in Compile := Some("org.company.platform.package.SparkSubmit")
then in the "Add Step" dialog, I just passed in the args, no class designation... so just "arg0 arg1 arg2".
Interesting reference if you want to set a different main class in the manifest vs. at run time: How to set main class in build?
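For reference, here's a sketch of the whole settings block with that one line fixed (everything else is unchanged from the build.sbt posted in the question):
lazy val root = (project in file(".")).
  settings(
    name := "platform",
    version := "0.1.0",
    scalaVersion := "2.10.5",
    organization := "org.company",
    // must be the fully qualified object name as declared in the source file,
    // dot-separated, not a path
    mainClass in Compile := Some("org.company.platform.package.SparkSubmit")
  )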
Problem 2
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
I found this reference... https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
I didn't know which one to use, but since EMR uses YARN, I set it to "yarn". This was wrong (I'm leaving it in as a record of the subsequent error it generated). In SparkSubmit.main(), I set the master URL like this...
val conf =
  new SparkConf()
    .setMaster("yarn")
    .setAppName("platform")
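In hindsight, that same submitting-applications page also shows you can pass the master on the command line instead of hardcoding it in the jar. A sketch of what that would look like for my manual spark-submit above, using "yarn-cluster", which is one of the YARN forms Spark 1.6 accepts:
spark-submit --class org.company.platform.package.SparkSubmit \
  --master yarn-cluster \
  --name platform \
  ./platform-assembly-0.1.0.jar arg0 arg1 arg2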
Problem 3
The master URL error went away, and now this was my error...
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
In my build.sbt, I had spark-core and spark-sql marked as "provided" in libraryDependencies. I have no idea why this didn't work as an EMR step, since the cluster has Spark loaded... but I removed "provided" and changed the dependencies to...
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.0", // % "provided",
  "org.apache.spark" %% "spark-sql" % "1.6.0", // % "provided",
  ...
)
Note - after removing "provided" I got a new error, but changing the versions of spark-core and spark-sql to 1.6.0 to match EMR 4.3 made that go away.
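Per the EDIT above, once the step type is "Spark application" the cluster's own Spark jars are on the classpath, so the usual "provided" form works again. A sketch of how that part of my build.sbt looks with this fix backed out:
libraryDependencies ++= Seq(
  // "provided" keeps Spark out of the fat jar; the cluster's spark-submit
  // supplies these classes at runtime
  "org.apache.spark" %% "spark-core" % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-sql" % "1.6.0" % "provided"
  // ...other (non-Spark) dependencies unchanged
)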
Problem solved... new one created!
Problem 4
Exception in thread "main" com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.version'
The answer was here... https://doc.akka.io/docs/akka/snapshot/general/configuration.html#when-using-jarjar-onejar-assembly-or-any-jar-bundler
Basically, Akka's reference.conf was getting lost. My build.sbt mergeStrategy looked like this...
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case _ => MergeStrategy.first
  }
}
I modified it to look like this instead...
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
  {
    case "reference.conf" => MergeStrategy.concat
    case "application.conf" => MergeStrategy.concat
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case _ => MergeStrategy.first
  }
}
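If you're on a newer sbt-assembly (0.12+), the same merge rules with the non-deprecated key would look roughly like this (a sketch; I kept the <<= form above since that's what I had working on sbt 0.13.9):
assemblyMergeStrategy in assembly := {
  case "reference.conf" => MergeStrategy.concat
  case "application.conf" => MergeStrategy.concat
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _ => MergeStrategy.first
}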
Problem 5
I guess "yarn" wasn't the right choice in problem 2. I got this error...
Exception in thread "main" org.apache.spark.SparkException: Could not parse Master URL: 'yarn'
I changed the url to "local[2]"...
val conf =
  new SparkConf()
    .setMaster("local[2]")
    .setAppName("starling_for_mongo")
I had no real reason for that value... I'm not sure how many threads I actually need, or where this setting even gets applied... on the master, or in some VM somewhere? I need to understand this more; I just copied what was here: https://spark.apache.org/docs/1.6.1/configuration.html#spark-properties
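As far as I can tell, the Problem 5 error happened because SparkContext in Spark 1.6 only understands "yarn-client" / "yarn-cluster", not plain "yarn". And per the EDIT above, the cleaner fix for a "Spark application" step is to not call setMaster at all: the Spark configuration docs say properties set directly on the SparkConf take precedence over spark-submit flags, so leaving the master out of the code lets the step's own --master setting apply. A sketch of that:
import org.apache.spark.{SparkConf, SparkContext}

// No setMaster: the EMR "Spark application" step invokes spark-submit with
// --master yarn, and that only takes effect if the code doesn't override it.
val conf = new SparkConf().setAppName("starling_for_mongo")
val sc = new SparkContext(conf)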
Problem 6
Next came a lot of serialization errors. I don't understand why, since all of this code ran without any problems as a manual spark-submit and in spark-shell. I fixed it by essentially going through and making every class extend Serializable (a sketch of the idea is below).
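For example, a minimal sketch of the kind of change I mean (the class and values are made up purely for illustration): any class whose instances get captured by an RDD closure has to be serializable, because Spark ships those closures to the executors.
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper captured inside a map() closure. Spark serializes the
// closure (and everything it references) to send it to executors, so the
// class has to extend Serializable.
class Normalizer(factor: Double) extends Serializable {
  def apply(x: Double): Double = x / factor
}

val sc = new SparkContext(new SparkConf().setAppName("platform"))
val norm = new Normalizer(10.0)
val scaled = sc.parallelize(Seq(1.0, 2.0, 3.0)).map(norm(_)).collect()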
The End
That was my journey getting a working jar written in scala and compiled with sbt to function as a step in an EMR spark cluster. I hope this helps someone else out there.