Tags: scala, amazon-web-services, apache-spark, emr

Why does a Spark job fail to find classes inside the uber-JAR on EMR, while it works fine locally?


I have a Spark job that uses some external libraries. When I run the job locally through the main method from IntelliJ, it runs without any issues. However, when I assemble the job into a JAR file (I create an uber-JAR using sbt) and try to run it on EMR, it throws a ClassNotFoundException.
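For reference, the build looks roughly like the following sketch (assuming the sbt-assembly plugin; the names and versions match the environment listed at the end of the question):

    // build.sbt -- a minimal sketch; assumes sbt-assembly is enabled, e.g. via
    // project/plugins.sbt:
    //   addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")
    name := "my-app"

    version := "1.0-SNAPSHOT"

    scalaVersion := "2.10.5"

    // Spark itself is marked "provided" so it is not bundled into the uber-JAR;
    // EMR supplies Spark at runtime. The external libraries stay at the default
    // "compile" scope and therefore end up inside the assembled JAR.
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"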

I have checked that the class is indeed inside the JAR file, so it should be available to the job. I have also tried the spark-submit options spark.driver.extraClassPath, spark.driver.extraLibraryPath, spark.executor.extraClassPath, and spark.executor.extraLibraryPath, as well as spark.driver.userClassPathFirst and spark.executor.userClassPathFirst. I also tried calling sparkContext.addJar("/mnt/jars/myJar") in the code. None of these worked for me.
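Programmatically, the attempts looked roughly like this (a sketch equivalent to passing the options via --conf to spark-submit):

    // Sketch of the attempted workarounds, using the Spark 1.5 API.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")
      // Equivalent to --conf spark.driver.userClassPathFirst=true etc.
      .set("spark.driver.userClassPathFirst", "true")
      .set("spark.executor.userClassPathFirst", "true")

    val sc = new SparkContext(conf)

    // Also tried: registering the JAR with the running context explicitly.
    sc.addJar("/mnt/jars/myJar")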

Also, when running on EMR, I can see a log line saying that the JAR was added (I am not sure whether it is actually loaded on the classpath, but it should be, because other classes are being loaded properly):

15/11/02 04:10:26 INFO SparkContext: Added JAR file:///mnt/my-app-1.0-SNAPSHOT.jar at http://172.31.42.244:44471/jars/my-app-1.0-SNAPSHOT.jar with timestamp 1446437426661

I am running out of ideas about what else to try. I have been researching and I have seen a few tickets on the Spark JIRA board, but nothing similar to my issue.

I am running on EMR release-label 4.1.0 (Spark 1.5.0), Java 7, sbt 0.13.7 and Scala 2.10.5.


Solution

  • It turned out to be a problem with SerializationUtils from Apache Commons Lang. There is an open issue where the class throws a ClassNotFoundException even when the class being deserialized is on the classpath, in environments with multiple classloaders: https://issues.apache.org/jira/browse/LANG-1049

    We moved away from the library, and our Spark job is working fine now. In the end, the issue was not related to Spark at all. For anyone who cannot drop the library, a workaround is sketched below.
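    A common workaround for LANG-1049 (a sketch, not what we shipped) is to replace SerializationUtils.deserialize with an ObjectInputStream that resolves classes against the thread's context classloader:

        import java.io.{ByteArrayInputStream, ObjectInputStream, ObjectStreamClass}

        // Classloader-aware deserialization: ObjectInputStream normally resolves
        // classes with the loader that loaded the stream class itself, which is
        // the root cause of the ClassNotFoundException in multi-classloader
        // setups such as Spark.
        def deserialize[T](bytes: Array[Byte]): T = {
          val loader = Thread.currentThread().getContextClassLoader
          val in = new ObjectInputStream(new ByteArrayInputStream(bytes)) {
            override def resolveClass(desc: ObjectStreamClass): Class[_] =
              try Class.forName(desc.getName, false, loader)
              catch { case _: ClassNotFoundException => super.resolveClass(desc) }
          }
          try in.readObject().asInstanceOf[T]
          finally in.close()
        }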