I have a Spark job that uses some external libraries. When I run the job locally through the main method from IntelliJ, it runs without any issues. However, when I assemble the job into a jar file (I create an uber JAR using sbt) and try to run it on EMR, it throws a `ClassNotFoundException`.
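For context, the assembly setup is nothing exotic; it is essentially the stock sbt-assembly configuration. A sketch with illustrative values (the merge strategy is the usual one for deduplicating files pulled in by transitive dependencies):

```scala
// build.sbt -- roughly our setup; requires the sbt-assembly plugin
name := "my-app"
version := "1.0-SNAPSHOT"
scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"

// Resolve duplicate files brought in by transitive dependencies,
// otherwise `sbt assembly` fails with deduplicate errors
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case _                             => MergeStrategy.first
}
```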
I have checked that the class is indeed inside the jar file, so it should be available to the job. I have also tried the spark-submit options `spark.driver.extraClassPath`, `spark.driver.extraLibraryPath`, `spark.executor.extraClassPath`, and `spark.executor.extraLibraryPath`, as well as `spark.driver.userClassPathFirst` and `spark.executor.userClassPathFirst`. I also tried calling `sparkContext.addJar("/mnt/jars/myJar")` in the code, as shown below. None of these worked for me.
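For reference, this is roughly how I wired those settings up (paths are illustrative; I passed the same keys via `spark-submit --conf` as well, since the `spark.driver.*` options only take effect when set at submit time, before the driver JVM starts):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Roughly the combinations I tried; none of them resolved the error
val conf = new SparkConf()
  .setAppName("my-app")
  .set("spark.driver.extraClassPath", "/mnt/jars/myJar")
  .set("spark.executor.extraClassPath", "/mnt/jars/myJar")
  .set("spark.driver.userClassPathFirst", "true")
  .set("spark.executor.userClassPathFirst", "true")

val sc = new SparkContext(conf)
sc.addJar("/mnt/jars/myJar") // also tried distributing the jar explicitly
```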
Also, when running on EMR I can see in the log that the JAR was added (I am not sure whether it is actually loaded on the classpath, but it should be, because other classes are being loaded properly):

```
15/11/02 04:10:26 INFO SparkContext: Added JAR file:///mnt/my-app-1.0-SNAPSHOT.jar at http://172.31.42.244:44471/jars/my-app-1.0-SNAPSHOT.jar with timestamp 1446437426661
```
I am running out of ideas about what else to try. I have been researching and see a few tickets on the Spark JIRA board, but nothing similar to my issue.
I am running on EMR release-label 4.1.0 (Spark 1.5.0), Java 7, sbt 0.13.7 and Scala 2.10.5.
It turned out to be a problem with `SerializationUtils` from Apache Commons Lang. There is an open issue where the class throws a `ClassNotFoundException` even when the class is on the classpath, in a multiple-classloader environment such as Spark's: https://issues.apache.org/jira/browse/LANG-1049

We moved away from the library, and our Spark job is now working fine. In the end, the issue was not related to Spark at all.
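For anyone who cannot drop the dependency: the underlying problem is that the plain `ObjectInputStream` used by `SerializationUtils.deserialize` resolves classes against the classloader that loaded commons-lang, which on a Spark executor is not the loader that holds your user classes. A deserializer that goes through the thread context classloader should sidestep the bug. A minimal sketch (helper names are mine, not from the library):

```scala
import java.io.{ByteArrayInputStream, InputStream, ObjectInputStream, ObjectStreamClass}

// ObjectInputStream that resolves classes via the current thread's
// context classloader rather than the loader that loaded this class
class ContextClassLoaderObjectInputStream(in: InputStream)
    extends ObjectInputStream(in) {
  override def resolveClass(desc: ObjectStreamClass): Class[_] =
    Class.forName(desc.getName, false,
      Thread.currentThread().getContextClassLoader)
}

// Drop-in replacement for SerializationUtils.deserialize
def deserialize[T](bytes: Array[Byte]): T = {
  val in = new ContextClassLoaderObjectInputStream(new ByteArrayInputStream(bytes))
  try in.readObject().asInstanceOf[T]
  finally in.close()
}
```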