apache-spark classpath emr amazon-emr

Why doesn't Spark on AWS EMR load classes from the application fat jar?


My Spark application fails to run on an AWS EMR cluster. I noticed that this is because some classes are loaded from the classpath set by EMR and not from my application jar. For example:

java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
        at com.sksamuel.avro4s.SchemaFor$.fieldBuilder(SchemaFor.scala:424)
        at com.sksamuel.avro4s.SchemaFor$.fieldBuilder(SchemaFor.scala:406)

Here org.apache.avro.Schema is loaded from "jar:file:/usr/lib/spark/jars/avro-1.7.7.jar!/org/apache/avro/Schema.class"
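
One way to confirm which jar supplied a class is to ask the JVM for the class's code source. The following is a minimal sketch using only the JDK; `ClassOrigin` and `locate` are hypothetical names, and `String` and the helper class itself are stand-ins. Inside the Spark job you would pass `org.apache.avro.Schema.class` instead.

```java
import java.security.CodeSource;

final class ClassOrigin {
    /**
     * Returns the location (jar or directory URL) a class was loaded from,
     * or "bootstrap" when the JVM reports no code source (JDK core classes).
     */
    static String locate(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "bootstrap";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Stand-ins for illustration; replace with the class you are debugging.
        System.out.println("String      -> " + locate(String.class));
        System.out.println("ClassOrigin -> " + locate(ClassOrigin.class));
    }
}
```

Logging this value from both the driver and an executor task should show whether the two sides resolve the class from the same jar.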

However, com.sksamuel.avro4s depends on Avro 1.8.1. My application is built as a fat jar that includes Avro 1.8.1. Why isn't that version loaded, instead of 1.7.7 from the classpath set by EMR?

This is just one example. I see the same with other libraries I include in my application. Maybe Spark depends on 1.7.7 and I'd have to shade when including other dependencies. But why are the classes included in my app jar not loaded first?


Solution

  • After a bit of reading I realized that this is how class loading works in Spark: by default, classes already on Spark's own classpath take precedence over those in the application jar. There is a hook to change this behavior, spark.executor.userClassPathFirst (with a matching spark.driver.userClassPathFirst for the driver). It didn't quite work when I tried it, and it's marked as experimental. I guess the best way to proceed is to shade the conflicting dependencies. Given the number of libraries Spark and its components pull in, this might mean quite a lot of shading for complicated Spark apps.
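
To decide which packages actually need shading, it can help to list every copy of a class that is visible on the classpath. The sketch below is one way to do that with the JDK alone; `DuplicateClassFinder` and `copiesOf` are hypothetical names, and the default argument `java.lang.String` is only a placeholder. This is a diagnostic aid under those assumptions, not something Spark or EMR provides.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class DuplicateClassFinder {
    /** Lists every classpath location that contains the named class. */
    static List<URL> copiesOf(String className) {
        // "org.apache.avro.Schema" -> "org/apache/avro/Schema.class"
        String resource = className.replace('.', '/') + ".class";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return new ArrayList<>(Collections.list(loader.getResources(resource)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass a fully qualified class name, e.g. org.apache.avro.Schema.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        for (URL url : copiesOf(name)) {
            System.out.println(url);
        }
    }
}
```

If a class shows up in more than one location, for example in both a jar under /usr/lib/spark/jars and the application fat jar, its package is a candidate for shading.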