Tags: java, amazon-web-services, apache-spark, amazon-emr

Overriding default aws-sdk jar on AWS EMR master node


I'm running into a problem running my application on the EMR master node. It needs to access some AWS SDK methods added in version 1.11. All the required dependencies are bundled into a fat jar, and the application works as expected on my dev box.

However, when the app is executed on the EMR master node, it fails with a NoSuchMethodError exception when calling a method added in AWS SDK 1.11+, e.g.

java.lang.NoSuchMethodError:
 com.amazonaws.services.sqs.model.SendMessageRequest.withMessageDeduplicationId(Ljava/lang/String;)Lcom/amazonaws/services/sqs/model/SendMessageRequest;
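
For context, a call along these lines is enough to trigger that error when the older 1.10.x jar wins on the classpath; the queue URL and message values below are placeholders, not from the original setup:

    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
    import com.amazonaws.services.sqs.model.SendMessageRequest;

    public class SqsFifoSend {
        public static void main(String[] args) {
            AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();

            // withMessageDeduplicationId only exists in AWS SDK 1.11+ (FIFO queue support);
            // compiling against 1.11 but resolving the 1.10 jar at runtime
            // throws the NoSuchMethodError shown above.
            SendMessageRequest request = new SendMessageRequest()
                    .withQueueUrl("https://sqs.us-east-1.amazonaws.com/123456789012/example.fifo") // placeholder
                    .withMessageBody("hello")
                    .withMessageGroupId("group-1")
                    .withMessageDeduplicationId("dedup-1");

            sqs.sendMessage(request);
        }
    }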

I tracked it down to the classpath parameter passed to the JVM instance started by spark-submit:

-cp /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf/:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/lib/spark/conf/:/usr/lib/spark/jars/*:/etc/hadoop/conf/

In particular, it loads /usr/share/aws/aws-java-sdk/aws-java-sdk-sqs-1.10.75.1.jar instead of version 1.11.77 from my fat jar.
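
One quick way to confirm which jar a class is actually loaded from at runtime (a standard JVM diagnostic, not part of the original report) is to print its code source:

    import com.amazonaws.services.sqs.model.SendMessageRequest;

    public class WhichJar {
        public static void main(String[] args) {
            // On the EMR master this prints something like:
            // file:/usr/share/aws/aws-java-sdk/aws-java-sdk-sqs-1.10.75.1.jar
            System.out.println(SendMessageRequest.class
                    .getProtectionDomain()
                    .getCodeSource()
                    .getLocation());
        }
    }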

Is there a way to force Spark to use the AWS SDK version I need?


Solution

  • Here is what I learned trying to troubleshoot this.

    The default classpath parameter is constructed using the spark.driver.extraClassPath setting from /etc/spark/conf/spark-defaults.conf. spark.driver.extraClassPath contains a reference to the older version of the AWS SDK, located in /usr/share/aws/aws-java-sdk/*.

    To use the newer version of the AWS SDK, I uploaded the jars to a directory I created under the home directory and pointed the --driver-class-path spark-submit parameter at it (a full example invocation is sketched below):

    --driver-class-path '/home/hadoop/aws/*'
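
    For context, a full invocation might look roughly like this; the class name and application jar path are placeholders, and only the --driver-class-path part comes from the actual fix:

    spark-submit \
      --class com.example.MyApp \
      --driver-class-path '/home/hadoop/aws/*' \
      /home/hadoop/my-app-assembly.jar

    This works presumably because spark-submit command-line options take precedence over the values in spark-defaults.conf, so the directory holding the 1.11 jars replaces the default extraClassPath entry that pulled in the 1.10 SDK.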