Tags: amazon-web-services, apache-spark, amazon-emr

AWS EMR Spark --properties-file Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found


I'm trying to submit a Spark application from the AWS EMR emr-5.20.0 master node with the following command:

spark-submit --executor-memory 4g --deploy-mode cluster --master yarn --class com.example.Application --properties-file config.conf s3://example-jobs/application.jar

but it fails with the following error:

Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found

The reason for this is the following parameter:

--properties-file config.conf

What am I doing wrong, and how do I properly pass a properties file to Spark on AWS EMR?


Solution

  • By passing --properties-file, you completely override the defaults that EMR provides in /etc/spark/conf/spark-defaults.conf. In particular, you drop the property that adds the EMRFS jar to the classpath, which is exactly what causes the error you're hitting (you can confirm which classpath entries you would be losing with the quick check at the end of this answer).

    Rather than supplying your own full properties file, you can configure Spark at cluster creation time by following https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html. Configuring Spark this way writes your values into /etc/spark/conf/spark-defaults.conf alongside those provided by EMR, with your values overriding any defaults EMR would otherwise set; a sketch of such a cluster-creation command appears at the end of this answer.

    For configuration that belongs to an individual application rather than the cluster as a whole, you can pass additional settings to spark-submit using something like the following (a concrete version appears at the end of this answer):

    spark-submit --conf KEY1=VALUE1 --conf KEY2=VALUE2 --executor-memory 4g --deploy-mode cluster --class ... <application-jar> [args]

    BTW, you don't need to specify --master yarn because this is already set in /etc/spark/conf/spark-defaults.conf. Also, the default executor memory in /etc/spark/conf/spark-defaults.conf is usually already around 4-5g, depending upon the instance types in your cluster.
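
    To see exactly which classpath defaults a custom properties file would discard, you can inspect the EMR-provided file on the master node. This is just a quick check; the paths it prints vary by EMR release:

    grep extraClassPath /etc/spark/conf/spark-defaults.conf

    The spark.driver.extraClassPath and spark.executor.extraClassPath values it shows include the EMRFS jars, which is why replacing the file wholesale produces the ClassNotFoundException above.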
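
    For the cluster-creation route, the approach described in the linked documentation uses the spark-defaults classification. A minimal sketch with the AWS CLI might look like the following; the instance settings and the two spark.* properties are placeholders for your own values, not recommendations:

    aws emr create-cluster \
      --release-label emr-5.20.0 \
      --applications Name=Spark \
      --instance-type m5.xlarge \
      --instance-count 3 \
      --use-default-roles \
      --configurations '[{"Classification":"spark-defaults","Properties":{"spark.executor.memory":"4g","spark.serializer":"org.apache.spark.serializer.KryoSerializer"}}]'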
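
    Applied to the original command, the per-application form could look like this; the two spark.* keys stand in for whatever was in config.conf and are purely illustrative. Note that --properties-file and --master yarn are gone, and --executor-memory is expressed as a --conf only for consistency:

    spark-submit \
      --conf spark.executor.memory=4g \
      --conf spark.dynamicAllocation.enabled=false \
      --deploy-mode cluster \
      --class com.example.Application \
      s3://example-jobs/application.jar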