scala · apache-spark · amazon-emr

Spark files not found in cluster deploy mode


I'm trying to run a Spark job in cluster deploy mode by issuing in the EMR cluster master node:

spark-submit --master yarn \
--deploy-mode cluster \
--files truststore.jks,kafka.properties,program.properties \
--class com.someOrg.somePackage.someClass s3://someBucket/someJar.jar kafka.properties program.properties

I'm getting the following error, which states that the file cannot be found in the Spark executor working directory:

//This is me printing the Spark executor working directory through SparkFiles.getRootDirectory()
20/07/03 17:53:40 INFO Program$: This is the path: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e
        
//This is me trying to list the contents of that working directory, which turns out to be empty.
20/07/03 17:53:40 INFO Program$: This is the content for the path:
                
//This is me getting the error:
    20/07/03 17:53:40 ERROR ApplicationMaster: User class threw exception: java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
                java.nio.file.NoSuchFileException: /mnt1/yarn/usercache/hadoop/appcache/application_1593796195404_0011/spark-46b7fe4d-ba16-452a-a5a7-fbbab740bf1e/userFiles-9c6d4cae-2261-43e8-8046-e49683f9fd3e/program.properties
                    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
                    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
                    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
                    at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
                    at java.nio.file.Files.newByteChannel(Files.java:361)
                    at java.nio.file.Files.newByteChannel(Files.java:407)
                    at java.nio.file.spi.FileSystemProvider.newInputStream(FileSystemProvider.java:384)
                    at java.nio.file.Files.newInputStream(Files.java:152)
                    at com.someOrg.somePackage.someHelpers$.loadPropertiesFromFile(Helpers.scala:142)
                    at com.someOrg.somePackage.someClass$.main(someClass.scala:33)
                    at com.someOrg.somePackage.someClass.main(someClass.scala)
                    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
                    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                    at java.lang.reflect.Method.invoke(Method.java:498)
                    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:685)

This is the function I use to attempt to read the properties files passed as arguments:

import java.nio.file.{Files, Paths, StandardOpenOption}
import java.util.Properties

def loadPropertiesFromFile(path: String): Properties = {
  val inputStream = Files.newInputStream(Paths.get(path), StandardOpenOption.READ)
  val properties  = new Properties()
  properties.load(inputStream)
  properties
}

Invoked as:

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val kafkaProperties = loadPropertiesFromFile(SparkFiles.get(args(1)))
val programProperties = loadPropertiesFromFile(SparkFiles.get(args(2)))
//Also tried passing the bare file names, i.e. loadPropertiesFromFile(args(1)) and loadPropertiesFromFile(args(2))

The program works as expected when submitted in client deploy mode:

spark-submit --master yarn \
--deploy-mode client \
--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
--files truststore.jks program.jar com.someOrg.somePackage.someClass kafka.properties program.properties

This happens in Spark 2.4.5 / EMR 5.30.1.

Additionally, when I try to configure this job as an EMR step, it does not even work in client mode. Any clue on how the resource files passed through the --files option are managed/persisted/made available in EMR?


Solution

  • Option 1: Put the files in S3 and pass the S3 paths to the job, reading them directly from S3 instead of relying on --files.
  • Option 2: Copy the files to a fixed location on each node (e.g. with an EMR bootstrap action) and pass the absolute file paths to the job.
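
Option 1 can be sketched as below. This is a minimal sketch, not the poster's code: the `loadProperties` helper mirrors their `loadPropertiesFromFile` but works on any `InputStream`, and the Hadoop `FileSystem` part (shown in the comment, since it needs the Spark/Hadoop classpath that is present on EMR nodes) opens the S3 object directly. The bucket and key names are placeholders.

```scala
import java.io.InputStream
import java.util.Properties

// Generic variant of the poster's helper: load Properties from any stream
// and make sure the stream is closed afterwards.
def loadProperties(in: InputStream): Properties = {
  val props = new Properties()
  try props.load(in) finally in.close()
  props
}

// Option 1: open the file straight from S3 via Hadoop's FileSystem API,
// which is available on the driver and executors on EMR. Sketch only —
// the URI below is a placeholder:
//
//   import org.apache.hadoop.conf.Configuration
//   import org.apache.hadoop.fs.Path
//
//   val path  = new Path("s3://someBucket/program.properties")
//   val fs    = path.getFileSystem(new Configuration())
//   val props = loadProperties(fs.open(path))
```

For Option 2, after a bootstrap action has copied the files to each node, the program arguments can simply be absolute paths (e.g. a hypothetical `/home/hadoop/conf/program.properties`) and the existing `loadPropertiesFromFile` will find them in both client and cluster mode, with no --files option needed.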