apache-spark, amazon-emr

spark-submit on AWS EMR runs but fails on accessing S3


I wrote a Spark application, compiled it into a .jar file, and I can use it fine from spark-shell --jars myApplication.jar running on my EMR cluster's master node:

scala> // pass in the existing spark context to the doSomething function, run with a particular argument.
scala> com.MyCompany.MyMainClass.doSomething(spark, "dataset1234")
...

Everything works fine like this. I also set up my main function so I can submit the application with spark-submit:

package com.MyCompany
import org.apache.spark.sql.SparkSession
object MyMainClass {
  val spark = SparkSession.builder()
    .master(("local[*]"))
    .appName("myApp")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    doSomething(spark, args(0))
  }
  
  // implementation of doSomething(...) omitted
}

With a very simple main method that just prints out the args, I confirmed that I can invoke it with spark-submit. However, when I submit my actual production job on the cluster, it fails. I submit it like this:

spark-submit --deploy-mode cluster --class com.MyCompany.MyMainClass s3://my-bucket/myApplication.jar dataset1234

In the console, I see a number of messages, including some warnings, but nothing particularly helpful:

20/11/28 19:28:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/28 19:28:47 WARN DependencyUtils: Skip remote jar s3://my-bucket/myApplication.jar.
20/11/28 19:28:47 INFO RMProxy: Connecting to ResourceManager at ip-xxx-xxx-xxx-xxx.region.compute.internal/172.31.31.156:8032
20/11/28 19:28:47 INFO Client: Requesting a new application from cluster with 20 NodeManagers
20/11/28 19:28:48 INFO Configuration: resource-types.xml not found
20/11/28 19:28:48 INFO ResourceUtils: Unable to find 'resource-types.xml'.
20/11/28 19:28:48 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (24576 MB per container)
20/11/28 19:28:48 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
20/11/28 19:28:48 INFO Client: Setting up container launch context for our AM
20/11/28 19:28:48 INFO Client: Setting up the launch environment for our AM container
20/11/28 19:28:48 INFO Client: Preparing resources for our AM container
20/11/28 19:28:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/11/28 19:28:51 INFO Client: Uploading resource file:/mnt/tmp/spark-e34d573d-8f23-403c-ac41-aa5154db8ecd/__spark_libs__8971082428743972083.zip -> hdfs://ip-xxx-xxx-xxx-xxx.region.compute.internal:8020/user/hadoop/.sparkStaging/application_1606587406989_0005/__spark_libs__8971082428743972083.zip
20/11/28 19:28:53 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
20/11/28 19:28:53 INFO Client: Uploading resource s3://my-bucket/myApplication.jar -> hdfs://ip-xxx-xxx-xxx-xxx.region.compute.internal:8020/user/hadoop/.sparkStaging/application_1606587406989_0005/myApplication.jar
20/11/28 19:28:54 INFO S3NativeFileSystem: Opening 's3://my-bucket/myApplication.jar' for reading
20/11/28 19:28:54 INFO Client: Uploading resource file:/mnt/tmp/spark-e34d573d-8f23-403c-ac41-aa5154db8ecd/__spark_conf__5385616689365996012.zip -> hdfs://ip-xxx-xxx-xxx-xxx.region.compute.internal:8020/user/hadoop/.sparkStaging/application_1606587406989_0005/__spark_conf__.zip
20/11/28 19:28:54 INFO SecurityManager: Changing view acls to: hadoop
20/11/28 19:28:54 INFO SecurityManager: Changing modify acls to: hadoop
20/11/28 19:28:54 INFO SecurityManager: Changing view acls groups to:
20/11/28 19:28:54 INFO SecurityManager: Changing modify acls groups to:
20/11/28 19:28:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
20/11/28 19:28:54 INFO Client: Submitting application application_1606587406989_0005 to ResourceManager
20/11/28 19:28:54 INFO YarnClientImpl: Submitted application application_1606587406989_0005
20/11/28 19:28:55 INFO Client: Application report for application_1606587406989_0005 (state: ACCEPTED)
20/11/28 19:28:55 INFO Client:
         client token: N/A

Then, once per second for several minutes (about six in this example), I get application reports with state: ACCEPTED, until the application finally fails with no useful information.

20/11/28 19:28:56 INFO Client: Application report for application_1606587406989_0005 (state: ACCEPTED)
...
... (lots of these messages)
...
20/11/28 19:31:55 INFO Client: Application report for application_1606587406989_0005 (state: ACCEPTED)
20/11/28 19:34:52 INFO Client: Application report for application_1606587406989_0005 (state: FAILED)
20/11/28 19:34:52 INFO Client:
         client token: N/A
         diagnostics: Application application_1606587406989_0005 failed 2 times due to AM Container for appattempt_1606587406989_0005_000002 exited with  exitCode: 13
Failing this attempt.Diagnostics: [2020-11-28 19:32:24.087]Exception from container-launch.
Container id: container_1606587406989_0005_02_000001
Exit code: 13

[2020-11-28 19:32:24.117]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
elled)
20/11/28 19:32:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/11/28 19:32:22 WARN TaskSetManager: Lost task 15.0 in stage 1.0 (TID 135, ip-xxx-xxx-xxx-xxx.region.compute.internal, executor driver): TaskKilled (Stage cancelled)

Eventually, the logs will indicate:

org.apache.spark.sql.AnalysisException: Path does not exist: s3://my-bucket/dataset1234.parquet;
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:759)

My application first creates this file; if the write fails, it silently ignores the failure and continues (in case the job has already run once and is executed again, so an attempt to overwrite an existing file doesn't abort it). The second part then reads this file back and does some additional work (a rough sketch of this flow is below). So what I know from this error message is that my application is running and got past the first part, but apparently Spark isn't able to write files out to S3. From the second log message above (the "Skip remote jar" warning) it also seems(?) that Spark couldn't download the remote jar file from S3. (I did happen to copy the jar into ~hadoop/ before running spark-submit, though I don't know whether it failed to download from S3 and fell back to that local copy.)
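In rough outline (heavily simplified; buildDataset below is just a stand-in for the real logic, which is omitted), the flow looks like this:

import org.apache.spark.sql.{DataFrame, SparkSession}

// placeholder for the real dataset-building logic (not the actual code)
def buildDataset(spark: SparkSession): DataFrame = spark.range(10).toDF("id")

def doSomething(spark: SparkSession, name: String): Unit = {
  val path = s"s3://my-bucket/$name.parquet"

  // Part 1: write the dataset; a failure here is swallowed on purpose so that
  // a re-run against an already-existing file doesn't abort the job.
  try {
    buildDataset(spark).write.parquet(path)
  } catch {
    case _: Exception => () // silently ignore, as described above
  }

  // Part 2: read the file back and do more work; this is where
  // "Path does not exist" surfaces if part 1 never produced the file.
  val df = spark.read.parquet(path)
  // ... additional work on df ...
}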

I got my spark-submit command by checking what the EMR AWS CLI export showed for the step I created in the web interface. Is this an issue with EMR somehow not having S3 permissions? That seems unlikely, but what else could be the problem here? The job is certainly running, and it apparently can read from my bucket (it correctly determines that the file doesn't exist), but it wasn't able to create the file.

How can I get better debug info on this? Is there a way I can ensure proper EMR<-->S3 permissions?
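
(For what it's worth, one crude check I was considering running from spark-shell on the master node, to see whether the cluster's role can actually write to the bucket. This is only a sketch, and _probe is a made-up temporary key.)

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Rough write-permission probe; "_probe" is a made-up key.
val fs    = FileSystem.get(new URI("s3://my-bucket"), spark.sparkContext.hadoopConfiguration)
val probe = new Path("s3://my-bucket/_probe")
val out   = fs.create(probe)
out.close()              // if write access is denied, the error may only surface here
fs.delete(probe, false)  // clean up the probe object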


Solution

  • Get rid of this .master(("local[*]")). Your master should not be local when running on the cluster and accessing S3 files; let spark-submit/YARN supply it (see the sketch below).
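
For example, a minimal sketch of the driver object with the hard-coded master removed (the master then comes from spark-submit or from EMR's YARN defaults; this shows the shape, not your exact code):

package com.MyCompany

import org.apache.spark.sql.SparkSession

object MyMainClass {

  def main(args: Array[String]): Unit = {
    // No .master(...) here: on EMR, spark-submit / YARN decides where to run.
    val spark = SparkSession.builder()
      .appName("myApp")
      .getOrCreate()

    doSomething(spark, args(0))
  }

  // implementation of doSomething(...) omitted, as in the question
  def doSomething(spark: SparkSession, dataset: String): Unit = ???
}

With --deploy-mode cluster, a hard-coded local[*] conflicts with YARN, and an AM exiting with code 13, as in your diagnostics, is a classic symptom of that mismatch. Once the master comes from YARN, the job runs with the cluster's EMRFS configuration and instance-profile credentials, which is normally what grants S3 access on EMR.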