scala apache-spark bigdata spark-submit s3distcp

How can I execute an s3-dist-cp command within a spark-submit application?


I have a jar file that is passed to spark-submit. Within a method in that jar, I'm trying to run:

import sys.process._
"s3-dist-cp --src hdfs:///tasks/ --dest s3://<destination-bucket>".!

I also installed s3-dist-cp on all slaves as well as the master. The application starts and finishes without error, but it does not move the data to S3.


Solution

  • s3-dist-cp is now installed by default on the master node of an EMR cluster.

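    I was able to run s3-dist-cp successfully from within the spark-submit application when it was submitted in "client" mode. In client mode the driver runs on the machine where spark-submit is invoked, so when the job is submitted from the master node the driver can find the s3-dist-cp binary there.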
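
A minimal sketch of how the call might look from driver code, assuming the application is submitted in client mode from the master node and that s3-dist-cp is on the driver's PATH. The object name `S3DistCpRunner` and its methods are hypothetical, introduced only for illustration. Checking the exit code and forwarding the command's output makes a failed copy visible instead of letting the job finish silently.

```scala
import scala.sys.process._

// Hypothetical helper, not part of Spark or EMR.
object S3DistCpRunner {
  // Build the argument vector. Passing a Seq (rather than one string)
  // avoids whitespace and quoting problems in the paths.
  def command(src: String, dest: String): Seq[String] =
    Seq("s3-dist-cp", "--src", src, "--dest", dest)

  // Run the command on the local machine (the driver) and return its exit code.
  // stdout and stderr are forwarded so they appear in the driver's output.
  // Throws java.io.IOException if the s3-dist-cp binary cannot be found.
  def run(src: String, dest: String): Int = {
    val logger = ProcessLogger(
      out => println(out),
      err => Console.err.println(err)
    )
    Process(command(src, dest)).!(logger)
  }
}
```

In the driver, something like `val exit = S3DistCpRunner.run("hdfs:///tasks/", "s3://<destination-bucket>")` followed by `if (exit != 0) sys.error(s"s3-dist-cp failed with exit code $exit")` would stop the application when the copy fails.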