Tags: scala, apache-spark, amazon-emr, spark-submit

How to use spark-submit to run a Scala file present on an EMR cluster's master node?


So, I connect to my EMR cluster's master node over SSH. This is the file structure on the master node:

|-- AnalysisRunner.scala
|-- AutomatedConstraints.scala
|-- deequ-1.0.1.jar
|-- new
|   |-- Auto.scala
|   `-- Veri.scala
|-- VerificationConstraints.scala
`-- wget-log

Now, I would first run spark-shell --conf spark.jars=deequ-1.0.1.jar

And once I got to the Scala prompt, I would use :load new/Auto.scala to run my Scala script.
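
For context, a script loaded this way relies on the spark session that spark-shell already provides. A simplified, hypothetical example of what such a file could look like (the input path, column names and checks here are placeholders, not the actual contents of new/Auto.scala):

import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel}

// 'spark' is the SparkSession created by spark-shell
val df = spark.read.parquet("s3://my-bucket/input/")

val result = VerificationSuite()
  .onData(df)
  .addCheck(
    Check(CheckLevel.Error, "basic checks")
      .isComplete("id")
      .isNonNegative("amount"))
  .run()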

WHAT I WANT TO DO:

While on my EMR cluster's master node, I would like to run a single spark-submit command that achieves exactly what I was doing above.

I'm new to this, so can anyone help me with the command?


Solution

  • For any beginner who might be stuck here:

    You will need to have an IDE (I used IntelliJ IDEA). Steps to follow:

    1. Create a Scala project and declare all the dependencies you need in the build.sbt file (a minimal sketch follows this list).
    2. Create a package (say 'pkg') and under it create a scala object (say 'obj').
    3. Define a main method in your scala object and write your logic.
    4. Package the project into a single .jar file (use your IDE's build tools or run 'sbt package' in the project directory).
    5. Submit using the following command:
    spark-submit --class pkg.obj \
    --jars <path to your dependencies (if any)> \
    <path to the jar created from your code> \
    <command line arguments (if any)>
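
    As a rough sketch of steps 1-4, a minimal build.sbt and Scala object might look like the following. The project name, versions, paths and column names are placeholders; match scalaVersion and the Spark version to what your EMR release ships (deequ-1.0.1 targets Spark 2.x / Scala 2.11):

    name := "my-deequ-job"
    version := "0.1"
    scalaVersion := "2.11.12"

    // "provided": the EMR cluster supplies Spark at runtime,
    // and deequ-1.0.1.jar is passed separately via --jars
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql" % "2.4.4" % "provided",
      "com.amazon.deequ" % "deequ" % "1.0.1" % "provided"
    )

    // src/main/scala/pkg/obj.scala
    package pkg

    import org.apache.spark.sql.SparkSession
    import com.amazon.deequ.VerificationSuite
    import com.amazon.deequ.checks.{Check, CheckLevel}

    object obj {
      def main(args: Array[String]): Unit = {
        // Master and deploy mode come from spark-submit, so nothing is hard-coded here
        val spark = SparkSession.builder().appName("AutomatedConstraints").getOrCreate()

        // args(0) is assumed to be the input path passed on the command line
        val df = spark.read.parquet(args(0))

        val result = VerificationSuite()
          .onData(df)
          .addCheck(Check(CheckLevel.Error, "basic checks").isComplete("id"))
          .run()

        spark.stop()
      }
    }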
    

    This worked for me. Note: if you are running this on an EMR cluster, make sure every path is specified as either of the following (see the example after this list)

    1. a path on the cluster's local filesystem, or
    2. an S3 path
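
    Putting it together for the layout in the question, the call might look like this (the home directory, jar name and input path are placeholders; sbt package writes the jar under target/scala-2.11/):

    spark-submit --class pkg.obj \
      --jars /home/hadoop/deequ-1.0.1.jar \
      /home/hadoop/my-deequ-job_2.11-0.1.jar \
      s3://my-bucket/input/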