Tags: amazon-web-services, apache-spark, emr, amazon-emr

How to create an EMR cluster on demand and execute an aws emr command?


I want to execute Spark jobs on demand: only when a trigger event arrives do I want to run a Spark job, using the inputs that come with that event. Since trigger events are infrequent, I do not want to use Spark Streaming. My goal is to deploy the tool on an AWS EMR cluster. I want to be able to create an EMR cluster on demand (driven by the trigger), execute the Spark job there, and then switch the cluster off. Is there a good example of how to handle these operations from Scala?


Solution

    • AWS Data Pipeline looks like a good fit for the problem you describe. It lets you connect a range of services within your AWS infrastructure, such as storage and processing.

    • You can create an EMR job using an EmrActivity in AWS Data Pipeline. The pipeline runs when a precondition is met or at a scheduled interval.

    • It will set up an EMR cluster with the specification you provide and run the Spark step you define.

    • The cluster can be terminated automatically when the job is completed.

    This question on SO will get you started.

    • You can also spin up an AWS Data Pipeline from this definition by creating the pipeline with the Choose a Template option, using the template shared above. If you would rather drive it programmatically, see the Scala sketch below.
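
    If you prefer to define and trigger the pipeline from Scala instead of the console template, a minimal sketch using the Data Pipeline client from the AWS SDK for Java (v1) might look like the following. The bucket URIs, jar path, instance types, EMR release label and IAM role names are placeholders for illustration, not values from the question; the field names mirror the EmrCluster and EmrActivity objects described above.

    import com.amazonaws.services.datapipeline.DataPipelineClientBuilder
    import com.amazonaws.services.datapipeline.model._
    import scala.collection.JavaConverters._

    object OnDemandSparkPipeline {

      // Small helper: build a pipeline object from (key, value) fields.
      // Values prefixed with "ref:" become reference fields (e.g. runsOn).
      private def pipelineObject(id: String, fields: (String, String)*): PipelineObject =
        new PipelineObject()
          .withId(id)
          .withName(id)
          .withFields(fields.map { case (key, value) =>
            if (value.startsWith("ref:"))
              new Field().withKey(key).withRefValue(value.stripPrefix("ref:"))
            else
              new Field().withKey(key).withStringValue(value)
          }.asJava)

      def main(args: Array[String]): Unit = {
        val client = DataPipelineClientBuilder.defaultClient()

        // 1. Create an empty pipeline shell.
        val pipelineId = client
          .createPipeline(new CreatePipelineRequest()
            .withName("on-demand-spark-job")
            .withUniqueId(s"on-demand-spark-job-${System.currentTimeMillis()}"))
          .getPipelineId

        // 2. Push the definition: an on-demand Default object, an EmrCluster
        //    resource, and an EmrActivity that runs spark-submit as an EMR step.
        val definition = Seq(
          pipelineObject("Default",
            "scheduleType"   -> "ondemand",
            "role"           -> "DataPipelineDefaultRole",         // placeholder IAM roles
            "resourceRole"   -> "DataPipelineDefaultResourceRole",
            "pipelineLogUri" -> "s3://my-bucket/pipeline-logs/"),  // placeholder bucket
          pipelineObject("SparkCluster",
            "type"               -> "EmrCluster",
            "releaseLabel"       -> "emr-5.29.0",                  // placeholder release and sizes
            "applications"       -> "spark",
            "masterInstanceType" -> "m5.xlarge",
            "coreInstanceType"   -> "m5.xlarge",
            "coreInstanceCount"  -> "2",
            "terminateAfter"     -> "2 Hours"),
          pipelineObject("RunSparkJob",
            "type"   -> "EmrActivity",
            "runsOn" -> "ref:SparkCluster",
            // EmrActivity steps are comma-separated: jar,arg1,arg2,...
            "step"   -> "command-runner.jar,spark-submit,--deploy-mode,cluster,s3://my-bucket/jars/my-spark-job.jar")
        )

        client.putPipelineDefinition(new PutPipelineDefinitionRequest()
          .withPipelineId(pipelineId)
          .withPipelineObjects(definition.asJava))

        // 3. Activate. With scheduleType "ondemand", every activation runs the
        //    pipeline once, so this is the call your trigger handler would make.
        client.activatePipeline(new ActivatePipelineRequest().withPipelineId(pipelineId))
      }
    }

    A variation is to create and define the pipeline once, and have your trigger handler only call activatePipeline; that keeps the per-event work down to starting the EMR cluster, which terminates on its own once the Spark step finishes.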