amazon-web-services, amazon-emr

Run the cluster steps for file upload on EMR


I have an EMR cluster with a number of steps. I am trying to analyze log data that comes in every week, and I want to run the same steps every week on the appended data.

Long-running cluster:

  1. Load the log file from the data source (on subsequent runs, load or copy the records from the log file)
  2. Analyze data
  3. Return data to the destination

How can I run the same steps every week on the cluster?

Or do I need to spin up a new cluster every week?

It would also be great to get some guidance on what type of data source can handle a huge volume of data in this scenario.


Solution

  • You can submit new steps to a running cluster with the add-steps command (aws emr add-steps; see the AWS CLI Command Reference).

    So you need a scheduled job, such as a cron job, that calls add-steps against the cluster each week. You could create the cron job on the master node, or use one of the many Hadoop tools that schedule and orchestrate jobs.

    You certainly do not need a new cluster, since you already have one running.
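
The weekly submission could be sketched in Python with boto3, whose EMR client exposes add_job_flow_steps as the counterpart of the CLI add-steps command. This is only a sketch under assumptions: the bucket name, the script names, the --run-date argument, the region, and the use of spark-submit are all hypothetical placeholders, since the question does not say how the three stages are implemented.

```python
from datetime import date


def build_weekly_steps(run_date, bucket="my-log-bucket"):
    """Build EMR step definitions for one weekly run.

    The bucket, the script names and the --run-date flag are hypothetical;
    replace them with whatever your load/analyze/export jobs actually use.
    """
    week = run_date.strftime("%Y-%m-%d")
    # One step per stage from the question: load, analyze, return results.
    stages = [
        ("load", "load_logs.py"),
        ("analyze", "analyze_logs.py"),
        ("export", "export_results.py"),
    ]
    return [
        {
            "Name": f"{stage}-{week}",
            # On failure, cancel the remaining steps but keep the cluster running.
            "ActionOnFailure": "CANCEL_AND_WAIT",
            "HadoopJarStep": {
                # command-runner.jar runs a command on the cluster (EMR 4.x and later).
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",  # assumption: the stages are Spark scripts
                    f"s3://{bucket}/scripts/{script}",
                    "--run-date",
                    week,
                ],
            },
        }
        for stage, script in stages
    ]


def submit_weekly_steps(cluster_id, run_date, region="us-east-1"):
    """Add this week's steps to an already running cluster and return the step IDs."""
    import boto3  # imported here so the step builder can be used without boto3

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=build_weekly_steps(run_date),
    )
    return response["StepIds"]
```

A small script that calls submit_weekly_steps with your cluster ID and date.today() could then be scheduled with a crontab entry such as "0 3 * * 1 python3 /home/hadoop/submit_weekly_steps.py" (every Monday at 03:00; the path is again hypothetical). Steps added this way run in the order they are submitted, so the three stages stay in sequence.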