I have an EMR cluster with a number of steps. I am trying to analyze log data that arrives every week, and I want to run the same steps each week on the appended data.
Long-running cluster:
data source (load or copy records from the log file on each subsequent run)

How can I run the same steps every week on the cluster? Or do I need to spin up a new cluster every week?
It would also be great to get some guidance on the type of data source that can handle huge amounts of data in such a scenario.
You can submit new steps to a running cluster by calling add-steps (see the AWS CLI Command Reference).
Thus, you would need a cron job somewhere that calls add-steps against the cluster. You could create the cron job on the master node, or use one of the myriad Hadoop tools that can schedule and orchestrate jobs.
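As a minimal sketch of the cron approach: a small script that submits the weekly step via the AWS CLI, scheduled by a crontab entry. The cluster ID, script name, and S3 paths below are all placeholders, not values from your setup.

```shell
#!/bin/bash
# run_weekly_steps.sh -- submit the weekly analysis step to the existing
# EMR cluster. All IDs and S3 paths are placeholders; substitute your own.
#
# Example crontab entry (runs every Monday at 02:00):
#   0 2 * * 1 /home/hadoop/run_weekly_steps.sh >> /tmp/weekly_steps.log 2>&1

aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=WeeklyLogAnalysis,ActionOnFailure=CONTINUE,Args=[s3://my-bucket/scripts/analyze_logs.py,--input,s3://my-bucket/logs/]'
```

Because each run reads from the same input location, simply appending the new week's log files to that S3 prefix before the job fires is enough for the step to pick them up.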
You certainly do not need to spin up a new cluster each week, since you already have a cluster operating.