hadoop, amazon-web-services, emr

Clean AWS EMR to allow reuse


I have several tasks that I'm performing on AWS EMR clusters. They don't share data, and I would like to use the same cluster to run them one after another. Is there a way to reset a running EMR cluster to its initial state (remove Hive tables, clean all HDFS files, etc.) to avoid data collisions?

I want to reuse the EMR cluster for several reasons:

  1. Creating a new EMR cluster can take 5-10 minutes.
  2. My tasks are relatively short, 20-25 minutes each.
  3. Once an EMR cluster is created, you are already paying for the full hour.

Solution

  • We didn't find a "quick and clean" API to achieve this behaviour. Instead, we settled on a simple working convention that guarantees we can clean up all the data:

    • We work in a dedicated Hive database instead of the default one.
    • We put all our internal data files under one dedicated location in HDFS.

    So every time a task starts, it first drops that database if it exists and recreates it, then recursively deletes all the data under the dedicated HDFS location.
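
    The cleanup step described above could be sketched roughly as follows. This is only an illustration, not the exact script we run: it assumes the `hive` and `hdfs` command-line tools are on the PATH of the master node, and the function names, the database name and the HDFS path are hypothetical placeholders. Recreating the empty directory at the end is an extra convenience that is not part of the convention itself.

```python
import re
import subprocess


def build_cleanup_commands(db_name, hdfs_dir):
    """Return the commands (as argv lists) that reset one task's state."""
    # Accept only plain identifiers so the name is safe to embed in HiveQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", db_name):
        raise ValueError(f"unsafe database name: {db_name!r}")
    # Refuse to wipe the HDFS root or a relative path by accident.
    if not hdfs_dir.startswith("/") or hdfs_dir.strip("/") == "":
        raise ValueError(f"refusing to delete HDFS path: {hdfs_dir!r}")

    # CASCADE drops the database together with any tables left in it.
    hiveql = (
        f"DROP DATABASE IF EXISTS {db_name} CASCADE; "
        f"CREATE DATABASE {db_name};"
    )
    return [
        ["hive", "-e", hiveql],
        # -f ignores a missing path; -skipTrash deletes instead of moving to trash.
        ["hdfs", "dfs", "-rm", "-r", "-f", "-skipTrash", hdfs_dir],
        # Assumption: tasks expect the (now empty) directory to exist.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
    ]


def reset_task_state(db_name, hdfs_dir):
    """Run the cleanup commands in order; raises if any of them fails."""
    for argv in build_cleanup_commands(db_name, hdfs_dir):
        subprocess.run(argv, check=True)
```

    A task would then call something like `reset_task_state("task_db", "/data/task_scratch")` as its first step. Note that `-skipTrash` makes the deletion permanent, so the guard against deleting `/` is worth keeping.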