I have several tasks that I'm performing on AWS EMR clusters. They don't share data, and I would like to use the same EMR cluster to run them one after another. Is there a way to reset a running EMR cluster to its initial state (remove Hive tables, clean all HDFS files, etc.) to avoid data collisions?
I want to reuse EMR for several reasons:
We didn't find a "quick and clean" API to achieve this behaviour. Instead, we settled on a simple working convention that ensures all the data gets cleaned.
Every time a task starts, it first drops its own dedicated Hive database if it exists and recreates it, then recursively deletes all the data under its dedicated location in HDFS.
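As a rough illustration, the per-task cleanup could be sketched in Python like this. It assumes the `hive` and `hdfs` command-line tools are on the PATH of the node where the task starts. The function names, the database name and the HDFS path are hypothetical. Using `CASCADE` and `-skipTrash`, and recreating the empty directory afterwards, are my own choices rather than part of the convention itself.

```python
import re
import subprocess


def build_cleanup_commands(db_name, hdfs_path):
    """Return the shell commands (as argv lists) that reset one task's state.

    db_name and hdfs_path are whatever the task uses by convention; the
    checks below are conservative guards chosen for this sketch.
    """
    # The database name is interpolated into HiveQL, so accept identifiers only.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", db_name):
        raise ValueError(f"unsafe database name: {db_name!r}")
    # Refuse relative paths and the HDFS root to avoid wiping unrelated data.
    if not hdfs_path.startswith("/") or hdfs_path.rstrip("/") == "":
        raise ValueError(f"unsafe HDFS path: {hdfs_path!r}")

    # CASCADE drops the tables inside the database as well (assumption:
    # the task owns everything in that database).
    hql = f"DROP DATABASE IF EXISTS {db_name} CASCADE; CREATE DATABASE {db_name};"
    return [
        ["hive", "-e", hql],
        # -f: no error if the path is missing; -skipTrash: delete immediately.
        ["hdfs", "dfs", "-rm", "-r", "-f", "-skipTrash", hdfs_path],
        # Optional step (not in the original convention): recreate the empty dir.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_path],
    ]


def reset_task_state(db_name, hdfs_path, runner=subprocess.run):
    """Run the cleanup commands in order, stopping at the first failure."""
    for argv in build_cleanup_commands(db_name, hdfs_path):
        runner(argv, check=True)


if __name__ == "__main__":
    # Hypothetical task: database "task_a", data under /data/task_a.
    reset_task_state("task_a", "/data/task_a")
```

Each task would call `reset_task_state` with its own database name and location before doing any real work, so leftovers from a previous run on the same cluster can't leak into the new one.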