I have a processing task I am trying to complete with an AWS Data Pipeline, but I have not been able to get it working.
Basically, I have two data nodes representing two MySQL databases, from which data is extracted periodically and placed in an S3 bucket. The copy activity works fine: every day it selects all rows that were added since the previous run, i.e. today - 1 day.
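In case it helps, the extraction on the source data node is driven by a select query roughly like the one below; the table and column names are just placeholders for my actual schema:

"selectQuery": "SELECT * FROM my_table WHERE created_at >= '#{format(minusDays(@scheduledStartTime, 1), 'YYYY-MM-dd')}'"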
However, the bucket containing the collected CSV files should then become the input for an EMR activity, which processes those files and aggregates the information. The problem is that I do not know how to remove the already processed files, or move them to a different bucket, so that I do not have to process all the files every day.
To clarify, I am looking for a way to move or remove already processed files in an S3 bucket from within the pipeline. Can I do that? Alternatively, is there a way to have an EMR activity process only some of the files, based on a naming convention or something else?
Even better: create a Data Pipeline ShellCommandActivity and use the AWS command line tools.
Create a script with these two lines:
sudo yum -y upgrade aws-cli
aws s3 rm "$1" --recursive
The first line makes sure you have the latest version of the AWS CLI tools.
The second one recursively removes an S3 directory (prefix) and all of its contents. The $1 is an argument passed to the script.
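Putting it together, a minimal complete version of the script could look like this (the shebang, set -e, and the comments are additions of mine, assuming a standard bash environment on the instance that runs the activity):

#!/bin/bash
set -e

# Make sure the latest AWS CLI is installed
sudo yum -y upgrade aws-cli

# Recursively delete everything under the S3 path passed as the first argument
aws s3 rm "$1" --recursive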
In your ShellCommandActivity:
"scriptUri": "s3://myBucket/scripts/theScriptAbove.sh",
"scriptArgument": "s3://myBucket/myDirectoryToBeDeleted"
The details on how the aws s3 command works are at:
http://docs.aws.amazon.com/cli/latest/reference/s3/index.html