Tags: amazon-web-services, amazon-s3, amazon-data-pipeline

Using Amazon's Data Pipeline to back up an S3 bucket -- how to skip existing files and avoid unnecessary overwriting?


I'm using Amazon's Data Pipeline to copy an S3 bucket to another bucket. It's a pretty straightforward setup, and runs nightly. However, every subsequent run copies the same files over and over -- I'd rather it just skip existing files and copy only the new ones, as this backup is going to get quite large in the future. Is there a way to do this?


Solution

  • Looking at this thread, it seems it is not possible to do the sync with the default CopyActivity:

    You can definitely use Data Pipeline to copy one S3 directory to another, with the caveat that, if you use the CopyActivity, it'll be a full copy, not an rsync. So if you're operating on a large number of files where only a small fraction have changed, the CopyActivity wouldn't be the most efficient way to do it.

    You could also write your own logic to perform the diff and then only sync that, and use the CommandRunnerActivity to schedule and manage it.

    I think they are actually referring to the ShellCommandActivity, which allows you to schedule a shell command to run.

    I can't give you an exact configuration example, but here is an example of a command you can run as a regular cron job to sync two buckets: aws s3 sync s3://source_bucket s3://target_bucket.

    It should be possible to run it with ShellCommandActivity. Check also ShellCommandActivity in AWS Data Pipeline, and the comments on the answer here. A rough sketch of such a pipeline definition is at the end of this answer.

    Update: the comment by @trevorhinesley below has the final solution (the default instance launched by the pipeline uses an old version of the AWS CLI that lacks the sync command):

    For anyone who comes across this, I had to fire up an EC2 instance, then copy the AMI ID that it used (it's in the info below the list of instances when you select it in the Instances menu under EC2). I used that image ID in the data pipeline and it fixed it!
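
    To tie the answer together, here is a minimal sketch of what such a pipeline definition could look like in the JSON pipeline-definition format. It is an untested, assumption-laden example: the bucket names, AMI ID, instance type, schedule, and object ids are placeholders, and the roles are the Data Pipeline defaults, so adjust everything to your own account before relying on it.

        {
          "objects": [
            {
              "id": "Default",
              "name": "Default",
              "scheduleType": "cron",
              "failureAndRerunMode": "CASCADE",
              "role": "DataPipelineDefaultRole",
              "resourceRole": "DataPipelineDefaultResourceRole"
            },
            {
              "id": "NightlySchedule",
              "name": "NightlySchedule",
              "type": "Schedule",
              "period": "1 day",
              "startAt": "FIRST_ACTIVATION_DATE_TIME"
            },
            {
              "id": "SyncBuckets",
              "name": "SyncBuckets",
              "type": "ShellCommandActivity",
              "schedule": { "ref": "NightlySchedule" },
              "runsOn": { "ref": "SyncInstance" },
              "command": "aws s3 sync s3://source_bucket s3://target_bucket"
            },
            {
              "id": "SyncInstance",
              "name": "SyncInstance",
              "type": "Ec2Resource",
              "schedule": { "ref": "NightlySchedule" },
              "instanceType": "t1.micro",
              "imageId": "ami-xxxxxxxx",
              "terminateAfter": "1 Hour"
            }
          ]
        }

    The imageId field on the Ec2Resource is where you would plug in the AMI ID from the comment above, so the instance comes up with a CLI new enough to have aws s3 sync. Since sync only copies objects that are missing or changed in the target bucket, the nightly run stops re-copying the whole bucket.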