Tags: amazon-web-services, amazon-s3, aws-batch, aws-datasync

Efficient way to Copy/Replicate S3 Objects?


I need to replicate millions of S3 objects (a one-time task), modifying their metadata while keeping the same bucket and object path.

To perform this, we have the options listed below and need to choose the most cost-effective method:

  1. AWS COPY requests
  2. AWS Batch Operations
  3. AWS DataSync

References: https://repost.aws/knowledge-center/s3-large-transfer-between-buckets

I've read the AWS docs but could not determine which one is better in terms of cost.


Solution

  • To update metadata on an Amazon S3 object, it is necessary to COPY the object to itself while specifying the new metadata.

    From Copying objects - Amazon Simple Storage Service:

    Each Amazon S3 object has metadata. It is a set of name-value pairs. You can set object metadata at the time you upload it. After you upload the object, you cannot modify object metadata. The only way to modify object metadata is to make a copy of the object and set the metadata. In the copy operation, set the same object as the source and target.
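    For illustration, here is a minimal sketch of that copy-to-itself call using boto3 (the bucket name, key, and metadata values are placeholders):

    ```python
    import boto3

    s3 = boto3.client("s3")

    # Copy the object onto itself, replacing its user-defined metadata.
    # Note: a single copy operation handles objects up to 5 GB; larger
    # objects require a multipart copy.
    s3.copy_object(
        Bucket="my-bucket",
        Key="path/to/object.txt",
        CopySource={"Bucket": "my-bucket", "Key": "path/to/object.txt"},
        Metadata={"my-key": "my-value"},   # the new metadata
        MetadataDirective="REPLACE",       # replace rather than copy metadata
    )
    ```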

    However, you have a choice as to how to trigger the COPY operation:

    • You can write your own code that loops through the objects and performs the copy (sketched after this list), or
    • You can use S3 Batch Operations to perform the copy
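    As a rough sketch of the first option (again with placeholder names), the do-it-yourself loop pages through the bucket and copies each object onto itself. Run serially, this is slow for millions of objects, which is one reason to prefer Batch Operations:

    ```python
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"                    # placeholder
    new_metadata = {"my-key": "my-value"}   # placeholder

    # Page through every object and copy each onto itself with replaced
    # metadata. Production code would parallelise this and add error
    # handling and retries.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            s3.copy_object(
                Bucket=bucket,
                Key=obj["Key"],
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
                Metadata=new_metadata,
                MetadataDirective="REPLACE",
            )
    ```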

    Given that you have millions of objects, I would recommend using S3 Batch Operations since it can perform the process at massive scale.

    I would recommend this process:

    • Activate Amazon S3 Inventory on the bucket, which can provide a daily or weekly CSV file listing all objects.
    • Use the S3 Inventory output file as the manifest for the batch operation. You may need to trim it (via code or a spreadsheet) down to just the objects you want; the copy-to-itself behaviour and the new metadata are specified on the job itself, as shown in the sketch after this list.
    • Submit the manifest file to S3 Batch Operations. (It can take some time to start executing.)
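    To make those steps concrete, here is a hedged sketch of creating such a job with boto3's s3control client. The account ID, role ARN, bucket, manifest location, and ETag are all placeholders, and the operation parameters are worth confirming against the current API reference. Each manifest line is simply `bucket,key`:

    ```python
    import boto3

    s3control = boto3.client("s3control")

    response = s3control.create_job(
        AccountId="111122223333",        # placeholder account ID
        ConfirmationRequired=True,       # lets you review before it runs
        Priority=10,
        RoleArn="arn:aws:iam::111122223333:role/batch-ops-role",
        Operation={
            "S3PutObjectCopy": {
                # Target is the same bucket, so objects copy onto themselves
                "TargetResource": "arn:aws:s3:::my-bucket",
                "MetadataDirective": "REPLACE",
                "NewObjectMetadata": {"UserMetadata": {"my-key": "my-value"}},
            }
        },
        Manifest={
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],
            },
            "Location": {
                "ObjectArn": "arn:aws:s3:::my-bucket/manifests/manifest.csv",
                "ETag": "etag-of-the-manifest-object",  # placeholder
            },
        },
        Report={
            "Bucket": "arn:aws:s3:::my-bucket",
            "Prefix": "batch-reports",
            "Format": "Report_CSV_20180820",
            "Enabled": True,
            "ReportScope": "FailedTasksOnly",
        },
    )
    print(response["JobId"])
    ```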

    I suggest that you try the S3 Batch Operations step on a subset of objects (e.g., 10 objects) first to confirm that it operates the way you expect. This will be relatively fast and will help you catch errors before running at full scale.

    Note that S3 Batch Operations charges $1.00 per million object operations performed.
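
    At that rate, 10 million objects would come to roughly $10 in Batch Operations charges; the underlying COPY requests (and, if I recall correctly, a small per-job fee) are billed separately, so check the current S3 pricing page for the full picture.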