Tags: amazon-web-services, amazon-s3, archive, amazon-glacier

Archiving millions of small files on S3 to S3 Glacier Deep Archive


I have about 80,000,000 files of roughly 50 KB each on S3 (about 4 TB in total), which I want to transfer to Glacier Deep Archive. I have come to realize that transferring a lot of small files to Glacier is cost-inefficient.

Assuming I don't mind archiving my files into a single tar/zip (or several), what would be the best practice for transitioning those files to Glacier DA?

It is important to note that I only have these files on S3, and not on any local machine.


Solution

  • The most efficient way would be:

    • Launch an Amazon EC2 instance in the same region as the bucket. Choose an instance type with high-bandwidth networking (e.g. the t3 family). Launch it with spot pricing, since you can tolerate the small chance that it is stopped. Assign plenty of EBS disk space. (Alternatively, you could choose a Storage Optimized instance, since its instance-store disk space is included in the price, but the instance itself is more expensive. Your choice!)
    • Download a subset of the files to the instance using the AWS Command-Line Interface (CLI) by specifying a path (subdirectory) to copy (a scripted version of these steps is sketched after this list). Don't try to do it all at once!
    • Zip/compress the files on the EC2 instance
    • Upload the compressed files to S3 using --storage-class DEEP_ARCHIVE
    • Check that everything seems good, and repeat for another subset!
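
    A minimal sketch of one pass through those steps with the AWS CLI, assuming hypothetical bucket names (my-source-bucket, my-archive-bucket) and a per-iteration prefix:

    ```bash
    #!/usr/bin/env bash
    set -euo pipefail

    SRC_BUCKET="my-source-bucket"     # hypothetical names -- substitute your own
    DST_BUCKET="my-archive-bucket"
    PREFIX="2019/01"                  # one subdirectory/prefix per iteration
    WORKDIR="/mnt/ebs/batch"          # somewhere on the large EBS volume
    ARCHIVE="archive-${PREFIX//\//-}.tar.gz"

    mkdir -p "$WORKDIR/$PREFIX"

    # 1. Download one prefix worth of small files to the instance
    aws s3 cp "s3://$SRC_BUCKET/$PREFIX/" "$WORKDIR/$PREFIX/" --recursive

    # 2. Bundle and compress them into a single archive
    tar -czf "$WORKDIR/$ARCHIVE" -C "$WORKDIR" "$PREFIX"

    # 3. Upload the archive directly into the Deep Archive storage class
    aws s3 cp "$WORKDIR/$ARCHIVE" "s3://$DST_BUCKET/$ARCHIVE" --storage-class DEEP_ARCHIVE

    # 4. Spot-check that it landed, then clean up before the next prefix
    aws s3 ls "s3://$DST_BUCKET/$ARCHIVE"
    rm -rf "$WORKDIR/$PREFIX" "$WORKDIR/$ARCHIVE"
    ```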

    The above would incur very little cost, since you can terminate the EC2 instance when it is no longer needed, and EBS is only charged while the volumes exist.

    If it takes too long to list a subset of the files, you might consider using Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You can then use this list to copy specific files, or to identify a path/subdirectory to copy.
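
    If you do use S3 Inventory, here is a rough sketch of turning one of its gzipped CSV data files into a key list for a targeted copy (the file name below is hypothetical; the real inventory delivers many CSV parts referenced from a manifest.json, and keys may be URL-encoded):

    ```bash
    # Column 2 of the default inventory CSV format is the object key
    # (this simple parse assumes keys contain no commas).
    zcat inventory-part-00000.csv.gz \
      | awk -F',' '{gsub(/"/, "", $2); print $2}' \
      | grep '^2019/01/' > keys-2019-01.txt

    # Copy just those objects -- one call per key is slow but precise;
    # more often you would use the list simply to pick a prefix to copy recursively.
    while read -r key; do
      aws s3 cp "s3://my-source-bucket/$key" "/mnt/ebs/batch/$key"
    done < keys-2019-01.txt
    ```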

    As an extra piece of advice... if your system is continuing to collect even more files, you might consider collecting the data in a different way (e.g. streaming through Amazon Kinesis Data Firehose to batch data together), or combining the data on a regular basis rather than letting it creep up to so many files again. Fewer, larger files are much easier to work with in downstream processes.
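
    To illustrate the batching idea, here is a hedged sketch of creating a Kinesis Data Firehose delivery stream whose S3 destination buffers incoming records into larger objects; the stream name, role ARN, and bucket ARN are placeholders:

    ```bash
    # Buffering hints make Firehose write roughly one S3 object per 128 MB
    # or per 15 minutes (whichever comes first), instead of your application
    # writing millions of tiny objects itself.
    aws firehose create-delivery-stream \
      --delivery-stream-name my-batched-ingest \
      --delivery-stream-type DirectPut \
      --extended-s3-destination-configuration '{
          "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-s3",
          "BucketARN": "arn:aws:s3:::my-source-bucket",
          "Prefix": "batched/",
          "CompressionFormat": "GZIP",
          "BufferingHints": { "SizeInMBs": 128, "IntervalInSeconds": 900 }
      }'
    ```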