I need to move a few tens of millions of files, totalling several TB, into a Glacier vault. This is going to take a long time, and I am worried that there will be errors along the way.
How can I prevent a case where the upload stops in the middle, I am not sure which files have already been uploaded, and I have to start all over again? Should I write my own Python code that works with lists and checks against Glacier whether a file was already uploaded, or are there tools that have this built in?
Thank you
You could use one of the new AWS Snowcone units, which store 8 TB of data each.
Alternatively, how long an upload takes comes down to bandwidth. The AWS Command-Line Interface (CLI) `aws s3 sync` command makes it possible to recover from failures, because rerunning it copies only the files that are missing or changed at the destination, but it can take a long time to read through millions of files. It would be good if you could segment the copy into smaller blocks, for example one directory at a time.
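As a rough illustration of segmenting the copy, here is a minimal Python sketch. It assumes the data is already split into subdirectories under one root, that the `aws` CLI is installed and configured, and that the bucket name, prefix and checkpoint file name are placeholders of my own choosing. It builds one `aws s3 sync` command per immediate subdirectory and records finished segments in a text file, so a rerun skips them. Files sitting directly in the root are not covered by this sketch.

```python
import subprocess
from pathlib import Path


def plan_segments(root, bucket, prefix=""):
    """Build one `aws s3 sync` command per immediate subdirectory of root."""
    commands = {}
    for sub in sorted(Path(root).iterdir()):
        if sub.is_dir():
            target = f"s3://{bucket}/{prefix}{sub.name}/"
            commands[sub.name] = ["aws", "s3", "sync", str(sub), target]
    return commands


def run_segments(commands, done_file, runner=subprocess.run):
    """Run each segment, appending its name to done_file on success.

    Segments already listed in done_file are skipped, so the script can be
    rerun after an interruption. `runner` is injectable so the logic can be
    exercised without calling the real CLI.
    """
    done_path = Path(done_file)
    done = set(done_path.read_text().splitlines()) if done_path.exists() else set()
    for name, cmd in commands.items():
        if name in done:
            continue  # finished in an earlier run
        result = runner(cmd)
        if result.returncode != 0:
            print(f"segment {name} failed; rerun the script to retry it")
            continue
        with done_path.open("a") as fh:
            fh.write(name + "\n")
        done.add(name)
    return done


# Hypothetical usage (paths and bucket name are placeholders):
# cmds = plan_segments("/data/archive", "my-bucket", "archive/")
# run_segments(cmds, "segments_done.txt")
```

Since `aws s3 sync` itself skips files that are already present and unchanged, a segment that failed halfway only re-uploads what is missing when it is retried.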
Actually, it might be a good use-case for AWS DataSync:
> AWS DataSync is an online data transfer service that simplifies, automates, and accelerates copying large amounts of data to and from AWS storage services over the internet or AWS Direct Connect. DataSync can copy data between Network File System (NFS), Server Message Block (SMB) file servers, or AWS Snowcone, and Amazon Simple Storage Service (Amazon S3) buckets, Amazon EFS file systems, and Amazon FSx for Windows File Server file systems.
It will move the data in a faster, more managed way.
I would advise against moving your data into a Glacier Vault. Accessing Glacier is notoriously slow, and really requires software tools to use it correctly.
Instead, I would suggest putting your data into Amazon S3. You can then use object lifecycle management to change the storage class of the objects. If your goal is low-cost storage, then select Glacier Deep Archive, which is cheaper again than the normal Glacier service.
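For reference, a lifecycle rule like that could be set up roughly as follows. This is only a sketch: the rule ID, the one-day default and the function names are my own choices, and `apply_rule` assumes `boto3` is installed and AWS credentials are configured. The configuration is built as a plain dictionary and passed to boto3's `put_bucket_lifecycle_configuration`.

```python
def deep_archive_lifecycle(days=1, prefix=""):
    """Build a lifecycle configuration that moves objects to Deep Archive.

    `days` is the object age at which the transition happens; an empty
    `prefix` applies the rule to every object in the bucket.
    """
    return {
        "Rules": [
            {
                "ID": "to-deep-archive",  # arbitrary label chosen for this sketch
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days, "StorageClass": "DEEP_ARCHIVE"}
                ],
            }
        ]
    }


def apply_rule(bucket, days=1, prefix=""):
    """Attach the rule to a bucket (replaces any existing lifecycle config)."""
    import boto3  # third-party; imported here so the builder works without it

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=deep_archive_lifecycle(days, prefix),
    )


# Hypothetical usage (bucket name is a placeholder):
# apply_rule("my-bucket", days=1)
```

Note that this call overwrites whatever lifecycle configuration the bucket already has, so any existing rules would need to be included in the same request.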
If you want to persist with using a Glacier Vault, I suggest you do a few 'trial' uploads and retrievals to discover whether you are willing to use the service for all your data. (Frankly, there's little reason to go direct to Glacier these days.)