amazon-web-services  amazon-s3  aws-cli

How to unzip .zip and .gz files in S3 and move the extracted files to a different location within the same bucket, making sure no duplicate files are moved


Let's say I have the following files in an S3 bucket -

  1. loc/abcd.zip
  2. loc/abcd.txt
  3. loc/efgh.gz
  4. loc/ijkl.zip

Each zipped file contains a txt file with the same base name as the archive (e.g. abcd.zip contains abcd.txt).

I want to unzip the .zip and .gz files and move all the txt files to a different location in the same S3 bucket (say newloc/). The files should only be moved once.

So the files in the destination should look like -

  1. newloc/abcd.txt
  2. newloc/efgh.txt
  3. newloc/ijkl.txt

In the above example, abcd.txt was only moved once to newloc/ even though loc/ had both abcd.zip and abcd.txt present.

I'm fairly new to the AWS CLI and to AWS in general, so I'm not sure how to achieve this. There are about 800 txt files, each roughly 500 MB to 1 GB in size.


Solution

  • There is no built-in capability in Amazon S3 to unzip files.

    You should write a script that lists the files, then loops through them and:

    • Downloads the zip/gz file
    • Unzips the file
    • Loops through the resulting files and uploads them to the desired S3 location
    • Deletes the local files

    Using Python and the boto3 library would be easier than writing a shell script and using the AWS CLI. A rough sketch of such a script follows.
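    For example, here is a minimal sketch of such a loop, assuming a bucket
    named my-bucket and the loc/ and newloc/ prefixes from the question. The
    bucket name and the object_exists() helper are illustrative, not part of
    any API:

      import gzip
      import os
      import shutil
      import tempfile
      import zipfile

      import boto3

      s3 = boto3.client("s3")

      BUCKET = "my-bucket"   # assumption: replace with your bucket name
      SRC_PREFIX = "loc/"
      DST_PREFIX = "newloc/"

      def object_exists(bucket, key):
          """Illustrative helper: True if HeadObject succeeds (see below)."""
          try:
              s3.head_object(Bucket=bucket, Key=key)
              return True
          except s3.exceptions.ClientError:
              return False

      paginator = s3.get_paginator("list_objects_v2")
      for page in paginator.paginate(Bucket=BUCKET, Prefix=SRC_PREFIX):
          for obj in page.get("Contents", []):
              key = obj["Key"]
              if not key.endswith((".zip", ".gz")):
                  continue

              # A temporary directory cleans up the local files automatically.
              # The archives are 500 MB - 1 GB each, so check local disk space.
              with tempfile.TemporaryDirectory() as tmp:
                  archive = os.path.join(tmp, os.path.basename(key))
                  s3.download_file(BUCKET, key, archive)

                  extracted = []
                  if key.endswith(".zip"):
                      with zipfile.ZipFile(archive) as zf:
                          zf.extractall(tmp)
                          extracted = zf.namelist()
                  else:
                      # A .gz holds one file; per the question it is a txt
                      # with the same base name as the archive
                      stem = os.path.basename(key)[:-len(".gz")]
                      name = stem if stem.endswith(".txt") else stem + ".txt"
                      with gzip.open(archive, "rb") as fin, \
                           open(os.path.join(tmp, name), "wb") as fout:
                          shutil.copyfileobj(fin, fout)
                      extracted = [name]

                  # Upload each txt once, skipping keys that already exist
                  for name in extracted:
                      if not name.endswith(".txt"):
                          continue
                      dest_key = DST_PREFIX + os.path.basename(name)
                      if not object_exists(BUCKET, dest_key):
                          s3.upload_file(os.path.join(tmp, name), BUCKET, dest_key)

    Given roughly 800 files at 500 MB - 1 GB each, you may also want to run
    this on an EC2 instance in the same region as the bucket, so the data does
    not travel over the internet twice.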

    You can check whether an object already exists in S3 by calling head_object(); it raises an error if the object is not found.
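    For instance, a standalone sketch of that check (head_object() raises a
    ClientError with a 404 code when the key does not exist):

      import boto3
      from botocore.exceptions import ClientError

      s3 = boto3.client("s3")

      def object_exists(bucket, key):
          """Check for an existing object without downloading it."""
          try:
              s3.head_object(Bucket=bucket, Key=key)
              return True
          except ClientError as e:
              # HeadObject reports a missing key as 404; re-raise other errors
              if e.response["Error"]["Code"] == "404":
                  return False
              raise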

    See: Amazon S3 examples - Boto3 documentation