Tags: amazon-web-services, amazon-s3, aws-lambda

How to extract files in S3 on the fly with boto3?


I'm trying to find a way to extract .gz files in S3 on the fly, that is, without downloading them locally, extracting them, and then pushing them back to S3.

With boto3 + Lambda, how can I achieve this?

I didn't see any extract capability in the boto3 documentation.


Solution

  • Amazon S3 is a storage service. It has no built-in capability to manipulate the content of files.

    However, you could use an AWS Lambda function to retrieve an object from S3, decompress it, then upload the content back again. Note that Lambda functions have a default limit of 512 MB of temporary disk space (and their own memory limits), so avoid decompressing too much data at the same time.
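
    As a rough guard against that limit, the function can check an object's size before deciding to process it. A minimal sketch; the threshold below is an illustrative value you would tune for your workload:

    import boto3

    s3 = boto3.client('s3')

    # Illustrative threshold: skip objects whose compressed size alone
    # already approaches the 512 MB of temporary space.
    MAX_COMPRESSED_BYTES = 100 * 1024 * 1024

    def is_small_enough(bucket, key):
        # head_object fetches metadata only, without downloading the body
        size = s3.head_object(Bucket=bucket, Key=key)['ContentLength']
        return size <= MAX_COMPRESSED_BYTES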

    You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:

    • Use boto3 to download the new file
    • Use the gzip Python library to extract files
    • Use boto3 to upload the resulting file(s)

    Sample code:

    import gzip
    import io

    import boto3

    bucket = '<bucket_name>'
    key = '<key_name>'  # expected to end in '.gz'

    s3 = boto3.client('s3', use_ssl=False)  # use_ssl=False sends traffic over HTTP; omit it to keep HTTPS

    # Read the compressed object fully into memory
    compressed_file = io.BytesIO(
        s3.get_object(Bucket=bucket, Key=key)['Body'].read())

    # Wrap it in a GzipFile so reads return decompressed bytes
    uncompressed_file = gzip.GzipFile(None, 'rb', fileobj=compressed_file)

    # Upload the decompressed stream; key[:-3] strips the '.gz' suffix
    s3.upload_fileobj(Fileobj=uncompressed_file, Bucket=bucket, Key=key[:-3])
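
    To run this from an S3 trigger, the same logic can sit inside a Lambda handler that pulls the bucket and key out of the event record. A sketch, assuming an s3:ObjectCreated:* trigger; the handler name and suffix check are illustrative:

    import gzip
    import io
    import urllib.parse

    import boto3

    s3 = boto3.client('s3')

    def lambda_handler(event, context):
        # Each record describes one object-created event from the trigger
        for record in event['Records']:
            bucket = record['s3']['bucket']['name']
            # Keys in S3 event records are URL-encoded (e.g. spaces become '+')
            key = urllib.parse.unquote_plus(record['s3']['object']['key'])
            if not key.endswith('.gz'):
                continue  # skip objects this function shouldn't touch
            compressed = io.BytesIO(
                s3.get_object(Bucket=bucket, Key=key)['Body'].read())
            uncompressed = gzip.GzipFile(None, 'rb', fileobj=compressed)
            s3.upload_fileobj(Fileobj=uncompressed, Bucket=bucket,
                              Key=key[:-3])

    Since the decompressed copy lands in the same bucket, the suffix check (or a '.gz' suffix filter on the trigger itself) is what keeps the function from re-firing on its own output.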