python amazon-web-services amazon-s3 boto3 amazon-sqs

How to download a file whose name contains a space from S3

My system is set up as S3 -> Lambda -> SQS -> EC2. When a file is uploaded to S3, an S3 event notification triggers a Lambda function. The Lambda reads the S3 bucket name and object key with

s3_bucket = event['Records'][0]['s3']['bucket']['name']
s3_key = event['Records'][0]['s3']['object']['key']

The message is converted to JSON and sent to SQS. The conversion is done by

json.dumps({'from_s3': 's3://{b}/{k}'.format(b=s3_bucket, k=s3_key)})

Then the EC2 instance polls the SQS queue with boto3

response = sqs_client.receive_message(QueueUrl=queue_url, AttributeNames=['ALL'], MaxNumberOfMessages=5)
messages = response['Messages']
body = json.loads(messages[i]['Body'])
from_s3 = body['from_s3']
s3_bucket, s3_key = re.match(r"s3:\/\/(.+?)\/(.+)", from_s3).groups()

According to the log, if an uploaded file name contains spaces, e.g. "abc def.jpg", the received value of s3_key is "abc+def.jpg". As a result, when I pass that value to download_file of the boto3 S3 client, it returns a 404 error.

How should I encode or decode the S3 object key in Lambda so that the boto3 S3 client can download the file?

Solution

S3 URL-encodes the object key in the event notification, which is why the space arrives as "+". To obtain the unquoted key, you can use:

import urllib.parse

objectKey = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
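
As a quick check of what the decoding does, here is a minimal sketch using the file name from the question. The second sample value (`a%2Bb.jpg`) is an illustrative assumption, added only to show that a percent-encoded plus sign is restored to a literal "+" rather than turned into a space.

```python
from urllib.parse import unquote_plus

# Key as it appears in the S3 event notification (URL-encoded).
encoded_key = "abc+def.jpg"

# unquote_plus replaces "+" with a space and decodes %XX escapes.
decoded_key = unquote_plus(encoded_key)
print(decoded_key)  # abc def.jpg

# Assumed sample: a percent-encoded plus sign decodes to a literal "+".
decoded_plus = unquote_plus("a%2Bb.jpg")
print(decoded_plus)  # a+b.jpg
```

The decoded value is the one to pass on to SQS and then to download_file.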

Also, please note that a single event passed to your AWS Lambda function might contain multiple records. The function should loop through the records like this:

for record in event['Records']:
    s3_bucket = record['s3']['bucket']['name']
    # Decode the URL-encoded key ("abc+def.jpg" -> "abc def.jpg")
    s3_key = urllib.parse.unquote_plus(record['s3']['object']['key'])

    # Do stuff here
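
Putting the pieces together, the Lambda side could look roughly like the sketch below. It only builds the JSON message bodies in the same format the question uses and leaves out the actual SQS call. The helper name `build_messages`, the bucket name `my-bucket`, and the hand-written `sample_event` are hypothetical and exist only for illustration.

```python
import json
import re
from urllib.parse import unquote_plus


def build_messages(event):
    """Return one JSON message body per S3 record, with the key decoded."""
    bodies = []
    for record in event['Records']:
        s3_bucket = record['s3']['bucket']['name']
        # Decode the URL-encoded key before putting it in the message.
        s3_key = unquote_plus(record['s3']['object']['key'])
        bodies.append(json.dumps(
            {'from_s3': 's3://{b}/{k}'.format(b=s3_bucket, k=s3_key)}
        ))
    return bodies


# Hand-written sample event containing only the fields used above.
sample_event = {
    'Records': [
        {'s3': {'bucket': {'name': 'my-bucket'},
                'object': {'key': 'abc+def.jpg'}}},
    ]
}

bodies = build_messages(sample_event)
print(bodies[0])  # {"from_s3": "s3://my-bucket/abc def.jpg"}

# Consumer side: the regex from the question now yields the real key.
from_s3 = json.loads(bodies[0])['from_s3']
parsed_bucket, parsed_key = re.match(r"s3:\/\/(.+?)\/(.+)", from_s3).groups()
print(parsed_bucket, parsed_key)
```

With the key decoded in Lambda, the EC2 consumer can pass the parsed key to download_file unchanged.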