
Do Amazon S3 or Lambda cache files or data by default? How to turn it off?


This seems very strange because, as far as I know, Amazon S3 and Lambda do not cache anything by default; in my case, however, they appear to.

I am trying to use AWS Polly to convert text to speech and store the resulting mp3 files in an S3 bucket. I use AWS Lambda to kick off Polly.

Now, I've created an AWS Lambda test which takes the following two properties: ID and Text. ID will be the name of the mp3 file and Text is the text to be converted. The Text is the same throughout all the steps:

  1. Initially I ran a bad test with ID my-post, in which I launched the Lambda several times for the same ID. This resulted in one big audio file (18 MB) repeating the provided text over and over (boo!).
  2. When I ran the Lambda again (only once) with the same text but a new ID (e.g. my-post-2), I got a small audio file reading the text only once (yay!).
  3. Then I deleted both files from S3.
  4. I ran the Lambda again with ID my-post-2. I got the small file again, as expected.
  5. I ran the Lambda again with ID my-post. I got the big file again.

Now I'm stuck in a state where running my Lambda with the original ID (obviously the one I want to use) generates a huge audio file repeating the text several times, while any other ID gives me the normal-sized file reading the text once.

Here is my function:

import boto3
import os
from contextlib import closing
from boto3.dynamodb.conditions import Key, Attr

def lambda_handler(event, context):

    audiopostid = event["audiopostid"]
    text = event["text"]
    voice = event["voice"]

    rest = text

    textBlocks = []
    while (len(rest) > 1100):
        begin = 0
        end = rest.find(".", 1000)

        if (end == -1):
            end = rest.find(" ", 1000)

        textBlock = rest[begin:end]
        rest = rest[end:]
        textBlocks.append(textBlock)
    textBlocks.append(rest)

    polly = boto3.client('polly')
    for textBlock in textBlocks:
        response = polly.synthesize_speech(
            OutputFormat='mp3',
            Text = textBlock,
            VoiceId = voice
        )

        if "AudioStream" in response:
            with closing(response["AudioStream"]) as stream:
                output = os.path.join("/tmp/", audiopostid)
                with open(output, "a") as file:
                    file.write(stream.read())


    s3 = boto3.client('s3')
    s3.upload_file('/tmp/' + audiopostid,
      os.environ['BUCKET_NAME'],
      audiopostid + ".mp3")
    s3.put_object_acl(ACL='public-read',
      Bucket=os.environ['BUCKET_NAME'],
      Key= audiopostid + ".mp3")

    location = s3.get_bucket_location(Bucket=os.environ['BUCKET_NAME'])
    region = location['LocationConstraint']

    if region is None:
        url_begining = "https://s3.amazonaws.com/"
    else:
        url_begining = "https://s3-" + str(region) + ".amazonaws.com/"

    url = url_begining \
            + str(os.environ['BUCKET_NAME']) \
            + "/" \
            + str(audiopostid) \
            + ".mp3"

    return

Solution

  • AWS Lambda can reuse an existing execution container for subsequent invocations. A reused container keeps the contents of the /tmp directory intact. This is advantageous in many scenarios, since files there can serve as a cache shared by multiple invocations.

    But this causes a problem in your case. Because you open the file in append mode (open(output, "a")), a Lambda invoked in a reused container appends the new audio to the file left over from previous invocation(s), which makes your audio repeat several times.

    Deleting any existing temporary files (that should not be reused) at the start of the Lambda function should solve the problem. It also helps with the limited disk space in /tmp (512 MB by default): running the Lambda with several different IDs could otherwise leave /tmp full of files, with not enough space to write another one.
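The difference between the two behaviours can be sketched without Polly or S3 at all. In the minimal example below, write_audio_append mirrors the append-mode pattern from the question, while write_audio_fresh removes any leftover file and opens the target once in "wb" mode. Both function names and the chunks parameter are hypothetical, introduced only for illustration; in the real handler the chunks would be the bytes read from each response["AudioStream"].

```python
import os
import tempfile


def write_audio_append(path, chunks):
    """Pattern from the question: append mode keeps whatever is already in the file."""
    with open(path, "ab") as f:
        for chunk in chunks:
            f.write(chunk)
    return os.path.getsize(path)


def write_audio_fresh(path, chunks):
    """Discard any file left by an earlier invocation, then write all chunks once."""
    # A reused container may still hold the file from a previous run.
    if os.path.exists(path):
        os.remove(path)
    # "wb" would truncate on its own; binary mode is what MP3 bytes need.
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
    return os.path.getsize(path)


if __name__ == "__main__":
    # Simulate two invocations that share the same temporary directory.
    workdir = tempfile.mkdtemp()
    chunks = [b"first block ", b"second block"]

    appended = os.path.join(workdir, "my-post-append")
    print(write_audio_append(appended, chunks))  # size after first run
    print(write_audio_append(appended, chunks))  # size doubles on the second run

    fresh = os.path.join(workdir, "my-post-fresh")
    print(write_audio_fresh(fresh, chunks))  # size after first run
    print(write_audio_fresh(fresh, chunks))  # same size on the second run
```

In the handler itself this would mean opening the /tmp file once before the loop over text blocks, rather than reopening it in append mode for every block. Removing the file again after the upload would also keep /tmp from filling up.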