This seems very strange, because as far as I know Amazon S3 and Lambda have nothing to do with caching by default, yet in my case it looks like something is being cached.
I am trying to use AWS Polly to convert text to speech and store the created mp3 files in an S3 bucket. I use AWS Lambda to kick off Polly.
Now, I've created an AWS Lambda test which takes the following two properties: ID and Text. ID will be the name of the mp3 file and Text is the text to be converted. The Text is the same throughout all the steps:
1. I first ran the test with the ID my-post, launching the Lambda several times for the same ID. This resulted in one big audio file (18MB) repeating the provided text over and over (boo!).
2. When I changed the ID (my-post-2), I got the small audio file reading the text only once (yay!).
3. I ran it again with my-post-2. I got the small file again, as expected.
4. I switched back to my-post. I got the big file again.

Now I'm stuck in a state where running my Lambda with the original ID (obviously the one I want to use) generates a huge audio file repeating the text several times, while using another ID gives the normal-sized file, reading the text once.
Here is my function:
import boto3
import os
from contextlib import closing
from boto3.dynamodb.conditions import Key, Attr


def lambda_handler(event, context):
    audiopostid = event["audiopostid"]
    text = event["text"]
    voice = event["voice"]

    # Split the text into blocks of roughly 1,000 characters,
    # breaking at a period or, failing that, at a space
    rest = text
    textBlocks = []
    while (len(rest) > 1100):
        begin = 0
        end = rest.find(".", 1000)
        if (end == -1):
            end = rest.find(" ", 1000)
        textBlock = rest[begin:end]
        rest = rest[end:]
        textBlocks.append(textBlock)
    textBlocks.append(rest)

    # Synthesize each block with Polly and append the audio to a file in /tmp
    polly = boto3.client('polly')
    for textBlock in textBlocks:
        response = polly.synthesize_speech(
            OutputFormat='mp3',
            Text = textBlock,
            VoiceId = voice
        )
        if "AudioStream" in response:
            with closing(response["AudioStream"]) as stream:
                output = os.path.join("/tmp/", audiopostid)
                with open(output, "a") as file:
                    file.write(stream.read())

    # Upload the combined file to S3 and make it publicly readable
    s3 = boto3.client('s3')
    s3.upload_file('/tmp/' + audiopostid,
                   os.environ['BUCKET_NAME'],
                   audiopostid + ".mp3")
    s3.put_object_acl(ACL='public-read',
                      Bucket=os.environ['BUCKET_NAME'],
                      Key= audiopostid + ".mp3")

    # Build the public URL of the uploaded file
    location = s3.get_bucket_location(Bucket=os.environ['BUCKET_NAME'])
    region = location['LocationConstraint']
    if region is None:
        url_begining = "https://s3.amazonaws.com/"
    else:
        url_begining = "https://s3-" + str(region) + ".amazonaws.com/"
    url = url_begining \
        + str(os.environ['BUCKET_NAME']) \
        + "/" \
        + str(audiopostid) \
        + ".mp3"
    return
AWS Lambda can reuse an existing execution container for subsequent invocations. A reused container keeps the contents of the /tmp directory intact. This is advantageous in many scenarios, since files there can act as a cache shared by multiple invocations.
But this causes a problem in your case. Because you open the file in append mode (open(output, "a")), a Lambda invoked in a reused container appends the new audio to the file left over from previous invocation(s), which is why your audio repeats several times.
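The effect can be reproduced locally without Lambda or Polly. The following is only a minimal sketch: fake_invocation is a hypothetical stand-in for the handler, and a directory created with tempfile.mkdtemp plays the role of the /tmp directory of a reused container.

```python
import os
import tempfile


def fake_invocation(tmp_dir, post_id, audio):
    """Hypothetical stand-in for the handler: append audio bytes to a file."""
    output = os.path.join(tmp_dir, post_id)
    # "ab" is append mode, so bytes written by earlier calls stay in the file.
    with open(output, "ab") as f:
        f.write(audio)
    return os.path.getsize(output)


# Using the same directory for every call simulates a reused container.
shared_tmp = tempfile.mkdtemp()
sizes = [fake_invocation(shared_tmp, "my-post", b"x" * 100) for _ in range(3)]
other_size = fake_invocation(shared_tmp, "my-post-2", b"x" * 100)
print(sizes, other_size)  # [100, 200, 300] 100
```

The file for my-post grows with every call, while a previously unused ID starts from an empty file, which matches the behaviour described in the question.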
Deleting any existing temporary files (those that should not be reused) at the start of the Lambda function should solve the problem. This also addresses the limited disk space (512 MB by default): running the Lambda with several different IDs could otherwise leave /tmp full of files, with not enough space left to write another one.
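A minimal sketch of that cleanup, assuming the same /tmp/<audiopostid> path that the function in the question uses. remove_stale_file and write_blocks are hypothetical helper names, and tmp_dir is a parameter only so that the sketch can be tried outside Lambda.

```python
import os


def remove_stale_file(path):
    """Delete a file left over from an earlier invocation, if there is one."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass  # fresh container: nothing to clean up


def write_blocks(audio_blocks, post_id, tmp_dir="/tmp"):
    """Write all audio blocks of one invocation into a fresh file."""
    output = os.path.join(tmp_dir, post_id)
    remove_stale_file(output)
    # Appending is still needed within one invocation (one write per block),
    # but the file now always starts out empty.
    with open(output, "ab") as f:
        for block in audio_blocks:
            f.write(block)
    return output
```

In the handler from the question, the equivalent change would be to call something like remove_stale_file(output) once before the loop over textBlocks. Opening the file a single time in "wb" mode before the loop and writing every block inside that one with statement would likely work as well, since "wb" truncates any existing file.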