Tags: amazon-web-services, aws-lambda, amazon-transcribe

Remove special characters from S3 object key for Transcribe job


This is the very first Lambda function I have created, and I had never written a line of Python before today. I do have programming experience in Salesforce's Apex language, so I can understand most of this.

I have this Lambda function that grabs an object (a WAV file) from S3 and sends it to Amazon Transcribe. I want the name of the Transcribe job to be the name of the S3 object, which works when the name is something simple like "recording.wav". The problem appears when the name is complicated, like "4A6E388B48D454FA993D52611ADD1AB_INT - Integris [email protected]_8082924979__2_X - Survey.wav", because the special characters cause the job to fail.

Can someone tell me an easy way to remove all of those special characters and replace them with underscores? I made an attempt at using unquote_plus but it didn't fix my issue.

Here is the Lambda code:

import boto3
from urllib.parse import unquote_plus

# Create low-level clients for S3 and Transcribe
s3 = boto3.client('s3')
transcribe = boto3.client('transcribe')


def lambda_handler(event, context):
    # Parse the bucket and file name out of the S3 event
    for record in event['Records']:
        file_bucket = record['s3']['bucket']['name']
        file_name = record['s3']['object']['key']
        file_name_only = unquote_plus(record['s3']['object']['key'])
        object_url = 'https://s3.amazonaws.com/{0}/{1}'.format(file_bucket, file_name)

        response = transcribe.start_transcription_job(
            TranscriptionJobName=file_name_only,
            LanguageCode='en-US',
            MediaFormat='wav',
            Media={
                'MediaFileUri': object_url
            })

        print(response)

Here is the error from AWS CloudWatch: (screenshot not included)


Solution

  • You can use a simple regular expression to replace every non-alphanumeric character with an underscore. Note that this also replaces the slashes in the path and the dot before the extension:

    import re

    # Just for testing
    record = {'s3': {'object': {'key': 'path/to/bad - file@with:symbols!.wav'}}}

    # Replace every character that is not a letter or digit with "_"
    file_name_only = re.sub(r"[^a-zA-Z0-9]", "_", record['s3']['object']['key'])
    print(file_name_only)

    # Outputs: path_to_bad___file_with_symbols__wav
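
If you would rather keep the dots and hyphens, a slightly narrower variant might look like the sketch below. It assumes that Transcribe job names accept only the characters `0-9a-zA-Z._-` and are limited to 200 characters; that constraint is not stated in the question, so check it against the StartTranscriptionJob reference before relying on it. `make_job_name` is a hypothetical helper name, not part of boto3. The key is decoded with `unquote_plus` first because S3 event notifications deliver object keys URL-encoded.

```python
import re
from urllib.parse import unquote_plus

# Assumption: job names may contain only these characters and at most
# 200 of them; verify both against the Transcribe API reference.
_DISALLOWED = re.compile(r"[^0-9a-zA-Z._-]")
_MAX_LEN = 200


def make_job_name(raw_key):
    """Hypothetical helper: turn a URL-encoded S3 event key into a job name."""
    decoded = unquote_plus(raw_key)  # S3 event keys arrive URL-encoded
    return _DISALLOWED.sub("_", decoded)[:_MAX_LEN]


print(make_job_name("path/to/bad+-+file%40with%3Asymbols%21.wav"))
# Outputs: path_to_bad_-_file_with_symbols_.wav
```

In the handler from the question, this would be passed as `TranscriptionJobName=make_job_name(record['s3']['object']['key'])`. Keep in mind that sanitizing or truncating can map two different keys to the same name, so you may want to append something unique if that matters for your recordings.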