Tags: amazon-web-services, boto3, aws-cdk, amazon-transcribe

AWS Transcribe: unable to call start_transcription_job without LanguageCode in boto3


I have an audio file in S3.

I don't know the language of the audio file, so I need to pass IdentifyLanguage to start_transcription_job().

LanguageCode will be left out, since I don't know the language of the audio file.

Environment

Python 3.8 runtime, boto3 version 1.16.5, botocore version 1.19.5, no Lambda Layer.

Here is my code for the Transcribe job:

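import boto3

# bucket_name and prefixKey are defined elsewhere in the handler (not shown)
mediaFileUri = f's3://{bucket_name}/{prefixKey}'

transcribe_client = boto3.client('transcribe')

response = transcribe_client.start_transcription_job(
    TranscriptionJobName="abc",
    IdentifyLanguage=True,  # ask Transcribe to detect the language
    Media={
        'MediaFileUri': mediaFileUri
    },
)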

Then I get this error:

{
  "errorMessage": "Parameter validation failed:\nMissing required parameter in input: \"LanguageCode\"\nUnknown parameter in input: \"IdentifyLanguage\", must be one of: TranscriptionJobName, LanguageCode, MediaSampleRateHertz, MediaFormat, Media, OutputBucketName, OutputEncryptionKMSKeyId, Settings, ModelSettings, JobExecutionSettings, ContentRedaction",
  "errorType": "ParamValidationError",
  "stackTrace": [
    "  File \"/var/task/app.py\", line 27, in TranscribeSoundToWordHandler\n    response = response = transcribe_client.start_transcription_job(\n",
    "  File \"/var/runtime/botocore/client.py\", line 316, in _api_call\n    return self._make_api_call(operation_name, kwargs)\n",
    "  File \"/var/runtime/botocore/client.py\", line 607, in _make_api_call\n    request_dict = self._convert_to_request_dict(\n",
    "  File \"/var/runtime/botocore/client.py\", line 655, in _convert_to_request_dict\n    request_dict = self._serializer.serialize_to_request(\n",
    "  File \"/var/runtime/botocore/validate.py\", line 297, in serialize_to_request\n    raise ParamValidationError(report=report.generate_report())\n"
  ]
}

This error says that I must specify LanguageCode and that IdentifyLanguage is an invalid parameter.

I am 100% sure the audio file exists in S3. But the call doesn't work without LanguageCode, and IdentifyLanguage is reported as an unknown parameter.

I am using a SAM application to test locally with this command:

sam local invoke MyHandler -e lambda\TheDirectory\event.json

I also ran cdk deploy and tested in the AWS Lambda console with the same event file, but I still get the same error.

I think this comes from the Lambda execution environment; I didn't use any Lambda Layer.
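
One way to test that suspicion is to log which SDK the function actually imports and whether its bundled API model knows the parameter. The following is a minimal sketch, not a definitive diagnostic: the helper names `supports_parameter` and `describe_sdk` and the default region are assumptions made for illustration.

```python
def supports_parameter(client, operation_name, parameter):
    """Return True if the client's bundled API model lists `parameter` as an input."""
    operation = client.meta.service_model.operation_model(operation_name)
    return parameter in operation.input_shape.members


def describe_sdk(region_name="us-east-1"):
    """Collect versions and file location of the SDK that is actually imported.

    The default region is an arbitrary placeholder; no AWS call is made here.
    """
    # Imported here so the result reflects whatever the running environment resolves.
    import boto3
    import botocore

    client = boto3.client("transcribe", region_name=region_name)
    return {
        "boto3": boto3.__version__,
        "botocore": botocore.__version__,
        "botocore_path": botocore.__file__,
        "identify_language_supported": supports_parameter(
            client, "StartTranscriptionJob", "IdentifyLanguage"
        ),
    }
```

Calling `print(describe_sdk())` at the top of the handler would show the versions in use. A `botocore_path` under /var/runtime, like the paths in the stack trace, would suggest the runtime's own copy is being loaded rather than a packaged one.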

I looked at the AWS Transcribe docs:

https://docs.aws.amazon.com/transcribe/latest/dg/API_StartTranscriptionJob.html

and the boto3 docs:

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/transcribe.html#TranscribeService.Client.start_transcription_job

Both clearly state that LanguageCode is not required and that IdentifyLanguage is a valid parameter.

So what am I missing? Any ideas? What should I do?

Update:

I kept searching and asked a couple of people online. I think I should build the function container first so that SAM packages boto3 into the container.

So I ran cdk synth to produce a template file:

cdk synth --no-staging > template.yaml

Then:

sam build --use-container
sam local invoke MyHandler78A95900 -e lambda\TheDirectory\event.json

But I still get the same error. Here is the stack trace as well:

[ERROR] ParamValidationError: Parameter validation failed:
Missing required parameter in input: "LanguageCode"
Unknown parameter in input: "IdentifyLanguage", must be one of: TranscriptionJobName, LanguageCode, MediaSampleRateHertz, MediaFormat, Media, OutputBucketName, OutputEncryptionKMSKeyId, Settings, JobExecutionSettings, ContentRedaction
Traceback (most recent call last):
  File "/var/task/app.py", line 27, in TranscribeSoundToWordHandler
    response = response = transcribe_client.start_transcription_job(
  File "/var/runtime/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/runtime/botocore/client.py", line 607, in _make_api_call
    request_dict = self._convert_to_request_dict(
  File "/var/runtime/botocore/client.py", line 655, in _convert_to_request_dict
    request_dict = self._serializer.serialize_to_request(
  File "/var/runtime/botocore/validate.py", line 297, in serialize_to_request
    raise ParamValidationError(report=report.generate_report())

I really have no clue what I am doing wrong here. I also reported a GitHub issue, but it seems the problem can't be reproduced there.

Main Question/Problem:

Unable to call start_transcription_job

  1. without LanguageCode

  2. with IdentifyLanguage=True

What could cause this, and how can I solve it? (I don't know the language of the audio file, and I want Transcribe to identify it without my supplying LanguageCode.)
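
For reference, this is roughly what the call is expected to do once the client accepts IdentifyLanguage. It is a minimal sketch under stated assumptions: the function name `transcribe_with_language_id` and the `poll_seconds` / `max_polls` parameters are hypothetical, and `client` is assumed to be a boto3 Transcribe client whose bundled model supports IdentifyLanguage.

```python
import time


def transcribe_with_language_id(client, job_name, media_uri, poll_seconds=5, max_polls=60):
    """Start a job with automatic language identification and wait for it to finish."""
    client.start_transcription_job(
        TranscriptionJobName=job_name,
        IdentifyLanguage=True,  # LanguageCode is omitted: the service detects it
        Media={"MediaFileUri": media_uri},
    )
    for _ in range(max_polls):
        job = client.get_transcription_job(TranscriptionJobName=job_name)["TranscriptionJob"]
        if job["TranscriptionJobStatus"] in ("COMPLETED", "FAILED"):
            return job
        time.sleep(poll_seconds)  # job still queued or in progress
    raise TimeoutError(f"{job_name} did not finish after {max_polls} polls")
```

On a completed job, `job["LanguageCode"]` should hold the detected language. Polling inside a Lambda handler is only for illustration, since a short function timeout would normally be reached before the job completes.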


Solution

  • In the end I noticed this was because my packaged Lambda function wasn't being uploaded for some reason. Here is how I solved it, with help from a couple of people.

    First, modify the CDK stack that defines my Lambda function like this:

    from aws_cdk import (
        aws_lambda as lambda_,
        core
    )
    
    from aws_cdk.aws_lambda_python import PythonFunction
    
    class MyCdkStack(core.Stack):
    
        def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
            super().__init__(scope, id, **kwargs)
    
            # define lambda 
            my_lambda = PythonFunction(
                self, 'MyHandler',
                entry='lambda/MyHandler',
                index='app.py',
                runtime=lambda_.Runtime.PYTHON_3_8,
                handler='MyHandler', 
                timeout=core.Duration.seconds(10)
            )
    

    This uses the aws-lambda-python module, which handles installing all required modules inside Docker.

    Next, run cdk synth to produce a template file:

    cdk synth --no-staging > template.yaml 
    

    At this point, it bundles everything inside the entry path defined in PythonFunction and installs the dependencies listed in the requirements.txt inside that entry path.

    Next, build the Docker container:

    sam build --use-container
    

    Make sure the template.yaml file is in the root directory. This builds a Docker container, and the artifacts are written to the .aws-sam/build directory in my root directory.

    Last step, invoke the function using sam:

    sam local invoke MyHandler78A95900 -e path\to\event.json
    

    Now start_transcription_job finally runs as written in my question above, without any error.

    In conclusion:

    1. At the very beginning I only ran pip install boto3, which installs boto3 on my local system only.
    2. Then I ran sam local invoke without first building the container with sam build --use-container.
    3. Finally, I did run sam build, but at that point nothing defined in requirements.txt was bundled into .aws-sam/build, which is why the aws-lambda-python module is needed, as described above.
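
A quick way to confirm the third point is to check that the SDK packages really landed in the build output. The following is a minimal sketch, assuming the built function sits in a folder named after its logical ID under .aws-sam/build. The helper name `missing_packages` is hypothetical.

```python
from pathlib import Path


def missing_packages(function_build_dir, packages=("boto3", "botocore")):
    """Return the package names that have no directory inside the built function folder."""
    root = Path(function_build_dir)
    return [name for name in packages if not (root / name).is_dir()]
```

For example, `missing_packages(".aws-sam/build/MyHandler78A95900")` returning an empty list would indicate that both packages were bundled. A non-empty list would mean the function falls back to whatever SDK the runtime provides.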