python speech-to-text google-speech-api google-speech-to-text-api hint-phrases

Custom phrases/words are ignored by Google Speech-To-Text


I am using Python 3 to transcribe an audio file with Google Speech-to-Text via the provided Python package (google-cloud-speech).

There is an option to define custom phrases which should be used for transcription as stated in the docs: https://cloud.google.com/speech-to-text/docs/speech-adaptation

For testing purposes I am using a small audio file containing the following text:

[..] in this lecture we'll talk about the Burrows wheeler transform and the FM index [..]

I pass the following phrases to see their effect, for example when I want a specific name to be transcribed with a particular spelling. In this example I want to change "burrows" to "barrows":

config = speech.RecognitionConfig(dict(
    encoding=speech.RecognitionConfig.AudioEncoding.ENCODING_UNSPECIFIED,
    sample_rate_hertz=24000,
    language_code="en-US",
    enable_word_time_offsets=True,
    speech_contexts=[
        speech.SpeechContext(dict(
            phrases=["barrows", "barrows wheeler", "barrows wheeler transform"]
        ))
    ]
))

Unfortunately this does not seem to have any effect as the output is still the same as without the context phrases.

Am I using the phrases incorrectly, or is the recognizer so confident that the word it hears is "burrows" that it ignores my phrases?

PS: I also tried using the speech_v1p1beta1.AdaptationClient and speech_v1p1beta1.SpeechAdaptation instead of putting the phrases into the config but this only gives me an internal server error with no additional information on what is going wrong. https://cloud.google.com/speech-to-text/docs/adaptation


Solution

  • I created an audio file to reproduce your scenario and was able to improve the recognition using model adaptation. I would suggest taking a look at this example and this post to better understand model adaptation.

    Now, to improve the recognition of your phrase, I performed the following:

    1. I created a new audio file using the following page with the mentioned phrase.

    in this lecture we'll talk about the Burrows wheeler transform and the FM index

    2. My tests were based on this code sample. The code creates a PhraseSet and a CustomClass that include the word you would like to boost, in this case "barrows". You can also create, update, or delete the phrase set and custom class using the Speech-to-Text GUI. Below is the code I used.
    from google.cloud import speech_v1p1beta1 as speech
    import argparse
    
    
    def transcribe_with_model_adaptation(
        project_id="[PROJECT-ID]", location="global", speech_file=None, custom_class_id="[CUSTOM-CLASS-ID]", phrase_set_id="[PHRASE-SET-ID]"
    ):
        """
        Create `PhraseSet` and `CustomClasses` to create custom lists of similar
        items that are likely to occur in your input data.
        """
        import io
    
        # Create the adaptation client
        adaptation_client = speech.AdaptationClient()
    
        # The parent resource where the custom class and phrase set will be created.
        parent = f"projects/{project_id}/locations/{location}"
    
        # Create the custom class resource
        adaptation_client.create_custom_class(
            {
                "parent": parent,
                "custom_class_id": custom_class_id,
                "custom_class": {
                    "items": [
                        {"value": "barrows"}
                    ]
                },
            }
        )
        custom_class_name = (
            f"projects/{project_id}/locations/{location}/customClasses/{custom_class_id}"
        )
        # Create the phrase set resource
        phrase_set_response = adaptation_client.create_phrase_set(
            {
                "parent": parent,
                "phrase_set_id": phrase_set_id,
                "phrase_set": {
                    "boost": 0,
                    "phrases": [
                        {"value": f"${{{custom_class_name}}}", "boost": 10},
                        {"value": f"talk about the ${{{custom_class_name}}} wheeler transform", "boost": 15}
                    ],
                },
            }
        )
        phrase_set_name = phrase_set_response.name
        # print(u"Phrase set name: {}".format(phrase_set_name))
     
        # The next section shows how to use the newly created custom
        # class and phrase set to send a transcription request with speech adaptation
    
        # Speech adaptation configuration
        speech_adaptation = speech.SpeechAdaptation(
            phrase_set_references=[phrase_set_name])
    
        # speech configuration object
        config = speech.RecognitionConfig(
            encoding=speech.RecognitionConfig.AudioEncoding.FLAC,
            sample_rate_hertz=24000,
            language_code="en-US",
            adaptation=speech_adaptation,
            enable_word_time_offsets=True,
            model="phone_call",
            use_enhanced=True
        )
    
        # The name of the audio file to transcribe
        # storage_uri URI for audio file in Cloud Storage, e.g. gs://[BUCKET]/[FILE]
        with io.open(speech_file, "rb") as audio_file:
            content = audio_file.read()
    
        audio = speech.RecognitionAudio(content=content)
        # audio = speech.RecognitionAudio(uri="gs://biasing-resources-test-audio/call_me_fionity_and_ionity.wav")
    
        # Create the speech client
        speech_client = speech.SpeechClient()
    
        response = speech_client.recognize(config=config, audio=audio)
    
        for result in response.results:
            # The first alternative is the most likely one for this portion.
            print(u"Transcript: {}".format(result.alternatives[0].transcript))
    
    
    
    if __name__ == "__main__":
        parser = argparse.ArgumentParser(
            description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter
        )
        parser.add_argument("path", help="Path for audio file to be recognized")
        args = parser.parse_args()
    
        transcribe_with_model_adaptation(speech_file=args.path)
    
    
    
    3. Once it runs, you will get an improved transcription like the one below. Note, however, that the code tries to create a new custom class and a new phrase set every time it runs, so it may fail with an "element already exists" error if you run it again with the same IDs.
    • Using the recognition without the adaptation
    (python_speech2text) user@penguin:~/replication/python_speech2text$ python speech_model_adaptation_beta.py audio.flac
    Transcript: in this lecture will talk about the Burrows wheeler transform and the FM index
    


    • Using the recognition with the adaptation
    (python_speech2text) user@penguin:~/replication/python_speech2text$ python speech_model_adaptation_beta.py audio.flac
    Transcript: in this lecture will talk about the barrows wheeler transform and the FM index
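
If you want to re-run the script without hitting the "already exists" error, one option is to reuse the existing resources. The sketch below is only an illustration under a few assumptions: `ensure_adaptation_resources` is a hypothetical helper of mine, not part of the library; I assume the client raises `google.api_core.exceptions.AlreadyExists` for a duplicate ID (you would pass that class as `already_exists_error`); and I assume phrase set names follow the same `projects/.../locations/.../phraseSets/...` pattern as the custom class name built in the code above.

```python
def ensure_adaptation_resources(
    client,
    parent,
    custom_class_id,
    phrase_set_id,
    custom_class,
    phrase_set,
    already_exists_error,
):
    """Create the custom class and phrase set, reusing them if they exist.

    `client` is expected to behave like speech.AdaptationClient().
    `already_exists_error` is the exception class raised for a duplicate ID
    (assumed to be google.api_core.exceptions.AlreadyExists in real use).
    Returns the resource name of the phrase set.
    """
    try:
        client.create_custom_class(
            {
                "parent": parent,
                "custom_class_id": custom_class_id,
                "custom_class": custom_class,
            }
        )
    except already_exists_error:
        # The custom class was created by an earlier run; keep it as it is.
        pass

    try:
        response = client.create_phrase_set(
            {
                "parent": parent,
                "phrase_set_id": phrase_set_id,
                "phrase_set": phrase_set,
            }
        )
        return response.name
    except already_exists_error:
        # Assumed naming pattern for an existing phrase set.
        return f"{parent}/phraseSets/{phrase_set_id}"
```

With the real library, `client` would be `speech.AdaptationClient()`. Keep in mind that a reused phrase set keeps its old contents, so changes you make to the phrases in code are not applied; in that case update it through the UI or delete it first.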
    



    Finally, I would like to add some notes about the improvement and the code I used:

    • I used a FLAC audio file, as it is recommended for optimal results.

    • I used model="phone_call" and use_enhanced=True because this is the model Cloud Speech-to-Text identified for my own audio file. The enhanced model can also provide better results; see the documentation for more details. Note that the best configuration may be different for your audio file.

    • Consider enabling data logging so that Google can collect data from your audio transcription requests. Google then uses this data to improve the machine learning models it uses for speech recognition.

    • Once the custom class and the phrase set have been created, you can use the Speech-to-Text UI to update them and run your tests quickly.

    • In the phrase set I used the boost parameter. When you use boost, you assign a weighted value to phrase items in a PhraseSet resource. Speech-to-Text refers to this weighted value when selecting a possible transcription for words in your audio data. The higher the value, the higher the likelihood that Speech-to-Text chooses that word or phrase from the possible alternatives.
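
To make the `${...}` placeholder syntax and the boost values easier to see, here is a small sketch that builds the same kind of phrase_set dictionary as in the code above, without calling the API. `class_token` and `build_phrase_set` are hypothetical helper names of mine, and the project and class IDs in the demo are made up. The boost values 10 and 15 are simply the ones I used, not recommended settings.

```python
def class_token(custom_class_name):
    """Wrap a custom class resource name in the ${...} placeholder syntax.

    This produces the same string as f"${{{custom_class_name}}}".
    """
    return "${" + custom_class_name + "}"


def build_phrase_set(phrases, set_boost=0):
    """Build a phrase_set dict from (text, boost) pairs.

    A boost of None leaves the per-phrase boost out, so only the
    set-level value in "boost" is present for that phrase.
    """
    items = []
    for text, boost in phrases:
        item = {"value": text}
        if boost is not None:
            item["boost"] = boost
        items.append(item)
    return {"boost": set_boost, "phrases": items}


if __name__ == "__main__":
    # Hypothetical resource name, for illustration only.
    name = "projects/my-project/locations/global/customClasses/my-class"
    token = class_token(name)
    phrase_set = build_phrase_set(
        [
            (token, 10),
            (f"talk about the {token} wheeler transform", 15),
        ]
    )
    print(phrase_set)
```

The resulting dictionary has the same shape as the "phrase_set" value passed to create_phrase_set in the code above, so you can raise or lower a single boost value and re-test to see how strongly it influences the transcript.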

    I hope this information helps you to improve your recognitions.