
When should I use the enhanced video model with Google Cloud's Speech-to-Text API?


The enhanced model for phone calls makes sense to me because phone call audio generally has a particular quality/sound. I don't know what to expect from the 'video' enhanced model, however, and there seems to be no documentation for it. There could be a huge range of sound quality in a video, from a pristine studio-recorded videocast to someone's barely audible speech recorded outdoors on an iPhone when it's windy. The audio compression in a video could be all over the place as well. What specific scenarios is the 'video' model actually designed for? When will it work better than either the default model or the phone call model?


Solution

  • The Speech-to-Text API offers prebuilt models, each suited to specific scenarios. One of them is the video model, which the documentation describes as follows:

    Use this model for transcribing audio from video clips or other sources (such as podcasts) that have multiple speakers. This model is also often the best choice for audio that was recorded with a high-quality microphone or that has lots of background noise. For best results, provide audio recorded at 16,000Hz or greater sampling rate.

    Note: This is a premium model that costs more than the standard rate.

    See "Selecting models" in the Speech-to-Text documentation for more details on which model to use.
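
    To make the model choice concrete, here is a minimal sketch of a request body for the v1 `speech:recognize` REST method that selects the video model. The helper name `build_recognize_request`, the bucket URI, the `LINEAR16` encoding, and the `en-US` language code are illustrative assumptions and do not come from the answer above. The sketch only builds the JSON body; it does not send it, so authentication and the HTTP call are left out.

```python
import json


def build_recognize_request(gcs_uri, model="video", language_code="en-US",
                            sample_rate_hertz=16000, use_enhanced=False):
    """Build a JSON-serializable body for a v1 `speech:recognize` request.

    Hypothetical helper: the name and the defaults are assumptions made
    for this example only.
    """
    config = {
        # Assumed encoding: uncompressed 16-bit PCM (for example, a WAV file).
        "encoding": "LINEAR16",
        # The quoted guidance recommends 16,000 Hz or greater for the video model.
        "sampleRateHertz": sample_rate_hertz,
        "languageCode": language_code,
        # Other values mentioned in this thread: "default" and "phone_call".
        "model": model,
    }
    if use_enhanced:
        # Optional flag that asks for an enhanced variant when one exists.
        config["useEnhanced"] = True
    return {"config": config, "audio": {"uri": gcs_uri}}


if __name__ == "__main__":
    # Placeholder URI; replace with audio you actually own.
    body = build_recognize_request("gs://example-bucket/interview.wav")
    print(json.dumps(body, indent=2))
```

    Passing `model="phone_call"` or `model="default"` to the same helper makes it easy to send one clip through each model and compare the transcripts, which is probably the most reliable way to judge whether the premium video model is worth the extra cost for your audio.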