google-cloud-platform speech-to-text google-speech-to-text-api video-intelligence-api

How are Speech-to-Text and Video Intelligence SPEECH_TRANSCRIPTION related?


My goal is to process several videos using a speech-to-text model.

Google confusingly has two products that seem to do the same thing.

What are the major differences between these offerings?

  1. Google Cloud Speech-to-Text: https://cloud.google.com/speech-to-text/docs/basics

    • Speech-to-Text has an "enhanced video" model for interpreting the audio.
  2. Google Video Intelligence: https://cloud.google.com/video-intelligence/docs/feature-speech-transcription

    • VI has the option to request a SPEECH_TRANSCRIPTION feature

Solution

  • The main difference between the two is the input each one accepts. The Speech-to-Text API only accepts audio input, while the Video Intelligence API accepts video input.

    As for the "enhanced video" model mentioned in your question: it is a Speech-to-Text model designed to transcribe audio that originated from video files. In other words, the original file was a video that was then converted to audio. As seen in this tutorial, the video was converted to audio before it was transcribed.
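To make the Speech-to-Text route concrete, here is a minimal sketch of the two steps: extracting the audio track, then transcribing it with the video model. It rests on several assumptions that are not part of the question: `ffmpeg` is installed locally, the `google-cloud-speech` Python client is installed and authenticated, and the extracted audio has been uploaded to a Cloud Storage URI. The helper names (`build_ffmpeg_command`, `extract_audio`, `transcribe_audio`) and the 16 kHz mono FLAC settings are illustrative choices, not requirements.

```python
import subprocess


def build_ffmpeg_command(video_path, audio_path, sample_rate_hz=16000):
    """Build an ffmpeg command that extracts mono FLAC audio from a video."""
    return [
        "ffmpeg",
        "-i", video_path,             # input video file
        "-vn",                        # drop the video stream
        "-ac", "1",                   # downmix to a single channel
        "-ar", str(sample_rate_hz),   # resample the audio
        "-c:a", "flac",               # encode as lossless FLAC
        audio_path,
    ]


def extract_audio(video_path, audio_path):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_command(video_path, audio_path), check=True)


def transcribe_audio(gcs_uri, language_code="en-US"):
    """Transcribe an audio file in Cloud Storage with the "video" model."""
    # Imported here so the helpers above work without the client library.
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        language_code=language_code,
        model="video",  # model tuned for audio that came from video
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)

    # Long-running recognition is the variant intended for longer audio.
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=600)
    return [r.alternatives[0].transcript for r in response.results if r.alternatives]
```

A call such as `extract_audio("talk.mp4", "talk.flac")`, followed by an upload and `transcribe_audio("gs://your-bucket/talk.flac")`, would cover one video; the file and bucket names are placeholders.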

    I suggest using the Video Intelligence API if you would like to transcribe the speech in your videos directly, without first converting them to audio. You can follow this tutorial on how to transcribe speech using the Video Intelligence API.
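For comparison, a minimal sketch of the Video Intelligence route, which sends the video itself and requests the SPEECH_TRANSCRIPTION feature. It assumes the `google-cloud-videointelligence` Python client is installed and authenticated and that the video is already in Cloud Storage. The helper names (`build_video_context`, `collect_transcripts`, `transcribe_video`), the language code, and the timeout are illustrative.

```python
def build_video_context(language_code="en-US"):
    """Build the video_context part of the request as a plain dict."""
    return {
        "speech_transcription_config": {
            "language_code": language_code,
            "enable_automatic_punctuation": True,
        }
    }


def collect_transcripts(annotation_result):
    """Return the top transcript of every transcribed speech segment."""
    transcripts = []
    for transcription in annotation_result.speech_transcriptions:
        # Skip segments that came back without any alternative.
        if transcription.alternatives:
            transcripts.append(transcription.alternatives[0].transcript)
    return transcripts


def transcribe_video(gcs_uri, language_code="en-US"):
    """Transcribe the speech in a video stored in Cloud Storage."""
    # Imported here so the helpers above work without the client library.
    from google.cloud import videointelligence

    client = videointelligence.VideoIntelligenceServiceClient()
    operation = client.annotate_video(
        request={
            "features": [videointelligence.Feature.SPEECH_TRANSCRIPTION],
            "input_uri": gcs_uri,
            "video_context": build_video_context(language_code),
        }
    )
    # Annotation is a long-running operation; block until it finishes.
    result = operation.result(timeout=600)
    # One annotation result is returned for the single input video.
    return collect_transcripts(result.annotation_results[0])
```

To process several videos, `transcribe_video` could be called once per URI, for example `transcribe_video("gs://your-bucket/talk.mp4")`, where the bucket and file name are placeholders.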