
Google Cloud Platform: Speech to Text Conversion of Large Media Files


I'm trying to extract text from an mp4 media file downloaded from YouTube. Since I'm already using Google Cloud Platform, I thought I'd give Google Cloud Speech a try.

After all the installation and configuration, I copied the following code snippet to get started:

import io
from google.cloud import speech
from google.cloud.speech import enums, types

client = speech.SpeechClient()

with io.open(file_name, 'rb') as audio_file:
    content = audio_file.read()
    audio = types.RecognitionAudio(content=content)

config = types.RecognitionConfig(encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16, sample_rate_hertz=16000, language_code='en-US')

response = client.long_running_recognize(config, audio)

But I got the following error about the audio duration:

InvalidArgument: 400 Inline audio exceeds duration limit. Please use a GCS URI.

Then I read that I should use streams for large media files. So, I tried the following code snippet:

with io.open(file_name, 'rb') as audio_file:
    content = audio_file.read()

# In practice, stream should be a generator yielding chunks of audio data.

stream = [content]
requests = (types.StreamingRecognizeRequest(audio_content=chunk) for chunk in stream)

config = types.RecognitionConfig(encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16, sample_rate_hertz=16000, language_code='en-US')

streaming_config = types.StreamingRecognitionConfig(config=config)

responses = client.streaming_recognize(streaming_config, requests)

But I still got the following error:

InvalidArgument: 400 Invalid audio content: too long.

Can anyone suggest an approach for transcribing an mp4 file and extracting its text? I don't have any complex requirements involving very large media files; a file will be 10-15 minutes long at most. Thanks.


Solution

  • The error message means that the audio is too long to send inline with the request. You first need to copy the media file to Google Cloud Storage and then specify a Cloud Storage URI such as gs://bucket/path/mediafile.
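
    As a rough sketch of that decision, the duration of an audio file can be checked with the Python standard library before choosing between inline content and a URI. This assumes the audio track has already been extracted from the mp4 into a WAV file, and the 60-second threshold is an assumption based on the documented limit of roughly one minute for inline audio. The function names are hypothetical.

```python
import wave

# Assumption: inline audio is limited to roughly one minute.
INLINE_LIMIT_SECONDS = 60


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / float(wav_file.getframerate())


def needs_gcs_uri(path, limit=INLINE_LIMIT_SECONDS):
    """Return True when the audio is too long to send inline."""
    return wav_duration_seconds(path) > limit
```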

    The key to using a Cloud Storage URI is:

    RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

    The following code shows how to specify a GCS URI as input. Google has a complete example on GitHub.

      public static void asyncRecognizeGcs(String gcsUri) throws Exception {
        // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
        try (SpeechClient speech = SpeechClient.create()) {
          // Builds the request for a remote FLAC file
          RecognitionConfig config =
              RecognitionConfig.newBuilder()
                  .setEncoding(AudioEncoding.FLAC)
                  .setLanguageCode("en-US")
                  .setSampleRateHertz(16000)
                  .build();
          RecognitionAudio audio = RecognitionAudio.newBuilder().setUri(gcsUri).build();

          // The synchronous recognize() call only accepts about one minute of audio,
          // so start a long-running operation and block until it finishes.
          LongRunningRecognizeResponse response =
              speech.longRunningRecognizeAsync(config, audio).get();
          List<SpeechRecognitionResult> results = response.getResultsList();
    
          for (SpeechRecognitionResult result : results) {
            // There can be several alternative transcripts for a given chunk of speech. Just use the
            // first (most likely) one here.
            SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
            System.out.printf("Transcription: %s%n", alternative.getTranscript());
          }
        }
      }
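
    Since the question uses the Python client, the same approach can be sketched in Python. For a 10-15 minute file the long-running call is the one to use, because the synchronous recognize call is limited to about one minute of audio. This is a minimal sketch, not a definitive implementation: it assumes the same google-cloud-speech 1.x style as the question (the enums and types modules; newer 2.x releases expose these classes directly on speech and expect keyword arguments), that google-cloud-storage is installed, and that the audio has already been converted to 16 kHz FLAC. The helper names, bucket name, and paths are hypothetical.

```python
def gcs_uri(bucket_name, blob_name):
    """Build a gs:// URI from a bucket name and an object path."""
    return "gs://{}/{}".format(bucket_name, blob_name)


def upload_to_gcs(local_path, bucket_name, blob_name):
    """Copy a local file to Cloud Storage and return its gs:// URI."""
    # Imported lazily so the module loads without google-cloud-storage.
    from google.cloud import storage

    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_path)
    return gcs_uri(bucket_name, blob_name)


def transcribe_gcs(uri, timeout=1800):
    """Transcribe audio stored in Cloud Storage with a long-running request."""
    # Assumption: google-cloud-speech 1.x, as in the question.
    from google.cloud import speech
    from google.cloud.speech import enums, types

    client = speech.SpeechClient()
    audio = types.RecognitionAudio(uri=uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=16000,
        language_code='en-US')

    # Start the operation and wait (up to `timeout` seconds) for the result.
    operation = client.long_running_recognize(config, audio)
    response = operation.result(timeout=timeout)

    # Keep only the first (most likely) alternative of each result.
    return [result.alternatives[0].transcript for result in response.results]
```

    With a hypothetical bucket, usage would look like transcribe_gcs(upload_to_gcs('talk.flac', 'my-bucket', 'audio/talk.flac')).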