node.js, google-cloud-speech

Google Speech to Text API gives different results locally than the online demo


I have a 6-second MP3 audio file. I first tested it directly on https://cloud.google.com/speech-to-text/ and the response was as expected:

"hello brother how are you doing I'm doing really well hope mom is doing well"

Then I created a Firebase Cloud Function (see code below):

const functions = require('firebase-functions')
const speech = require('@google-cloud/speech').v1p1beta1

exports.speechToText = functions.https.onRequest(async (req, res) => {
  try {
    // Creates a client
    const client = new speech.SpeechClient()
    const gcsUri = `gs://xxxxx.appspot.com/speech.mp3`

    const config = {
      encoding: 'MP3',
      languageCode: 'en-US',
      enableAutomaticPunctuation: false,
      enableWordTimeOffsets: false,
    }
    const audio = {
      uri: gcsUri,
    }

    const request = {
      config: config,
      audio: audio,
    }

    // Detects speech in the audio file
    const [response] = await client.recognize(request)
    const transcription = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n')
    console.log(`Transcription: ${transcription}`)
    res.send({ response })
  } catch (error) {
    console.error('error:', error)
    // `message` is not enumerable on Error objects, so send it explicitly
    res.status(500).send({
      error: error.message,
    })
  }
})

And I get the following INCORRECT response:

"hello brother, how are you doing hope all is doing well"
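To see exactly which words were dropped, the two transcripts can be compared word by word. This is only a rough sketch: toWords and missingWords are hypothetical helper names, and the comparison counts words without regard to their order, so it is a quick check rather than a proper diff.

```javascript
// Lowercase, strip punctuation (keeping apostrophes), and split into words
function toWords(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, '')
    .split(/\s+/)
    .filter(Boolean)
}

// Hypothetical helper: words in `expected` that `actual` does not account for
function missingWords(expected, actual) {
  // Count how many times each word occurs in the actual transcript
  const remaining = new Map()
  for (const word of toWords(actual)) {
    remaining.set(word, (remaining.get(word) || 0) + 1)
  }
  // Keep each expected word that cannot be matched against a remaining occurrence
  return toWords(expected).filter(word => {
    const count = remaining.get(word) || 0
    if (count > 0) {
      remaining.set(word, count - 1)
      return false
    }
    return true
  })
}

const expected = "hello brother how are you doing I'm doing really well hope mom is doing well"
const actual = 'hello brother, how are you doing hope all is doing well'
console.log(missingWords(expected, actual))
```

For the two transcripts above, this lists the words from the middle of the sentence ("I'm doing really well") plus "mom", which the API replaced with "all".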

UPDATE: I get the same INCORRECT response when running the code locally, so Cloud Functions is not the issue.

UPDATE #2: Setting model: 'video' or model: 'phone_call' in the config solved the issue, i.e.:

    const config = {
      encoding: 'MP3',
      languageCode: 'en-US',
      enableAutomaticPunctuation: false,
      enableWordTimeOffsets: false,
      model: 'phone_call',
    }

Solution

  • Setting model: 'video' or model: 'phone_call' in the config solved the issue, i.e.:

    const config = {
      encoding: 'MP3',
      languageCode: 'en-US',
      enableAutomaticPunctuation: false,
      enableWordTimeOffsets: false,
      model: 'phone_call',
    }

    I suppose the default model doesn't work well on certain types of audio.
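To find out which model suits a given recording, one option is to run the same file through several models and compare the transcripts side by side. The following is a minimal sketch, assuming a client object with the same recognize(request) method used in the question. buildConfig and compareModels are hypothetical helper names, not part of @google-cloud/speech.

```javascript
// Builds the recognition config from the question; `model` is optional,
// and omitting it leaves the choice of model to the API.
function buildConfig(model) {
  const config = {
    encoding: 'MP3',
    languageCode: 'en-US',
    enableAutomaticPunctuation: false,
    enableWordTimeOffsets: false,
  }
  if (model) config.model = model
  return config
}

// Hypothetical helper: transcribes the same audio once per model and
// returns an object mapping each model name to its transcript.
async function compareModels(client, gcsUri, models) {
  const transcripts = {}
  for (const model of models) {
    const [response] = await client.recognize({
      config: buildConfig(model),
      audio: { uri: gcsUri },
    })
    // An undefined entry in `models` is reported under the key 'default'
    transcripts[model || 'default'] = response.results
      .map(result => result.alternatives[0].transcript)
      .join('\n')
  }
  return transcripts
}
```

With the real client this could be called as compareModels(new speech.SpeechClient(), gcsUri, [undefined, 'video', 'phone_call']), where undefined leaves the model unset. Passing the client in as an argument also makes it easy to try the helper against a stub before spending API quota.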