So although it's still a little shocking to me, Google's default speech recognition completely ignores music and ambient noise. The problem is that for my use case I actually want it to try to transcribe the music!
I'm using the Web Speech API in Chrome 72 with the demo they have.
I can't get it to pick up anything said in music at all, even when I place the speaker right next to the mic.
I also can't get it to pick up YouTube videos or any other videos playing online.
It also doesn't pick up anything my Alexa says.
I have an Android phone, so I'm assuming they're doing something similar to what Amazon does in its commercials: playing an inaudible tone that cancels out the recording. Is there any way to disable this?
It also doesn't work if I play music from my Mac or PC directly.
However, it DOES transcribe if I video chat with someone (using WebRTC, if that matters) and their voice plays through my speakers.
For anyone wondering, I want it to transcribe a video of a human speaking, with no background music, that is playing on the same page. I'm using their demo code to see if this is viable.
Is there any way to recognize these sounds?
To clarify, I'm asking specifically how to disable this for the Web Speech API and not in general for speech recognition.
The Web Speech API is a very specific way to request speech recognition from the browser itself (in Chrome it goes to Google; in Firefox I believe they have a native solution).
There's more info on it here: https://developer.mozilla.org/en-US/docs/Web/API/Web_Speech_API but the documentation is sparse since behavior varies across browsers, and I am specifically asking how to avoid this filtering in Chrome.
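For context, what I'm testing is essentially this minimal sketch of the prefixed Chrome API (the `collectFinalTranscripts` helper and the `continuous`/`interimResults` settings are my own choices for illustration, not something mandated by the demo):

```javascript
// Pure helper (illustration only): collect the finalized transcripts from a
// SpeechRecognitionResultList-like indexed collection.
function collectFinalTranscripts(results) {
  const transcripts = [];
  for (let i = 0; i < results.length; i++) {
    // Each result holds one or more alternatives; [0] is the most likely one.
    if (results[i].isFinal) transcripts.push(results[i][0].transcript);
  }
  return transcripts;
}

// Browser-only wiring; Chrome ships the prefixed webkitSpeechRecognition.
if (typeof window !== 'undefined' &&
    (window.SpeechRecognition || window.webkitSpeechRecognition)) {
  const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  const recognition = new Recognition();
  recognition.continuous = true;      // keep listening across utterances
  recognition.interimResults = true;  // surface partial hypotheses too
  recognition.lang = 'en-US';

  recognition.onresult = (event) => {
    // Log every finalized transcript received so far.
    console.log(collectFinalTranscripts(event.results));
  };
  recognition.onerror = (event) =>
    console.error('recognition error:', event.error);
  recognition.start();
}
```

With this running, speech into the mic gets transcribed fine; the sounds listed above never produce a result at all.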
Note that webkitSpeechRecognition records the microphone's audio input and sends that data to a remote service. The actual code that performs the speech recognition is not shipped with the Chromium source code (from which Chrome is built).
The W3C Web Speech API specification does not provide a default means to process ambient noise or music. In Chromium/Chrome, developers have no control over how the captured audio is processed by the remote service or over the transcript it returns. The fact that user biometric data is recorded and sent to a remote service is not documented outside of at least one Chromium bug report marked WON'T FIX, and issues filed at GitHub.
You might be interested in the open source projects TensorFlow and CMU PocketSphinx, where you can create your own models. Mozilla Common Voice (the voice-web project) contains a substantial amount of data that can be used for training TTS/STT models.
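As a sketch of the TensorFlow route, the @tensorflow-models/speech-commands package runs a small keyword-spotting model entirely in the browser, so no audio leaves the page. Note this is keyword recognition, not full transcription; the package name, `'BROWSER_FFT'`, and the `listen()` options come from that project, while the `topLabel` helper is just illustration:

```javascript
// Pure helper (illustration only): map the model's score array to the most
// likely word label.
function topLabel(labels, scores) {
  let best = 0;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] > scores[best]) best = i;
  }
  return labels[best];
}

// Browser-only wiring; assumes @tensorflow/tfjs and
// @tensorflow-models/speech-commands are loaded (e.g. via a bundler) and
// passed in as `speechCommands`.
async function startKeywordSpotting(speechCommands) {
  const recognizer = speechCommands.create('BROWSER_FFT');
  await recognizer.ensureModelLoaded();
  recognizer.listen(({ scores }) => {
    // scores is a Float32Array aligned with recognizer.wordLabels().
    console.log('heard:', topLabel(recognizer.wordLabels(), scores));
  }, { probabilityThreshold: 0.75 });
}
```

Because the model runs locally, whatever filtering Google's remote service applies to music or ambient noise simply doesn't apply here, and you can retrain or fine-tune the model on your own data.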