Tags: python, google-cloud-platform, speech-recognition, google-speech-api, transcription

Google Cloud Speech API doesn't work well with noisy audio


I have been trying to develop a Python script to transcribe noisy audio files. My specific use case is to get the noisy parts of the audio transcribed correctly. When I send the files to the Speech API for processing, the responses for the noisy audio are either omitted or incorrect. Is there any approach to solve this? I have tried a couple of tools, such as SoX and the speech-recognition wrapper, but they didn't help. Below is the code I am using:

def transcribe_gcs(gcs_uri):
    """Asynchronously transcribes the audio file specified by the gcs_uri."""
    from google.cloud import speech
    from google.cloud.speech import enums
    from google.cloud.speech import types

    client = speech.SpeechClient()
    audio = types.RecognitionAudio(uri=gcs_uri)
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.FLAC,
        sample_rate_hertz=48000,
        language_code='en-US')

    operation = client.long_running_recognize(config, audio)
    print('Waiting for operation to complete...')
    response = operation.result(timeout=600)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print('Transcript: {}'.format(result.alternatives[0].transcript))
        print('Confidence: {}'.format(result.alternatives[0].confidence))


if __name__ == '__main__':
    gcs_uri = 'gs://speechmldemo/outputclear.flac'
    transcribe_gcs(gcs_uri)

Solution

  • As far as I know, the quality of the Speech API results will always depend heavily on external noise and the overall quality of the recording. The only ways I can think of to substantially improve your results are:

    1. Reduce the noise levels at the source, if possible (i.e. at recording time).
    2. Digitally filter the noise out before processing, removing frequency bands not used by human speech (cutting everything above 4 kHz is standard in telephony).
    3. Prefer an uncompressed audio format (e.g. WAV) to avoid the loss of quality introduced by lossy compression (as happens with MP3).

    You might find additional tips for improving the processing in the official documentation.
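
    Point 2 (filtering out frequency bands not used by speech before submitting the audio) can be sketched with SciPy. This is a minimal illustration, not part of the original answer: the file names, the cutoff frequencies (300 Hz–3.4 kHz, the classic telephony speech band), and the filter order are assumptions you would tune for your own recordings.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfiltfilt

def bandpass_speech(in_path, out_path, low_hz=300.0, high_hz=3400.0, order=5):
    """Band-pass a mono 16-bit PCM WAV file to the given speech band."""
    rate, data = wavfile.read(in_path)
    # Butterworth band-pass in second-order sections (numerically stable
    # at higher orders, unlike the transfer-function form).
    sos = butter(order, [low_hz, high_hz], btype='bandpass', fs=rate,
                 output='sos')
    # Zero-phase filtering: no phase distortion of the speech signal.
    filtered = sosfiltfilt(sos, data.astype(np.float64))
    wavfile.write(out_path, rate, filtered.astype(np.int16))

# Demo on a synthetic file: a 1 kHz "speech" tone plus an 8 kHz "noise" tone.
# After filtering, the 8 kHz component should be strongly attenuated.
rate = 48000
t = np.arange(rate) / rate
signal = (10000 * np.sin(2 * np.pi * 1000 * t)
          + 10000 * np.sin(2 * np.pi * 8000 * t))
wavfile.write('noisy.wav', rate, signal.astype(np.int16))
bandpass_speech('noisy.wav', 'filtered.wav')
```

    In practice you would run such a filter on the raw WAV before converting it to FLAC and uploading it to Cloud Storage; real background noise overlapping the speech band will of course survive a simple band-pass, so this only helps against out-of-band noise.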