Tags: python, google-cloud-speech, python-sounddevice

Sending audio data generated by python-sounddevice.RawInputStream to Google Cloud Speech-to-Text for asynchronous recognition


I'm working on a script that sends data from a microphone to the Google Cloud Speech-to-Text API. I need to access the gRPC API to produce live readings during recording. Once the recording is complete, I need to access the REST API for more precise asynchronous recognition.

The live streaming part is working. It is based on the quickstart sample, but with python-sounddevice instead of PyAudio. The stream below records cffi_backend_buffer objects into a queue; a separate thread collects these objects, converts them to bytes, and feeds them to the API (sketched after the class below).

import queue

import sounddevice

class MicrophoneStream:
    def __init__(self, rate, blocksize, queue_live, queue):
        self.queue = queue
        self.queue_live = queue_live
        self._audio_stream = sounddevice.RawInputStream(
            samplerate=rate,
            dtype='int16',
            callback=self.callback,
            blocksize=blocksize,
            channels=1,
        )

    def __enter__(self):
        self._audio_stream.start()
        return self

    def stop(self):
        self._audio_stream.stop()

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop()
        self._audio_stream.close()

    def callback(self, indata, frames, time, status):
        self.queue.put(indata)
        self.queue_live.put(indata)
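
For context, the consuming side of queue_live looks roughly like the quickstart's request generator. This is a minimal sketch, assuming google-cloud-speech 2.x; the request_generator helper and the None sentinel are my illustration, not code from the original script:

from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US')
streaming_config = speech.StreamingRecognitionConfig(
    config=config, interim_results=True)

def request_generator(q):
    # Pull chunks off the live queue until a None sentinel arrives.
    while True:
        chunk = q.get()
        if chunk is None:
            return
        yield speech.StreamingRecognizeRequest(audio_content=bytes(chunk))

responses = client.streaming_recognize(
    config=streaming_config, requests=request_generator(queue_live))
for response in responses:
    for result in response.results:
        print(result.alternatives[0].transcript)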

There is a second queue that I planned to use for asynchronous recognition once the recording is complete. However, just sending the byte string as I did with live recognition does not seem to work:

from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    max_alternatives=1)

# Drain the recording queue and join the chunks into one byte string
audio_data = []
while not queue.empty():
    audio_data.append(queue.get(False))
audio_data = b"".join(audio_data)

audio = speech.RecognitionAudio(content=audio_data)

response = client.recognize(config=config, audio=audio)

Since sending byte strings of raw audio data works with streaming recognition, I assume there's nothing wrong with the raw data or the recognition config. Perhaps there's something more to it? I know that if I read binary data from a *.wav file and send it instead of audio_data, recognition works. How do I convert raw audio data to PCM WAV so that I can send it to the API?


Solution

Turns out, there are two things wrong with this code.

1. It looks like the cffi_backend_buffer objects that I put into the queue behave like pointers to a certain area of memory. If I access them right away, as I do in streaming recognition, everything works fine. However, if I collect them in a queue for later use, the buffers they point to get overwritten. The solution is to put byte strings into the queues instead (see the illustration after this list):

    def callback(self, indata, frames, time, status):
        self.queue.put(bytes(indata))
        self.queue_live.put(bytes(indata))
    
2. Asynchronous recognition requires PCM WAV files to have headers. Obviously, my raw audio data did not have them. The solution is to write the data into an in-memory *.wav file (a header sanity check follows this list); I did it the following way:
    import io
    import wave
    
    from google.cloud import speech

    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code='en-US',
        max_alternatives=1)
    
    # Collect raw audio data
    audio_data = []
    while not queue.empty():
        audio_data.append(queue.get(False))
    audio_data = b"".join(audio_data)
    
    # Convert to a PCM WAV file with headers
    file = io.BytesIO()
    with wave.open(file, mode='wb') as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(16000)
        w.writeframes(audio_data)
    file.seek(0)
    
    audio = speech.RecognitionAudio(content=file.read())
    
    response = client.recognize(config=config, audio=audio)
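
As a side note on point 1, the aliasing behaviour can be reproduced without sounddevice at all. In this hypothetical stand-in, a memoryview over a bytearray plays the role of the CFFI buffer:

# A memoryview stored "for later" sees subsequent writes to the buffer,
# while bytes() takes an independent snapshot of the samples.
buf = bytearray(b'\x01\x02')
view = memoryview(buf)          # like putting indata itself into the queue
snapshot = bytes(buf)           # like putting bytes(indata) into the queue

buf[0] = 0xFF                   # the audio backend reuses its buffer
assert view[0] == 0xFF          # the stored reference sees the overwrite
assert snapshot == b'\x01\x02'  # the copy still holds the original data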
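
And to double-check point 2, the in-memory file can be read back with the wave module to confirm the header matches the RecognitionConfig. This check is my addition, not part of the original fix:

# Re-open the in-memory WAV and verify it advertises mono, 16-bit, 16 kHz,
# i.e. exactly what the LINEAR16 config above promises the API.
file.seek(0)
with wave.open(file, mode='rb') as w:
    assert w.getnchannels() == 1
    assert w.getsampwidth() == 2      # 2 bytes per sample == 16-bit
    assert w.getframerate() == 16000
file.seek(0)                          # rewind before reading it again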