I'm working on a script that sends data from a microphone to the Google Cloud Speech-to-Text API. I need to use the gRPC API to produce live readings during recording, and once the recording is completed, I need to use the REST API for more precise asynchronous recognition.
The live streaming part is working. It is based on the quickstart sample, but uses python-sounddevice instead of PyAudio. The stream below records cffi_backend_buffer objects into a queue; a separate thread collects these objects, converts them to bytes, and feeds them to the API.
import queue

import sounddevice


class MicrophoneStream:
    def __init__(self, rate, blocksize, queue_live, queue):
        self.queue = queue
        self.queue_live = queue_live
        self._audio_stream = sounddevice.RawInputStream(
            samplerate=rate,
            dtype='int16',
            callback=self.callback,
            blocksize=blocksize,
            channels=1,
        )

    def __enter__(self):
        self._audio_stream.start()
        return self

    def stop(self):
        self._audio_stream.stop()

    def __exit__(self, type, value, traceback):
        self._audio_stream.stop()
        self._audio_stream.close()

    def callback(self, indata, frames, time, status):
        self.queue.put(indata)
        self.queue_live.put(indata)
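For reference, the separate thread feeds the live queue into the API with a request generator along these lines (a minimal sketch following the quickstart; the sentinel convention and the helper name are my own):

from google.cloud import speech

def request_generator(queue_live):
    # Consumer thread: drain the live queue, convert each buffer to
    # bytes, and wrap it in a streaming request for the gRPC API.
    while True:
        chunk = queue_live.get()
        if chunk is None:  # sentinel pushed when recording stops
            return
        yield speech.StreamingRecognizeRequest(audio_content=bytes(chunk))

# responses = client.streaming_recognize(streaming_config, request_generator(queue_live))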
There is a second queue that I planned to use for asynchronous recognition once the recording is completed. However, just sending the byte string as I did with live recognition does not seem to work:
from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    max_alternatives=1)

audio_data = []
while not queue.empty():
    audio_data.append(queue.get(False))
audio_data = b"".join(audio_data)

audio = speech.RecognitionAudio(content=audio_data)
response = client.recognize(config=config, audio=audio)
Since sending byte strings of raw audio data works with streaming recognition, I assume there's nothing wrong with the raw data or the recognition config. Perhaps there's something more to it? I know that recognition works if I read binary data from a *.wav file and send that instead of audio_data. How do I convert raw audio data to PCM WAV so that I can send it to the API?
Turns out, there are two things wrong with this code.
First, the cffi_backend_buffer objects that I put into the queue behave like pointers to a certain area of memory. If I access them right away, as I do in streaming recognition, everything works fine. However, if I collect them in a queue for later use, the buffers they point to get overwritten. The solution is to put byte strings into the queues instead:

def callback(self, indata, frames, time, status):
    # bytes() copies the buffer contents before the backend reuses the memory
    self.queue.put(bytes(indata))
    self.queue_live.put(bytes(indata))
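The difference is easy to demonstrate with a plain bytearray standing in for the reused backend buffer (an illustration only, not the sounddevice API):

buf = bytearray(b'\x01\x02')
snapshot = bytes(buf)     # independent copy of the current contents
view = memoryview(buf)    # points at the same memory, like the cffi buffer
buf[0] = 0xFF             # simulate the backend overwriting the buffer
assert snapshot == b'\x01\x02'      # the copy still holds the old data
assert bytes(view) == b'\xff\x02'   # the view sees the overwrite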
Second, non-streaming recognition accepted data read from a *.wav file but not headerless raw audio; apparently recognize() wants a proper file with headers. So I wrap the collected bytes into an in-memory PCM WAV file before sending:

import io
import wave

from google.cloud import speech

client = speech.SpeechClient()
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code='en-US',
    max_alternatives=1)

# Collect raw audio data
audio_data = []
while not queue.empty():
    audio_data.append(queue.get(False))
audio_data = b"".join(audio_data)

# Convert to a PCM WAV file with headers
file = io.BytesIO()
with wave.open(file, mode='wb') as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(16000)  # matches sample_rate_hertz in the config
    w.writeframes(audio_data)
file.seek(0)

audio = speech.RecognitionAudio(content=file.read())
response = client.recognize(config=config, audio=audio)
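With the WAV-wrapped payload the request succeeds, and the transcripts come back the usual way:

for result in response.results:
    print(result.alternatives[0].transcript)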