Tags: audio, raspberry-pi, audio-recording, sox, voice-recording

Using sox for voice detection and streaming


Currently, I use sox like this:

sox -d -e u-law --endian little -b 8 -c 1 -r 8000 -t ul - silence 1 0.3 1% 1 0.3 1%

For reference, this records audio from the default microphone and outputs single-channel, little-endian, u-law audio at 8 bits per sample and an 8 kHz sample rate. The silence effect discards audio until the level stays above the 1% threshold for 0.3 seconds, then keeps recording until there is 0.3 seconds of silence. All of this goes to stdout, which I stream to a remote server.
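For readers less familiar with the silence effect, its six arguments can be spelled out as named shell variables. This is only an annotated restatement of the command above; the variable names are illustrative and not part of sox, and the command is printed rather than executed.

```shell
#!/bin/sh
# Arguments of the silence effect, named for readability (names are illustrative).
START_PERIODS=1       # start output after one period of sound is detected
START_DURATION=0.3    # sound must last this many seconds to count
START_THRESHOLD=1%    # level above which audio counts as sound
STOP_PERIODS=1        # stop after one period of silence
STOP_DURATION=0.3     # silence must last this many seconds
STOP_THRESHOLD=1%     # level below which audio counts as silence

# Assemble the same command as above; it is printed here, not run.
SOX_CMD="sox -d -e u-law --endian little -b 8 -c 1 -r 8000 -t ul - silence $START_PERIODS $START_DURATION $START_THRESHOLD $STOP_PERIODS $STOP_DURATION $STOP_THRESHOLD"
echo "$SOX_CMD"
```

Keeping the start and stop values separate makes it easy to try, for example, a longer stop duration without touching the start trigger.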

I use all of this to record a short stretch of speech and stop when I am done speaking. Specialized hardware triggers sox to start the recording. I can switch to almost any audio format or codec as long as it supports on-the-fly encoding. My target platform is Raspbian on the Raspberry Pi 2 Model B.

My ideal solution would be to use the vad (voice activity detection) effect to stop the recording when the user has finished speaking. My hope is that this would work even with background chatter. However, the sox documentation on the vad effect states this:

The use of the norm effect is recommended, but remember that neither reverse nor norm is suitable for use with streamed audio.

I haven't been able to piece parameters together to get vad and streaming working. Is it possible to use the vad effect to stop the recording of audio while still maintaining the stdin->sox->stdout piping? Are there better alternatives?


Solution

  • Is it possible to use the vad effect to stop the recording of audio while still maintaining the stdin->sox->stdout piping?

    No. The vad effect trims silence only from the front of the audio, so it can detect where speech starts but not pauses or where it ends.

    The reverse and norm effects need all of the input data before they produce any output, which is why they cannot be used with streaming.

    The key is to choose a threshold for the silence effect that is high enough for "background chatter" to count as silence.

    You could also apply noisered (with a profile built from previous recordings) before silence so that noise is less likely to trigger the recording, but this also alters the output and probably will not treat "background chatter" as noise.
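The two suggestions above (a higher silence threshold, and noisered ahead of silence) might be combined roughly as follows. This is an untested sketch: the 3% threshold, the 0.21 noisered amount, and the file names previous.wav and noise.prof are all placeholders to tune or replace, and it assumes the profile recording has the same channel count and sample rate as the audio reaching noisered. The commands are printed rather than executed so they can be inspected first.

```shell
#!/bin/sh
# Hypothetical starting values; tune by ear for the actual room and microphone.
THRESHOLD=3%              # raised from 1% so quiet chatter counts as silence
DURATION=0.3              # seconds, as in the question
NOISE_PROFILE=noise.prof  # placeholder file name for the noise profile
NOISE_AMOUNT=0.21         # noisered amount (0 to 1); higher removes more

# Step 1: build a noise profile from an earlier recording of background noise only.
profile_cmd() {
  echo "sox $1 -n noiseprof $NOISE_PROFILE"
}

# Step 2: record with noise reduction applied before the silence effect.
record_cmd() {
  echo "sox -d -e u-law --endian little -b 8 -c 1 -r 8000 -t ul - noisered $NOISE_PROFILE $NOISE_AMOUNT silence 1 $DURATION $THRESHOLD 1 $DURATION $THRESHOLD"
}

profile_cmd previous.wav
record_cmd
```

Because noisered sits before silence in the effects chain, the threshold is applied to the cleaned signal, but the streamed output is the cleaned signal too, which is the side effect noted above.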