I'm trying to write a Python 3.6 script that separates empty .aif audio recordings (i.e. containing only ambient noise) from those that contain speech. My aim is not to recognize the speech content: first, it's not in English, and second, it's not needed for my purposes.
Nonetheless, I couldn't come up with anything better than using SpeechRecognition with pocketsphinx. My idea was quite primitive:
import os
import speech_recognition as sr

r = sr.Recognizer()
emptyRecords = []
for fname in os.listdir(TESTDIR):
    with sr.AudioFile(TESTDIR + fname) as source:
        recorded = r.record(source)
        # full speech-to-text pass, used only to count recognized words
        recognized = r.recognize_sphinx(recorded)
        if len(recognized.split()) <= 10:
            print("{} seems to be an empty record.".format(fname))
            emptyRecords.append(fname)
That is, I transcribed each recording and classified it against an 'intuitive' threshold of 10 words, since pocketsphinx sometimes recognized background noise as a sparse sequence of interjections. However, this took an extremely long time because of the unnecessary full speech-recognition pass: for each record I only needed to check whether it contained more than 10 words, nothing more. As far as I understand from SpeechRecognition's docs, the Recognizer class has no attribute or method that would limit the number of words to be recognized.
Could someone suggest a better idea for this issue?
Thanks in advance.
Try the webrtcvad library. It performs voice activity detection (VAD) directly on the audio, without running full recognition, so it's much faster. Set the aggressiveness mode and test it with your recorded data.
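A minimal sketch of how this could look. Note webrtcvad's constraints: it only accepts 16-bit mono PCM at 8000/16000/32000/48000 Hz, sliced into frames of 10, 20, or 30 ms, so you may need to convert your files first. The `speech_ratio` helper and the 0.1 cutoff below are my own illustrative choices, not part of the library; tune the threshold on your data.

```python
def frame_generator(pcm, sample_rate, frame_ms=30):
    """Split raw 16-bit mono PCM bytes into fixed-size frames (partial tail dropped)."""
    frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per sample
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]

def speech_ratio(pcm, sample_rate, aggressiveness=3):
    """Fraction of frames webrtcvad flags as voiced (0.0 = no speech detected)."""
    import webrtcvad  # third-party: pip install webrtcvad
    vad = webrtcvad.Vad(aggressiveness)  # 0 (least aggressive) .. 3 (most)
    frames = list(frame_generator(pcm, sample_rate))
    if not frames:
        return 0.0
    voiced = sum(vad.is_speech(f, sample_rate) for f in frames)
    return voiced / len(frames)

# Hypothetical usage with your setup (TESTDIR, .aif files). AIFF stores
# big-endian samples, while webrtcvad expects little-endian, hence the swap:
# import aifc, audioop
# with aifc.open(TESTDIR + fname) as f:
#     rate = f.getframerate()
#     pcm = audioop.byteswap(f.readframes(f.getnframes()), 2)
# if speech_ratio(pcm, rate) < 0.1:  # threshold to tune on your recordings
#     emptyRecords.append(fname)
```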