
Python: setting apart speech from empty audio records


I'm trying to write a Python 3.6 script that separates empty .aif audio records (i.e. containing ambient noise only) from those that contain speech. My aim is not to recognize the speech content - firstly, it's not English, and secondly, it's not needed for my purposes.

Nonetheless, I couldn't come up with anything better than using SpeechRecognition with pocketsphinx. My idea was quite primitive:

        import os

        import speech_recognition as sr

        r = sr.Recognizer()
        emptyRecords = []
        for fname in os.listdir(TESTDIR):
            with sr.AudioFile(os.path.join(TESTDIR, fname)) as source:
                recorded = r.record(source)
            recognized = r.recognize_sphinx(recorded)
            # Count recognized words, not characters.
            if len(recognized.split()) <= 10:
                print("{} seems to be an empty record.".format(fname))
                emptyRecords.append(fname)

That is, I transcribed each recording and separated the results using an 'intuitive' threshold of 10 words, since pocketsphinx sometimes recognized background noise as a sparse sequence of interjections. However, this took an extremely long time because of the unnecessary speech recognition process - for each record I only needed to check whether it contained more than 10 words, nothing more. As far as I understand from SpeechRecognition's docs, the Recognizer class has no attribute or method that would limit the number of words to be recognized.

Could someone suggest a better idea for this issue?

Thanks in advance.


Solution

  • Try the webrtcvad library, which does voice activity detection directly, without running full speech recognition. Set the aggressiveness mode and test it against your recorded data.

    https://pypi.org/project/webrtcvad/
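A minimal sketch of that idea. The helper names (`frames`, `speech_ratio`, `FRAME_MS`) are hypothetical, not part of webrtcvad's API; webrtcvad itself only accepts 16-bit mono PCM at 8/16/32/48 kHz in 10, 20, or 30 ms frames, so the .aif files would first have to be converted to such a WAV (e.g. with ffmpeg or sox):

```python
import wave

FRAME_MS = 30  # webrtcvad accepts 10, 20 or 30 ms frames


def frames(pcm, sample_rate, frame_ms=FRAME_MS):
    """Split raw 16-bit PCM bytes into fixed-size frames for the VAD."""
    n = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample
    return [pcm[i:i + n] for i in range(0, len(pcm) - n + 1, n)]


def speech_ratio(path, aggressiveness=3):
    """Fraction of frames in a mono 16-bit WAV file that contain speech."""
    import webrtcvad  # pip install webrtcvad
    vad = webrtcvad.Vad(aggressiveness)  # 0 = least, 3 = most aggressive
    with wave.open(path, "rb") as wf:
        sample_rate = wf.getframerate()
        pcm = wf.readframes(wf.getnframes())
    voiced = [vad.is_speech(f, sample_rate) for f in frames(pcm, sample_rate)]
    return sum(voiced) / len(voiced) if voiced else 0.0
```

A record could then be flagged as empty when `speech_ratio(path)` falls below some threshold (say 0.1) chosen by testing on known-empty and known-speech samples. This avoids recognition entirely, so it runs far faster than the pocketsphinx approach.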