I want to write a program that automatically syncs unsynced subtitles. One of the solutions I thought of is to algorithmically find human speech and align the subtitles to it. The APIs I found (Google Speech API, Yandex SpeechKit) require a server (which is not very convenient for me) and (probably) do a lot of unnecessary work determining what exactly has been said, while I only need to know that something has been said.
In other words, I want to give it the audio file and get something like this:
[(00:12, 00:26), (01:45, 01:49) ... , (25:21, 26:11)]
Is there a solution (preferably in Python) that only detects human speech and runs on a local machine?
The technical term for what you are trying to do is Voice Activity Detection (VAD). There is a Python library called SPEAR that does it (among other things).
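
As a minimal sketch of the idea, here is how you could get exactly the list of (start, end) pairs you described using the `webrtcvad` package (a different VAD library, `pip install webrtcvad`; not part of SPEAR). It assumes the audio has already been converted to 16 kHz, 16-bit, mono raw PCM, e.g. with `ffmpeg -i movie.mkv -ar 16000 -ac 1 -f s16le audio.raw`:

    # Minimal VAD sketch with webrtcvad; assumes 16 kHz 16-bit mono PCM input.
    import webrtcvad

    SAMPLE_RATE = 16000   # webrtcvad accepts 8000/16000/32000/48000 Hz
    FRAME_MS = 30         # frames must be 10, 20 or 30 ms long
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit = 2 bytes/sample

    def speech_segments(pcm, aggressiveness=2):
        """Yield (start_sec, end_sec) for contiguous runs of speech frames."""
        vad = webrtcvad.Vad(aggressiveness)  # 0 = least, 3 = most aggressive
        start = None
        for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
            t = i / 2 / SAMPLE_RATE          # frame start time in seconds
            if vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE):
                if start is None:
                    start = t
            elif start is not None:
                yield (start, t)
                start = None
        if start is not None:
            yield (start, len(pcm) / 2 / SAMPLE_RATE)

    with open("audio.raw", "rb") as f:          # hypothetical input file
        print(list(speech_segments(f.read())))

In practice you would want to smooth the raw frame decisions (e.g. ignore speech runs shorter than a few hundred milliseconds and merge segments separated by brief pauses) before matching the segments against subtitle timestamps.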