I want to write a program that automatically syncs unsynced subtitles. One of the solutions I thought of is to algorithmically find human speech and align the subtitles to it. The APIs I found (Google Speech API, Yandex SpeechKit) require a server (which is not very convenient for me) and (probably) do a lot of unnecessary work determining what exactly has been said, while I only need to know that something has been said.
In other words, I want to give it the audio file and get something like this:
[(00:12, 00:26), (01:45, 01:49) ... , (25:21, 26:11)]
Is there a solution (preferably in Python) that only detects human speech and runs on a local machine?
The technical term for what you are trying to do is Voice Activity Detection (VAD). There is a Python library called SPEAR that does it (among other things).
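
As a minimal sketch of the idea, here is how you could get exactly the list of (start, end) pairs you described using the `webrtcvad` package (a different VAD library, `pip install webrtcvad`; not part of SPEAR). It assumes the audio has already been converted to 16 kHz, 16-bit, mono raw PCM, e.g. with `ffmpeg -i movie.mkv -ar 16000 -ac 1 -f s16le audio.raw`:

    # Minimal VAD sketch with webrtcvad; assumes 16 kHz 16-bit mono PCM input.
    import webrtcvad

    SAMPLE_RATE = 16000   # webrtcvad accepts 8000/16000/32000/48000 Hz
    FRAME_MS = 30         # frames must be 10, 20 or 30 ms long
    FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit = 2 bytes/sample

    def speech_segments(pcm, aggressiveness=2):
        """Yield (start_sec, end_sec) for contiguous runs of speech frames."""
        vad = webrtcvad.Vad(aggressiveness)  # 0 = least, 3 = most aggressive
        start = None
        for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
            t = i / 2 / SAMPLE_RATE          # frame start time in seconds
            if vad.is_speech(pcm[i:i + FRAME_BYTES], SAMPLE_RATE):
                if start is None:
                    start = t
            elif start is not None:
                yield (start, t)
                start = None
        if start is not None:
            yield (start, len(pcm) / 2 / SAMPLE_RATE)

    with open("audio.raw", "rb") as f:          # hypothetical input file
        print(list(speech_segments(f.read())))

In practice you would want to smooth the raw frame decisions (e.g. ignore speech runs shorter than a few hundred milliseconds and merge segments separated by brief pauses) before matching the segments against subtitle timestamps.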