I want to add timestamps to the sentences of a book so that they match the corresponding audiobook, ideally in various languages.
Here's an example:
Pride and Prejudice
text from Project Gutenberg
audio from LibriVox
My idea was to find a speech recognition tool that puts timestamps on sentences (step 1), and then map the messy transcription to the original text using Levenshtein distance (step 2).
The website https://speechlogger.appspot.com/ offers a solution to step 1, but it limits the number of characters it outputs. I could theoretically use web automation to get the job done by starting a new recording every minute or so, but that is a really dirty approach.
I scripted step 2 in R and tested it on a sample I got from speechlogger. It works reasonably well, but it could be greatly improved if the program knew the text in advance, as when you read a known passage aloud to train speech recognition software. By transcribing first, I'm not using all the information I have.
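To make step 2 concrete, here is a rough Python sketch of the idea (not the R script mentioned above). It assumes the recognizer yields chunks as (start time in seconds, text) pairs and that the book is already split into sentences; the function names and the `window` parameter are made up for illustration. Each chunk is matched to the nearby sentence with the smallest normalized Levenshtein distance, searching forward only so the mapping stays in reading order.

```python
import re


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text):
    """Lowercase and drop punctuation so only the words are compared."""
    return " ".join(re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split())


def map_chunks(chunks, sentences, window=5):
    """Map each (start_seconds, text) chunk to the index of the closest sentence.

    Only sentences from the last match up to `window` sentences ahead are
    considered (an assumption that the transcript follows the book in order).
    """
    if not sentences:
        return []
    results = []
    position = 0
    for start, text in chunks:
        heard = normalize(text)

        def score(k):
            target = normalize(sentences[k])
            return levenshtein(heard, target) / max(len(heard), len(target), 1)

        candidates = range(position, min(position + window, len(sentences)))
        best = min(candidates, key=score)
        results.append((start, best))
        position = best  # several chunks may belong to the same sentence
    return results
```

The `window` limit keeps the search cheap and avoids jumping to a similar sentence far away in the book, at the cost of not recovering if the transcript skips more than `window` sentences.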
So my questions are: what alternative ways are there to timestamp audio files, and is there a way to make my process smarter by letting the recognition engine know what it is supposed to recognize?
Several good software packages have been developed for this task (usually called forced alignment), with varying levels of accuracy:
Gentle - Kaldi-based aligner, works as a service.
Older implementations:
Aligner Demo in Sphinx4 - CMUSphinx toolkit in Java
SAIL align - HTK-based aligner, implemented as a fairly large set of Perl scripts.
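Since these aligners typically report timings per word, sentence timestamps still have to be derived from the word timings. The Python sketch below shows one way to do that, assuming output shaped like Gentle's JSON: a "transcript" string plus a "words" list whose aligned entries have "case": "success", "start" and "end" in seconds, and "startOffset" as a character offset into the transcript. Treat those field names as an assumption to check against the actual output; the helper names are hypothetical.

```python
import re


def sentence_spans(transcript):
    """Return (begin, end, text) character spans for naively split sentences."""
    spans = []
    for match in re.finditer(r"[^.!?]+[.!?]*", transcript):
        text = match.group().strip()
        if text:
            spans.append((match.start(), match.end(), text))
    return spans


def sentence_times(alignment):
    """Return (sentence, start_seconds, end_seconds) for each sentence.

    Sentences with no successfully aligned word get (sentence, None, None).
    """
    # Keep only the words the aligner actually located in the audio.
    words = [w for w in alignment["words"] if w.get("case") == "success"]
    timed = []
    for begin, end, text in sentence_spans(alignment["transcript"]):
        inside = [w for w in words if begin <= w["startOffset"] < end]
        if inside:
            timed.append((text,
                          min(w["start"] for w in inside),
                          max(w["end"] for w in inside)))
        else:
            timed.append((text, None, None))
    return timed
```

The regex split is deliberately simple and will break on abbreviations such as "Mr.", so for real books a proper sentence tokenizer would be a better choice.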