Tags: speech-recognition, training-data

How is the data used for speech recognition collected and prepared?


As far as I can tell, most speech recognition implementations rely on binary files that contain acoustic models of the language they are trying to 'recognize'.

So how do people compile these models?

One could transcribe lots of speech manually, but that takes a lot of time. Even then, when given an audio file containing some speech and a full transcription of it in a text file, the individual word pronunciations still need to be separated somehow. To match up which parts of the audio correspond to which parts of the text, one already needs speech recognition.

How is this data gathered? If one is handed thousands of hours' worth of audio files and their full transcriptions (disregarding the problem of having to transcribe them manually), how can the audio be split up at exactly the points where one word ends and another begins? Wouldn't the software producing these acoustic models already have to be capable of speech recognition?


Solution

  • So how do people compile these models?

    You can learn about the process by going through the CMUSphinx acoustic model training tutorial.
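
    As a rough illustration of the data layout that tutorial walks you through, the sketch below writes the two listing files (fileids and transcription) in the format it describes; the database name and the utterances are invented for illustration.

        # A minimal sketch of the listing files that sphinxtrain expects,
        # following the layout described in the CMUSphinx training tutorial:
        #   etc/<db>_train.fileids        one audio file id per line
        #                                 (path relative to wav/, no extension)
        #   etc/<db>_train.transcription  "<s> text </s> (file_id)" per utterance
        # The database name and utterances here are made up for illustration.
        import os

        utterances = [
            ("utt_0001", "hello world"),
            ("utt_0002", "how are you"),
        ]

        db = "mydb"  # hypothetical database name
        os.makedirs("etc", exist_ok=True)

        with open(f"etc/{db}_train.fileids", "w") as fileids, \
             open(f"etc/{db}_train.transcription", "w") as trans:
            for file_id, text in utterances:
                fileids.write(file_id + "\n")
                trans.write(f"<s> {text} </s> ({file_id})\n")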

    One could transcribe lots of speech manually, but that takes a lot of time.

    This is correct: model preparation takes a lot of time, and speech is transcribed manually. You can also take already transcribed speech, such as movies with subtitles, transcribed lectures, or audiobooks, and use it for training.
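
    For instance, if you start from a movie with subtitles, a small script along these lines (the file name is hypothetical, standard SRT timestamps are assumed) turns the subtitle file into timed text segments that can be paired with the audio track:

        # Sketch: turn a subtitle file (.srt) into (start_sec, end_sec, text)
        # segments that can later be paired with the movie's audio track.
        # SRT timestamps look like 00:01:02,500 --> 00:01:05,000.
        import re

        TIME = r"(\d+):(\d+):(\d+),(\d+)"

        def to_seconds(h, m, s, ms):
            return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

        def parse_srt(path):
            segments = []
            with open(path, encoding="utf-8") as f:
                blocks = f.read().split("\n\n")
            for block in blocks:
                lines = [l.strip() for l in block.strip().splitlines()]
                if len(lines) < 3:
                    continue
                m = re.match(TIME + r"\s*-->\s*" + TIME, lines[1])
                if not m:
                    continue
                start = to_seconds(*m.groups()[:4])
                end = to_seconds(*m.groups()[4:])
                text = " ".join(lines[2:])
                segments.append((start, end, text))
            return segments

        # Example (file name invented):
        # parse_srt("movie.srt") -> [(62.5, 65.0, "Hello there."), ...]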

    Even then, when given an audio file containing some speech and a full transcription of it in a text file, the individual word pronunciations still need to be separated somehow. To match up which parts of the audio correspond to which parts of the text, one already needs speech recognition.

    You need to split the speech into sentences of 5-20 seconds, not into words. Speech recognition training learns the model from sentences, called utterances, and can segment them into words automatically. This segmentation is done in an unsupervised way; essentially it is clustering, so it does not require the system to already recognize the speech. It just detects chunks of similar structure within the utterance and assigns them to phones. This makes training much easier than if you trained on separate words.
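
    For the first step, cutting a long recording into utterances of a suitable length, something as simple as an energy-based cut at silent points is usually enough. Below is a rough sketch of that idea (the thresholds are invented and a 16 kHz, 16-bit mono WAV file is assumed); the word- and phone-level segmentation described above is then left to the trainer itself.

        # Rough sketch: cut a long recording into utterances of roughly
        # 5-20 seconds at low-energy (silent) frames. Assumes a 16-bit
        # mono WAV file; the thresholds are made up for illustration.
        import wave
        import numpy as np

        def split_utterances(path, frame_ms=30, silence_ratio=0.1,
                             min_len=5.0, max_len=20.0):
            with wave.open(path, "rb") as w:
                rate = w.getframerate()
                samples = np.frombuffer(w.readframes(w.getnframes()),
                                        dtype=np.int16)

            frame = int(rate * frame_ms / 1000)
            energies = [np.abs(samples[i:i + frame].astype(np.float32)).mean()
                        for i in range(0, len(samples) - frame, frame)]
            threshold = silence_ratio * max(energies)

            cuts, start = [], 0.0
            for i, e in enumerate(energies):
                t = i * frame_ms / 1000.0
                # cut at a silent frame once the chunk is long enough,
                # or force a cut when it gets too long
                if (e < threshold and t - start >= min_len) or t - start >= max_len:
                    cuts.append((start, t))
                    start = t
            cuts.append((start, len(samples) / rate))
            return cuts  # list of (start_sec, end_sec) utterance boundaries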

    How is this data gathered? If one is handed thousands of hours' worth of audio files and their full transcriptions (disregarding the problem of having to transcribe them manually), how can the audio be split up at exactly the points where one word ends and another begins? Wouldn't the software producing these acoustic models already have to be capable of speech recognition?

    You need to initialize the system from a manually transcribed recording database of about 50-100 hours. You can read about examples here. For many popular languages like English, French, German, and Russian, such databases already exist. For some others they are in progress in the dedicated resource.

    Once you have an initial database, you can take a large set of videos and segment them using the existing model. That helps to create databases of thousands of hours. For example, such a database was trained from TED talks; you can read about it here.
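
    The segmentation idea can be sketched roughly as follows: decode the long recording with the existing model, align the hypothesis words against the known transcript, and keep only the stretches where the two agree, using the decoder's timestamps as utterance boundaries. The decoding step is not shown here; hyp is a stand-in for its output as (word, start, end) tuples.

        # Sketch of keeping only the stretches where the decoder's output
        # agrees with the known transcript. Matching stretches, with the
        # decoder's timestamps, become new automatically verified
        # training utterances. `hyp` stands for the decoder output as
        # (word, start_sec, end_sec) tuples; the decoding itself is not shown.
        from difflib import SequenceMatcher

        def trusted_segments(hyp, transcript_words, min_words=5):
            hyp_words = [w for w, _, _ in hyp]
            matcher = SequenceMatcher(a=hyp_words, b=transcript_words,
                                      autojunk=False)
            segments = []
            for block in matcher.get_matching_blocks():
                if block.size < min_words:  # ignore short accidental matches
                    continue
                words = hyp[block.a:block.a + block.size]
                start, end = words[0][1], words[-1][2]
                text = " ".join(transcript_words[block.b:block.b + block.size])
                segments.append((start, end, text))
            return segments  # (start_sec, end_sec, text) chunks safe to keep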