Search code examples
rubyalgorithmpseudocodeedit-distance

Pseudocode for script to check transcription accuracy / edit distances


I need to write a script, probably in Ruby, that will take one block of text and compare a number of transcriptions of recordings of that text to the original to check for accuracy. If that's just completely confusing, I'll try explaining another way...

I have recordings of several different people reading a script that is a few sentences long. These recordings have all been transcribed back to text a number of times by other people. I need to take all of the transcriptions (hundreds) and compare them against the original script for accuracy.

I'm having trouble even conceptualising the pseudocode, and wondering if someone can point me in the right direction. Is there an established algorithm I should be considering? The Levenshtein distance has been suggested to me, but this seems like it wouldn't cope well with longer strings, considering differences in punctuation choices, whitespace, etc.--missing the first word would wreck the entire algorithm, even if every other word were perfect. I'm open to anything--thank you!

Edit:

Thanks for the tips, psyho. One of my biggest concerns, however, is a situation like this:

Original Text:

I would've taken that course if I'd known it was available!

Transcription

I would have taken that course if I'd known it was available!

Even with a word-wise comparison of tokens, this transcription will be marked as quite errant, even though it's almost perfect, and this is hardly an edge-case! "would've" and "would have" are commonly pronounced extremely similarly, especially in this part of the world. Is there a way to make the approach you suggest robust enough to deal with this? I've thought about running a word-wise comparison both forward and backward and building a sort of composite score, but this would fall apart with a transcription like this:

I would have taken that course if I had known it was available!

Any ideas?


Solution

  • After experimenting with the issues I noted in this question, I found that the Levenshtein Distance actually takes these problems into account. I don't fully understand how or why, but can see after experimentation that this is the case.