Tags: speech-recognition · speech-to-text · speech · automatic-speech-recognition

Why is Word Information Lost (WIL) calculated the way it is?


Word Information Lost (WIL) is a measure of the performance of an automatic speech recognition (ASR) service (e.g. AWS Transcribe, Google Speech-to-Text, etc.) against a gold standard (usually human-generated) transcript, and is generally considered a more sophisticated measure than Word Error Rate (WER).

The formula for WIL is as follows:

WIL = 1 − (H / N) × (H / P) = 1 − H² / (N × P)

...where:

  • H = hits (matching words between the gold standard and ASR transcripts)
  • N = total words in the gold standard transcript
  • P = total words in the ASR transcript
  • S = substitutions (one word replaced with another)
  • D = deletions (a word in the gold standard transcript not present in the ASR one)
  • I = insertions (a word in the ASR transcript not present in the gold standard one)

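For concreteness, here is how I read the formula in code (a minimal, self-contained sketch; the counts are invented):

```python
def wil(h: int, s: int, d: int, i: int) -> float:
    """Word Information Lost: 1 - (H/N) * (H/P) = 1 - H^2 / (N*P)."""
    n = h + s + d  # total words in the gold standard transcript
    p = h + s + i  # total words in the ASR transcript
    return 1.0 - (h / n) * (h / p)

# Hypothetical counts: 90 hits, 5 substitutions, 5 deletions, 2 insertions
print(wil(90, 5, 5, 2))  # ~0.165
```
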
My question is: why is it calculated this way?

I'm not grasping what exactly WIL is supposed to represent here, especially its quadratic nature (the squared H in the numerator and the N × P product in the denominator).

It seems like a simpler, more immediately understandable version could be something like:

[image of the proposed simpler formula, not preserved]

How would you describe what WIL means, in layman's terms?


Solution

  • The Word Information Lost (WIL) metric is an approximation of an earlier metric used in the early 2000s to measure the accuracy of automatic speech recognition (ASR) systems: the Relative Information Loss (RIL) metric.

    RIL depends on knowing the "similarity" between the words that were inserted, deleted or substituted, a quantity often called "mutual information". For example, if the ASR system transcribes "piece" instead of "pieced", there is likely to be a high degree of mutual information. But if the ASR transcribes, say, "peace" instead of "piece", the mutual information is lower. Similarly, "bred" for "bread" would carry less mutual information.

    At the heart of RIL is a desire to "rank" transcription errors so that some types of errors matter less to the overall accuracy score of the ASR system. This mirrors how human listening works; we are better able to "fill the gaps" when we mishear or misunderstand speech because we have a much better grasp of context - we know it's "bread" if we're speaking about breakfast, and we know it's "piece" rather than "peace" if we're talking about pie.

    WIL, by contrast, does not depend on knowing the statistical relationship between the words that were "hit" and those "inserted", "deleted" or "substituted". WIL tries to approximate RIL by "weighting the hits" and "weighting the misses". What I would expect to see with WIL is that it "dampens out" extremely low and extremely high WER, but raises the score where multiple insertions, deletions or substitutions are made (see the sketch at the end of this answer). This allows researchers to target efforts in re-training or fine-tuning the ASR system on particular words, phrases or n-grams.

    A particularly interesting application of this idea appears in Whisper:

    Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July). Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (pp. 28492-28518). PMLR.

    The researchers behind Whisper realised that small mistakes in transcription (periods, single letters) can unfairly inflate the WER, so they essentially ignored these types of errors. They are using a form of WIL, without calling it WIL.
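
    To make the claimed dampening behaviour concrete, here is a small self-contained sketch (all counts are invented for illustration) comparing WER and WIL across a few hypothetical error profiles:

    ```python
    # WER = (S + D + I) / N ; WIL = 1 - (H/N) * (H/P),
    # where N = H + S + D (reference length) and P = H + S + I (hypothesis length).

    def wer(h, s, d, i):
        return (s + d + i) / (h + s + d)

    def wil(h, s, d, i):
        n, p = h + s + d, h + s + i
        return 1.0 - (h / n) * (h / p)

    profiles = {                          # (H, S, D, I), all hypothetical
        "few errors":      (95, 2, 3, 1),
        "insertion-heavy": (50, 30, 20, 80),
        "mostly wrong":    (10, 60, 30, 10),
    }

    for name, counts in profiles.items():
        print(f"{name:16s} WER={wer(*counts):.3f}  WIL={wil(*counts):.3f}")
    ```

    Note that WER exceeds 1.0 for the insertion-heavy profile (1.300) while WIL stays bounded in [0, 1] (0.844 there); that boundedness is one sense in which WIL "dampens" extreme scores.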