Tags: speech-recognition, ibm-cloud, speech-to-text, ibm-watson

wav vs opus: speech-to-text transcript quality


I am using IBM Watson's Speech to Text service to generate transcripts for a few telephony audio files (8 kHz). I have tried both WAV and Opus versions of the same files and haven't seen any major degradation in transcript quality with Opus. I am thinking of storing only the Opus versions to reduce storage requirements and file transfer time. In general, is it better to use WAV for higher-quality transcripts? Is there any known degradation in transcript quality when using Opus?


Solution

  • If the bitrate is high enough, Opus should not degrade recognition accuracy. Use the lowest bitrate that does not degrade accuracy; you can find it experimentally by encoding the same files at several bitrates and computing the Word Error Rate (WER) of each transcript against a reference.

    Alternatively, you can use FLAC, which is lossless and typically offers a compression factor of about 5x compared to uncompressed WAV.

    Finally, keep in mind that you do not want a sampling rate higher than 16 kHz: it won't improve recognition and will increase storage considerably.
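A minimal sketch of the WER computation suggested above, assuming transcripts are compared as lowercase, whitespace-separated words with no punctuation normalization. The function name `word_error_rate` is hypothetical and not part of any Watson SDK; WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words (starts as j insertions).
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)  # i deletions against an empty hypothesis
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


if __name__ == "__main__":
    # Hypothetical strings for illustration only, not real Watson output.
    reference = "please hold while i transfer your call"
    hypothesis = "please hold while transfer you call"
    print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

In practice you would run this over the transcripts of the WAV version and of each Opus bitrate, using a human-verified transcript as the reference, and keep the lowest bitrate whose WER does not rise noticeably above the WAV baseline.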
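The storage arithmetic behind the last two points can be sketched as follows, assuming 16-bit mono PCM (the usual telephony WAV layout) and ignoring file headers and container overhead. Uncompressed size grows linearly with sampling rate, so going from 8 kHz to 16 kHz doubles it. The helper names are hypothetical. `compression_factor` simply compares two files on disk, which lets you check the FLAC or Opus savings on your own recordings, since the actual ratio depends on the audio content.

```python
import os


def pcm_bytes(seconds: float, sample_rate_hz: int,
              bits_per_sample: int = 16, channels: int = 1) -> int:
    """Size of the raw PCM payload of a WAV file (header excluded)."""
    return round(seconds * sample_rate_hz * channels * bits_per_sample / 8)


def bitrate_bytes(seconds: float, kbps: float) -> int:
    """Approximate size of a stream at an average bitrate (container overhead ignored)."""
    return round(seconds * kbps * 1000 / 8)


def compression_factor(wav_path: str, compressed_path: str) -> float:
    """Measured ratio of WAV file size to compressed file size for one recording."""
    return os.path.getsize(wav_path) / os.path.getsize(compressed_path)


if __name__ == "__main__":
    one_hour = 3600
    for rate in (8_000, 16_000, 44_100):
        mb = pcm_bytes(one_hour, rate) / 1e6
        print(f"{rate} Hz 16-bit mono PCM: {mb:.1f} MB per hour")
    # 16 kbit/s is an arbitrary example bitrate, not a recommendation.
    print(f"16 kbit/s stream: {bitrate_bytes(one_hour, 16) / 1e6:.1f} MB per hour")
```

Because Opus encoders commonly use variable bitrate, real file sizes will vary around the `bitrate_bytes` estimate; measuring with `compression_factor` on a sample of your files gives a more reliable figure.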