Tags: javascript, audio, google-speech-api, ogg, opus

Splitting an Ogg Opus File stream


I am trying to send an OGG_OPUS encoded stream to Google's Speech-to-Text streaming service. Since Google imposes a time limit on its streaming requests, I have to reroute the audio stream to a new Google Speech-to-Text streaming session at a fixed interval.

From what I've read, the pages in an OGG stream cannot be read independently, since the data in each page is calculated from the data of the previous and next pages. If that is the case, can we cut off the stream at a certain point and create a brand-new stream from the remaining data? Simply stopping at a certain point and sending the remaining data in a new stream doesn't work, because the initial OGG header packets are not available in the second stream.

I know that this issue can be solved using PCM data: since it's not encoded, a PCM stream can simply be split at any point and turned into a new stream. However, I cannot use a PCM stream because of its high bitrate, and I'd rather not use lossless quality since I'm only transferring a voice data stream.

Refs: https://www.rfc-editor.org/rfc/rfc7845#section-3


Solution

  • OpusFileSplitter can split Opus audio files.

    Ogg pages can be read independently as long as the file starts with the Beginning of Stream (BOS) header page and the comment page. You can split one Ogg file into multiple files by creating new files that each start with the Ogg header page, followed by the Ogg data/audio pages. For example, this Ogg Opus file:

    *********************************************************
    *          *              *              *              *
    *  Header  *  Audio Data  *  Audio Data  *  Audio Data  *
    *   Page   *    Page 1    *    Page 2    *    Page 3    *
    *          *              *              *              *
    *********************************************************
    

    Could be split into 2 files:

    ***************************
    *          *              *
    *  Header  *  Audio Data  *
    *   Page   *    Page 1    *
    *          *              *
    ***************************
    
    ******************************************
    *          *              *              *
    *  Header  *  Audio Data  *  Audio Data  *
    *   Page   *    Page 2    *    Page 3    *
    *          *              *              *
    ******************************************
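
The split pictured above can be sketched in JavaScript (Node.js) as follows. This is only a minimal illustration, not the OpusFileSplitter code: `splitOggPages`, `headerPages`, `audioPages` and `cutPoints` are hypothetical names, and the sketch assumes the pages have already been separated into one Buffer per page. It copies pages byte for byte, so page sequence numbers and granule positions keep their original values; whether a given decoder or the Speech-to-Text service tolerates that is not verified here.

```javascript
// Sketch only: build several Ogg byte streams that each start with the
// same header pages. All names here are hypothetical.
//
// headerPages: Buffer[] - the header page(s), one Buffer per page
// audioPages:  Buffer[] - the audio data pages, one Buffer per page
// cutPoints:   number[] - ascending indices into audioPages where a new
//                         stream should begin (e.g. [1] = cut after page 1)
function splitOggPages(headerPages, audioPages, cutPoints) {
  const bounds = [0, ...cutPoints, audioPages.length];
  const streams = [];
  for (let i = 0; i + 1 < bounds.length; i++) {
    const group = audioPages.slice(bounds[i], bounds[i + 1]);
    if (group.length === 0) continue; // skip empty ranges
    // Every output starts with the header pages, then its share of audio.
    streams.push(Buffer.concat([...headerPages, ...group]));
  }
  return streams;
}

// Placeholder bytes stand in for real pages, just to show the grouping.
const demoHeader = [Buffer.from('H')];
const demoAudio = [Buffer.from('1'), Buffer.from('2'), Buffer.from('3')];
const demoParts = splitOggPages(demoHeader, demoAudio, [1]);
console.log(demoParts.map((p) => p.toString())); // [ 'H1', 'H23' ]
```

If a consumer turns out to be strict about page numbering, the sequence number field in each copied page would have to be rewritten, which also means recomputing that page's CRC; the sketch deliberately leaves that out.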
    

    You're correct that audio segments can be split and span multiple pages. I'm assuming that a few milliseconds could be lost if a page contains incomplete audio segments, but that should not disrupt speech recognition. Unfortunately, my local tests used Opus files generated by the opusenc utility, which didn't split segments across pages, so that case wasn't exercised, although it is convenient for splitting files!
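
Whether a given page holds incomplete audio data can be checked from its header. In the Ogg page layout (RFC 3533), bit 0x01 of the header type byte marks a page whose first packet continues from the previous page, and a final lacing value of 255 in the segment table means the last packet continues on the next page. A minimal sketch, assuming `page` is a Buffer holding exactly one complete Ogg page (`pageContinuity` is a hypothetical helper name):

```javascript
// Sketch only: report whether a single Ogg page starts or ends in the
// middle of a packet. `page` must contain exactly one complete page.
function pageContinuity(page) {
  const headerType = page[5];    // flags: 0x01 continued, 0x02 BOS, 0x04 EOS
  const segmentCount = page[26]; // number of entries in the segment table
  // The segment table follows the 27-byte fixed header.
  const lastLacing = segmentCount > 0 ? page[27 + segmentCount - 1] : 0;
  return {
    // First packet on this page is the tail of a packet begun earlier.
    continuesFromPrevious: (headerType & 0x01) !== 0,
    // A lacing value of 255 at the very end means the packet is unfinished.
    continuesOnNext: segmentCount > 0 && lastLacing === 255,
  };
}
```

A cut is cleanest between two pages where the earlier one reports `continuesOnNext: false` and the later one reports `continuesFromPrevious: false`.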

    OpusFileSplitter.scanPages() shows how to find the page boundaries.
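
For readers working in JavaScript, the same idea of finding page boundaries can be sketched as below. This is not a port of OpusFileSplitter.scanPages(); it is a minimal illustration based on the page layout in RFC 3533 (a 4-byte `OggS` capture pattern, a 27-byte fixed header, then a segment table whose values sum to the body size). The function names are hypothetical, and the sketch assumes `data` starts exactly on a page boundary.

```javascript
// Sketch only: locate Ogg page boundaries inside a Buffer.
const OGG_CAPTURE = Buffer.from('OggS', 'ascii');
const OGG_FIXED_HEADER = 27; // bytes before the segment table

function scanPages(data) {
  const pages = [];
  let offset = 0;
  while (offset + OGG_FIXED_HEADER <= data.length) {
    if (!data.subarray(offset, offset + 4).equals(OGG_CAPTURE)) {
      throw new Error(`No OggS capture pattern at offset ${offset}`);
    }
    const segmentCount = data[offset + 26];
    const tableEnd = offset + OGG_FIXED_HEADER + segmentCount;
    if (tableEnd > data.length) break; // header is truncated, stop here
    // The body size is the sum of all lacing values in the segment table.
    let bodySize = 0;
    for (let i = offset + OGG_FIXED_HEADER; i < tableEnd; i++) {
      bodySize += data[i];
    }
    const end = tableEnd + bodySize;
    if (end > data.length) break; // body is truncated, stop here
    pages.push({ start: offset, end, headerType: data[offset + 5] });
    offset = end;
  }
  return pages;
}

// Convenience wrapper: return one Buffer per complete page.
function extractPages(data) {
  return scanPages(data).map((p) => data.subarray(p.start, p.end));
}
```

On a live stream, a trailing partial page is simply left out of the result, so the caller can keep those bytes and retry once more data arrives.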