Tags: algorithm, stream, compression, deflate, lzw

Interleaving/multiplexing compressed streams


I'm looking for a good compression algorithm or library that will let me interleave several compressed streams of data into a single stream of data, without performance or compression loss.

More context: I have been working on a compression format dedicated to a specific application. This compression format performs a bunch of domain-specific analysis on the data, then outputs the compressed data as a bunch of streams, each of which is compressed (typically with LZW, but that's not written in stone). One of these streams, the primary stream, contains a bunch of tokens. Each token from the primary stream contains the information I need to deduce which of the secondary streams contains my next token, how many decompressed bytes I need to read from that secondary stream, and what I need to do with this token.

So far, so good, but I don't want to ship my final data as a bunch of concatenated streams, because that would require me to receive the entire data before I can process it meaningfully. Rather, I'd like to ship them as a single stream that I can decode and process as I receive it. So I would basically like to send something along the lines of:

  • one token from the primary stream (compressed with the dictionary used for the primary stream);
    • ok, this is token "FooBar", I know that a token "FooBar" is always followed by two tokens from stream "Foo", one token from stream "Bar";
  • two tokens from stream "Foo" (compressed with the dictionary used for stream "Foo");
  • one token from stream "Bar" (compressed with the dictionary used for stream "Bar");
  • one token from the primary stream (back to the compression of the primary stream);
  • ...
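
To make the intended read order concrete, here is a minimal sketch of the decoding loop over already-decompressed data. The `GRAMMAR` table, the fixed `PRIMARY_TOKEN_SIZE`, and the byte counts are all hypothetical, chosen only for illustration; the real format derives them from its own token definitions.

```python
import io

# Hypothetical grammar: each primary token lists the secondary reads that
# follow it, as (stream name, number of decompressed bytes to read).
GRAMMAR = {
    b"FooBar": [("Foo", 4), ("Foo", 4), ("Bar", 2)],
}
PRIMARY_TOKEN_SIZE = 6  # assumed fixed size, for illustration only


def decode(primary, secondary):
    """Yield (source, payload) pairs in the order the decoder consumes them."""
    while True:
        token = primary.read(PRIMARY_TOKEN_SIZE)
        if not token:
            break  # primary stream exhausted
        yield ("primary", token)
        # The primary token dictates which secondary streams to read next.
        for name, size in GRAMMAR[token]:
            yield (name, secondary[name].read(size))


primary = io.BytesIO(b"FooBarFooBar")
secondary = {
    "Foo": io.BytesIO(b"f001f002f003f004"),
    "Bar": io.BytesIO(b"b1b2"),
}
events = list(decode(primary, secondary))
```

The point of the sketch is that the decoder never needs to be told which stream comes next: the primary token already says so. The open problem is doing this when each of those streams is compressed separately.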

The complication here is that no interesting compression algorithm is going to trivially translate each token into one or more bytes with a well-defined token end. Sometimes a single token will span several packets of compressed output. Sometimes one packet will contain many tokens. Luckily for me, the size of my tokens is easy to predict. On the other hand, for space efficiency, I can't afford to write the size of each packet or the number of packets whenever I add one.

So, how can I multiplex/interleave all my compressed streams into a single stream without needing to add lots of metadata? My impression is that this is basically the kind of issue solved by multimedia formats, but I have zero domain knowledge on the topic. Any suggestions? I'm interested in algorithms, libraries and papers.


Solution

  • With, for example, zlib, you can have three instances of deflate running at the same time for your three streams. With deflate you can compress a deflate block at a time (using Z_BLOCK), and bring that to a byte boundary with an empty stored block using Z_SYNC_FLUSH. You can interleave these deflate blocks as they are produced with a one-byte header for each identifying which of the three streams it is from. Then your decompressor reads in these deflate blocks and decompresses them with three instances of inflate, pulling your tokens from the respective blocks of uncompressed data as it becomes available.
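
A minimal sketch of this scheme using Python's `zlib` module, under stated assumptions. The `mux` and `demux` names and the packet layout are hypothetical. One deliberate deviation: each packet here carries a 4-byte length after the one-byte stream id, because Python's `zlib` binding does not let the decompressor stop at deflate block boundaries the way C's `inflate` with `Z_BLOCK` does. A C implementation following the answer could drop the length field.

```python
import struct
import zlib

HEADER = struct.Struct(">BI")  # assumed layout: stream id (1 byte), packet length (4 bytes)


def mux(chunks, n_streams):
    """Interleave (stream_id, data) chunks into one byte string."""
    compressors = [zlib.compressobj() for _ in range(n_streams)]
    out = bytearray()
    for stream_id, data in chunks:
        comp = compressors[stream_id]
        # Z_SYNC_FLUSH ends the current deflate block and pads to a byte
        # boundary, so everything written so far can be decompressed.
        packet = comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH)
        out += HEADER.pack(stream_id, len(packet)) + packet
    return bytes(out)


def demux(blob, n_streams):
    """Yield (stream_id, data) pairs, one per packet, in arrival order."""
    decompressors = [zlib.decompressobj() for _ in range(n_streams)]
    pos = 0
    while pos < len(blob):
        stream_id, size = HEADER.unpack_from(blob, pos)
        pos += HEADER.size
        packet = blob[pos:pos + size]
        pos += size
        # Each stream keeps its own inflate state (and history window)
        # across packets, so later packets can reference earlier data.
        yield stream_id, decompressors[stream_id].decompress(packet)


chunks = [
    (0, b"FooBar"),
    (1, b"foo-token-1foo-token-2"),
    (2, b"bar-token"),
    (0, b"FooBar"),
]
blob = mux(chunks, 3)
decoded = list(demux(blob, 3))
```

Every flush costs a few bytes of overhead (the empty stored block), so flushing after each small write, as this example does for readability, would hurt the compression ratio. A real encoder would presumably buffer many tokens per stream and flush only when the decoder is about to need data from that stream.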