Tags: java, parallel-processing, zip, distributed, deflate

Compress file with zip algorithm in Java on multiple hosts


My problem is zip compression. I have to split a file into parts, compress them in parallel, then join the parts in the correct order and save the result as a zip archive containing one file. Splitting the file and sending the parts to hosts isn't a problem; I'm using jpvm. My question is: how do I split the compression? I've tried using java.util.zip.Deflater to compress each part (the result is a byte array) and then writing the results into one ZipOutputStream, but this doesn't seem to work: when saving to the file, it compresses the already compressed bytes once more. Do I have to compress each part with Deflater and then manually add a zip header, a checksum, or something like that? Does Deflater add any headers? I appreciate any help, thank you!


Solution

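  • You need to use the nowrap option of Deflater, i.e. new Deflater(level, true), to produce a raw deflate stream with no headers or trailers. Without nowrap, Deflater wraps its output in a zlib header and an Adler-32 trailer, which do not belong in a zip entry. Then you will need to wrap that raw deflate stream with the appropriate zip headers and trailers yourself.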

    To create a single deflate stream on multiple processors, you need to be able to flush the compressed output to a byte boundary (for the pieces that are not the last piece) using the Z_SYNC_FLUSH operation in zlib. (The last piece would be finished normally.) Then the pieces can be simply concatenated.

    Java 7 (but not Java 6) supports this through the four-argument deflate(byte[] b, int off, int len, int flush) method: pass Deflater.SYNC_FLUSH as the flush argument.

    Breaking up the data in this way will degrade compression, since each piece cannot benefit from the history of the preceding piece. This can be solved using the setDictionary() method. Give each thread both the data to compress and the 32K bytes of uncompressed data that precede it. Then pass those 32K to setDictionary() before the first call to deflate().

    You can see pigz for an example of parallel compression in C using zlib directly.

    Once you have your deflate stream, you wrap it appropriately to make it a zip file. See PKWARE's APPNOTE.TXT for the zip file format. You will also need the CRC-32 of the uncompressed data, along with the compressed and uncompressed sizes, to fill in the corresponding header fields.
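
The steps above can be put together roughly as follows. This is a minimal sketch rather than a definitive implementation. The class name ParallelZipSketch, its helper methods, the entry name data.bin and the 64K part size are all made up for illustration. A parallel stream stands in for the remote jpvm hosts. The header layout follows my reading of the appnote for the simplest case: one entry, no zip64, no data descriptor, and a fixed 1980-01-01 timestamp.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.stream.IntStream;
import java.util.zip.CRC32;
import java.util.zip.Deflater;
import java.util.zip.ZipInputStream;

class ParallelZipSketch {
    static final int WINDOW = 32 * 1024; // deflate can reference at most 32K of history

    // Compress data[start, end) into a raw deflate fragment. Calls are independent of each other.
    static byte[] deflatePart(byte[] data, int start, int end, boolean last) {
        Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // nowrap: no zlib header or trailer
        if (start > 0) { // prime with up to 32K of the uncompressed bytes that precede this part
            int from = Math.max(0, start - WINDOW);
            d.setDictionary(data, from, start - from);
        }
        d.setInput(data, start, end - start);
        if (last) d.finish(); // only the final part terminates the deflate stream
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        do { // a non-final part is sync-flushed so that it ends on a byte boundary
            n = d.deflate(buf, 0, buf.length, last ? Deflater.NO_FLUSH : Deflater.SYNC_FLUSH);
            out.write(buf, 0, n);
        } while (last ? !d.finished() : n == buf.length);
        d.end();
        return out.toByteArray();
    }

    // Compress all parts (a parallel stream stands in for remote hosts) and join them in order.
    static byte[] deflateInParts(byte[] data, int partSize) {
        int parts = Math.max(1, (data.length + partSize - 1) / partSize);
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        IntStream.range(0, parts).parallel()
                .mapToObj(i -> deflatePart(data, i * partSize,
                        Math.min(data.length, (i + 1) * partSize), i == parts - 1))
                .forEachOrdered(p -> joined.write(p, 0, p.length));
        return joined.toByteArray();
    }

    // The 26 bytes of fields shared by the local header and the central directory entry.
    static void putEntryFields(ByteBuffer b, long crc, int compSize, int rawSize, int nameLen) {
        b.putShort((short) 20).putShort((short) 0).putShort((short) 8); // version 2.0, flags, method 8 = deflate
        b.putShort((short) 0).putShort((short) 0x21);                   // DOS time 00:00, DOS date 1980-01-01
        b.putInt((int) crc).putInt(compSize).putInt(rawSize);
        b.putShort((short) nameLen).putShort((short) 0);                // name length, extra field length
    }

    // Wrap one raw deflate stream as a single-entry zip (no zip64, no data descriptor).
    static byte[] wrapAsZip(String name, byte[] deflated, long crc, int rawSize) {
        byte[] nm = name.getBytes(StandardCharsets.US_ASCII);
        int localLen = 30 + nm.length, centralLen = 46 + nm.length;
        ByteBuffer b = ByteBuffer.allocate(localLen + deflated.length + centralLen + 22)
                .order(ByteOrder.LITTLE_ENDIAN);
        b.putInt(0x04034b50);                                          // local file header signature
        putEntryFields(b, crc, deflated.length, rawSize, nm.length);
        b.put(nm).put(deflated);
        b.putInt(0x02014b50).putShort((short) 20);                     // central directory signature, version made by
        putEntryFields(b, crc, deflated.length, rawSize, nm.length);
        b.putShort((short) 0).putShort((short) 0).putShort((short) 0); // comment length, disk number, internal attributes
        b.putInt(0).putInt(0).put(nm);                                 // external attributes, local header offset, name
        b.putInt(0x06054b50).putShort((short) 0).putShort((short) 0);  // end of central directory, disk numbers
        b.putShort((short) 1).putShort((short) 1);                     // entries on this disk, entries in total
        b.putInt(centralLen).putInt(localLen + deflated.length).putShort((short) 0); // directory size, offset, comment length
        return b.array();
    }

    // Read the single entry back; ZipInputStream also checks the CRC-32 and the sizes.
    static byte[] readSingleEntry(byte[] zip) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            in.getNextEntry();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) > 0; ) out.write(buf, 0, n);
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[300_000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) ('a' + (i * 7 + i / 1000) % 26);
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        byte[] zip = wrapAsZip("data.bin", deflateInParts(data, 64 * 1024), crc.getValue(), data.length);
        System.out.println("round trip ok: " + Arrays.equals(data, readSingleEntry(zip)));
    }
}
```

In a real distributed setup, each host would receive its own part plus the 32K of uncompressed data preceding it, and would return only the bytes that deflatePart produces. The sketch computes the CRC-32 over the whole file in one place, which is the simplest option when the coordinator already holds the uncompressed data.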