Tags: c#, .net, compression, gzipstream, system.io.compression

Compressing and decompressing very large files using System.IO.Compression.GZipStream


My problem can be described with the following statements:

  • I would like my program to be able to compress and decompress selected files
  • I have very large files (20 GB+). It is safe to assume that they will never fit into memory
  • Even after compression, the compressed file might still not fit into memory
  • I would like to use System.IO.Compression.GZipStream from the .NET Framework
  • I would like my application to be parallel

As I am a newbie to compression / decompression, I had the following idea for how to do it:

I could split the file into chunks and compress each of them separately, then merge them back into a single compressed file. (Picture showing my idea.)

Question 1 about this approach - Is compressing multiple chunks and then merging them back together going to give me the proper result, i.e. if I were to reverse the process (starting from the compressed file and decompressing it), will I get back the same original input?

Question 2 about this approach - Does this approach make sense to you? Perhaps you could direct me towards some good reading on the topic? Unfortunately, I could not find anything myself.


Solution

  • You do not need to chunk the compression just to limit memory usage. gzip is designed as a streaming format and requires on the order of 256 KB of RAM to compress. The size of the data does not matter: the input could be one byte, 20 GB, or 100 PB, and compression will still need only about 256 KB of RAM. You just read uncompressed data in and write compressed data out until done (see the first sketch below).

    The only reason to chunk the input as you diagram it is to make use of multiple cores for compression, which is a perfectly good reason for your amount of data. Then you can do exactly what you describe (the second sketch below shows one way). As long as you combine the output in the correct order, decompression will reproduce the original input, since you can always concatenate valid gzip streams to make a valid gzip stream. I would recommend making the chunks relatively large, e.g. several megabytes each, so that compression is not noticeably hurt by the chunking.

    Decompression cannot be chunked in this way, but it is much faster, so there would be little to no benefit even if you could. Decompression is usually I/O-bound.
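
First sketch: a minimal example of the streaming approach, assuming .NET's System.IO.Compression.GZipStream; the class and method names and paths here are illustrative. Memory use stays roughly constant regardless of file size, because data is copied through in small buffers rather than loaded whole:

```csharp
using System.IO;
using System.IO.Compression;

static class StreamingGzip
{
    // Compress one file with roughly constant memory use, regardless of its size.
    public static void Compress(string inputPath, string outputPath)
    {
        using (var input = File.OpenRead(inputPath))
        using (var output = File.Create(outputPath))
        using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
        {
            input.CopyTo(gzip); // streams the data through in small buffers
        }
    }

    // Reverse the process: stream the compressed file back out uncompressed.
    public static void Decompress(string inputPath, string outputPath)
    {
        using (var input = File.OpenRead(inputPath))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = File.Create(outputPath))
        {
            gzip.CopyTo(output);
        }
    }
}
```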
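
Second sketch: one possible shape for the chunked, parallel approach from your diagram. The chunk size, the batch structure, and the ReadChunk helper are illustrative choices, not anything prescribed by the library. Each chunk is compressed into its own gzip member, and writing the members back in their original order produces one valid gzip file:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

static class ParallelGzip
{
    const int ChunkSize = 8 * 1024 * 1024; // 8 MB per chunk (illustrative)

    public static void CompressParallel(string inputPath, string outputPath)
    {
        int batchSize = Environment.ProcessorCount;

        using (var input = File.OpenRead(inputPath))
        using (var output = File.Create(outputPath))
        {
            while (true)
            {
                // Read the next batch of chunks sequentially from the input.
                var chunks = new List<byte[]>();
                for (int i = 0; i < batchSize; i++)
                {
                    byte[] chunk = ReadChunk(input);
                    if (chunk.Length == 0) break;
                    chunks.Add(chunk);
                }
                if (chunks.Count == 0) break;

                // Compress each chunk in parallel into its own gzip member.
                var compressed = new byte[chunks.Count][];
                Parallel.For(0, chunks.Count, i =>
                {
                    using (var ms = new MemoryStream())
                    {
                        using (var gz = new GZipStream(ms, CompressionLevel.Optimal, leaveOpen: true))
                        {
                            gz.Write(chunks[i], 0, chunks[i].Length);
                        }
                        compressed[i] = ms.ToArray();
                    }
                });

                // Write the members back in their original order.
                foreach (var member in compressed)
                    output.Write(member, 0, member.Length);
            }
        }
    }

    // Read up to ChunkSize bytes; a shorter (or empty) array means end of file.
    static byte[] ReadChunk(Stream input)
    {
        var buffer = new byte[ChunkSize];
        int filled = 0;
        while (filled < buffer.Length)
        {
            int n = input.Read(buffer, filled, buffer.Length - filled);
            if (n == 0) break;
            filled += n;
        }
        if (filled == buffer.Length) return buffer;
        var trimmed = new byte[filled];
        Array.Copy(buffer, trimmed, filled);
        return trimmed;
    }
}
```

Memory use in this sketch is bounded by the batch size times the chunk size (plus the compressed copies), so it stays far below the size of a 20 GB+ file while still keeping all cores busy.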