Tags: .net, vb.net, compression, gzip, gzipstream

Is there a problem with IO.Compression?


I've just started compressing files in VB.NET, using the following code. Since I'm targeting Fx 2.0, I can't use the Stream.CopyTo method.

My code, however, gives extremely poor results compared to the Normal gzip compression profile in 7-zip. For example, my code turns a 630 MB Outlook archive into 740 MB (larger than the input!), while 7-zip gets it down to 490 MB.

Here is the code. Is there a blatant mistake (or several)?

Using Input As New IO.FileStream(SourceFile, IO.FileMode.Open, IO.FileAccess.Read, IO.FileShare.Read)
    Using outFile As IO.FileStream = IO.File.Create(DestFile)
        Using Compress As IO.Compression.GZipStream = New IO.Compression.GZipStream(outFile, IO.Compression.CompressionMode.Compress)
            'TODO: Figure out the right buffer size.
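            ' Note: VB array bounds are inclusive, so Buffer(524228) allocates 524,229 bytes (roughly 512 KB).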
            Dim Buffer(524228) As Byte
            Dim ReadBytes As Integer = 0

            While True
                ReadBytes = Input.Read(Buffer, 0, Buffer.Length)
                If ReadBytes <= 0 Then Exit While
                Compress.Write(Buffer, 0, ReadBytes)
            End While
        End Using
    End Using
End Using

I've tried multiple buffer sizes, but I get similar compression times and exactly the same compression ratio.


Solution

  • EDIT, or actually rewrite: It looks like the BCL coders decided to phone it in.

    The Fx 2.0 implementation in System.dll uses statically defined, hardcoded Huffman trees tuned for plain ASCII text, rather than building the Huffman trees adaptively from the input as other Deflate implementations do. It also lacks a stored-block fallback, which is how standard GZip/Deflate implementations avoid runaway expansion on incompressible data. As a result, running anything other than plain text through it produces a file much larger than the input (the first sketch below demonstrates this), and Microsoft claims this is by design!

    Save yourself some pain and grab a third-party implementation such as SharpZipLib or DotNetZip; a sketch using SharpZipLib follows.
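
    You can see the expansion for yourself. Below is a minimal sketch (the module and variable names are mine, and the exact numbers depend on the runtime version) that pushes a megabyte of random, incompressible bytes through GZipStream; on the Fx 2.0 implementation the output comes out larger than the input:

    Imports System
    Imports System.IO
    Imports System.IO.Compression

    Module GZipExpansionDemo
        Sub Main()
            ' One megabyte of random bytes: effectively incompressible input.
            Dim Data(1048575) As Byte
            Dim Rng As New Random()
            Rng.NextBytes(Data)

            Dim Output As New MemoryStream()
            Using Compress As New GZipStream(Output, CompressionMode.Compress)
                Compress.Write(Data, 0, Data.Length)
            End Using ' Disposing the GZipStream flushes it and closes Output too.

            ' ToArray still works after the MemoryStream has been closed.
            Dim Compressed As Byte() = Output.ToArray()
            Console.WriteLine("Input:  {0:N0} bytes", Data.Length)
            Console.WriteLine("Output: {0:N0} bytes", Compressed.Length)
        End Sub
    End Module

    And here is what the replacement might look like with SharpZipLib (a sketch, assuming a reference to ICSharpCode.SharpZipLib.dll; the CompressFile wrapper is my naming, not an API of the library). GZipOutputStream slots straight into the question's copy loop, and SetLevel trades speed for ratio:

    Imports System.IO
    Imports ICSharpCode.SharpZipLib.GZip

    Module GZipWithSharpZipLib
        Sub CompressFile(ByVal SourceFile As String, ByVal DestFile As String)
            Using Input As New FileStream(SourceFile, FileMode.Open, FileAccess.Read, FileShare.Read)
                ' GZipOutputStream owns and closes the destination stream on dispose.
                Using Compress As New GZipOutputStream(File.Create(DestFile))
                    Compress.SetLevel(9) ' 0 = store only, 9 = best compression.
                    Dim Buffer(65535) As Byte ' 64 KB chunks.
                    Dim ReadBytes As Integer = Input.Read(Buffer, 0, Buffer.Length)
                    While ReadBytes > 0
                        Compress.Write(Buffer, 0, ReadBytes)
                        ReadBytes = Input.Read(Buffer, 0, Buffer.Length)
                    End While
                End Using
            End Using
        End Sub
    End Module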