Tags: c#, .net, gzipstream

GZipStream quietly fails on large file, stream ends at 2GB


I'm having trouble using GZipStream to decompress the Freebase RDF dump (30 GB of gzipped text, 480 GB uncompressed): the stream ends prematurely. No exception is thrown; gz.Read() simply starts returning zero:

using System;
using System.IO;
using System.IO.Compression;

using (var gz = new GZipStream(File.Open("freebase-rdf-latest.gz", FileMode.Open), CompressionMode.Decompress))
{
    var buffer = new byte[1048576];
    int read;
    long total = 0; // long: a full 480 GB total would overflow int
    while ((read = gz.Read(buffer, 0, buffer.Length)) > 0)
        total += read;

    // total is 1945715682 here
    // subsequent reads return 0
}

The file unpacks fine with other applications (I tried gzip and 7-Zip).

Sniffing around, I found this note in the previous version of the GZipStream documentation on MSDN:

The GZipStream class might not be able to decompress data that results in over 8 GB of uncompressed data.

The note has been removed in the latest version of the doc. I'm using .NET 4.5.2 and for me the stream ended after just under 2GB had been decompressed.

Does anyone know more about this limitation? The language in the docs implies preconditions other than just unpacking more than 8 GB - and I'm fairly certain I've used GZipStream in the past to process very large files without hitting this.

Also, can anyone recommend a drop-in replacement for GZipStream that I might use instead of System.IO.Compression?

update

I tried replacing System.IO.Compression with Ionic.Zlib (DotNetZip) and got the same result.

I tried ICSharpCode.SharpZipLib's GZipInputStream and got "unknown block type 6" on the very first read.

I tried SevenZipSharp, but there is no stream decorator for reading - only various blocking "Extract" methods that unpack the entire stream, which is not what I want.

another update

Using zlib1.dll, the following code unpacks the entire file correctly. It also does it in about a quarter of the time GZipStream took!

using System;
using System.Runtime.InteropServices;

var gzFile = gzopen("freebase-rdf-latest.gz", "rb");

var buffer = new byte[1048576];
int read;
long total = 0; // long again: the full uncompressed size overflows int
while ((read = gzread(gzFile, buffer, buffer.Length)) > 0)
    total += read;

gzclose(gzFile); // release the zlib handle

[DllImport("zlib1")] static extern IntPtr gzopen(string path, string mode);
[DllImport("zlib1")] static extern int gzread(IntPtr gzFile, byte[] buf, int len);
[DllImport("zlib1")] static extern int gzclose(IntPtr gzFile);

So apparently all of the existing GZip libraries in .NET have some compatibility issue with zlib. The zlib1.dll I used came from my mingw64 directory (there are about a dozen copies of zlib1.dll on my machine, but this was the only 64-bit one).
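
By the way, gzread reports failures as a negative return value; here's a minimal sketch of surfacing the message via zlib's gzerror (the ZlibCheck class and CheckRead helper are my own names, not part of zlib):

using System;
using System.IO;
using System.Runtime.InteropServices;

static class ZlibCheck
{
    // gzerror returns a pointer to zlib's message for the last error on this handle
    [DllImport("zlib1")] static extern IntPtr gzerror(IntPtr gzFile, out int errnum);

    // throw if gzread signalled an error (negative return value)
    public static void CheckRead(IntPtr gzFile, int read)
    {
        if (read >= 0) return;
        int errnum;
        var msg = Marshal.PtrToStringAnsi(gzerror(gzFile, out errnum));
        throw new IOException(string.Format("gzread failed (zlib error {0}): {1}", errnum, msg));
    }
}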


Solution

  • I'm a bit late, but I have found the reason and a solution for this problem.

    This large file contains not just one gzip stream but exactly 200 concatenated streams (compressed size per stream: 150-155 MB).

    First "gzip-file" use the optional extra-fields to describe the lengths for all compressed gzip-stream. Many Uncompressors did not supporting this streaming-style and decode only the first entry. (150 MB -> 2 GB)

    1.: the header-reading method (sorry if it looks a bit hacky :-)

    static long[] ReadGzipLengths(Stream stream)
    {
      if (!stream.CanSeek || !stream.CanRead) return null; // can seek and read?
    
      int fieldBytes;
      if (stream.ReadByte() == 0x1f && stream.ReadByte() == 0x8b // gzip magic code
          && stream.ReadByte() == 0x08 // compression method: deflate
          && stream.ReadByte() == 0x04 // flags: has extra field (FEXTRA)
          && stream.ReadByte() + stream.ReadByte() + stream.ReadByte() + stream.ReadByte() >= 0 // unix timestamp (ignored)
          && stream.ReadByte() == 0x00 // extra flags: should be zero
          && stream.ReadByte() >= 0 // OS type (ignored)
          && (fieldBytes = stream.ReadByte() + stream.ReadByte() * 256 - 4) > 0 // length of extra field (minus the 4-byte subfield header)
          && stream.ReadByte() == 0x53 && stream.ReadByte() == 0x5a // subfield ID: must be "SZ" (gzip sizes as uint32 values)
          && stream.ReadByte() + stream.ReadByte() * 256 == fieldBytes // subfield length: must match
        )
        )
      {
        var buf = new byte[fieldBytes];
        if (stream.Read(buf, 0, fieldBytes) == fieldBytes && fieldBytes % 4 == 0)
        {
          var result = new long[fieldBytes / 4];
          for (int i = 0; i < result.Length; i++) result[i] = BitConverter.ToUInt32(buf, i * sizeof(uint));
          stream.Position = 0; // reset stream-position
          return result;
        }
      }
    
      // --- fallback for normal gzip-files or unknown structures ---
      stream.Position = 0; // reset stream-position
      return new[] { stream.Length }; // return single default-length
    }
    

    2.: the reader

    static void Main(string[] args)
    {
      using (var fileStream = File.OpenRead(@"freebase-rdf-latest.gz"))
      {
        long[] gzipLengths = ReadGzipLengths(fileStream);
        long gzipOffset = 0;
    
        var buffer = new byte[1048576];
        long total = 0;
    
        foreach (long gzipLength in gzipLengths)
        {
          fileStream.Position = gzipOffset; // re-seek: GZipStream reads ahead, so the file position after Dispose() is past the stream boundary
    
          using (var gz = new GZipStream(fileStream, CompressionMode.Decompress, true)) // true <- don't close FileStream at Dispose()
          {
            int read;
            while ((read = gz.Read(buffer, 0, buffer.Length)) > 0) total += read;
          }
    
          gzipOffset += gzipLength;
    
          Console.WriteLine("Uncompressed Bytes: {0:N0} ({1:N2} %)", total, gzipOffset * 100.0 / fileStream.Length);
        }
      }
    }
    

    3.: results

    Uncompressed Bytes: 1.945.715.682 (0,47 %)
    Uncompressed Bytes: 3.946.888.647 (0,96 %)
    Uncompressed Bytes: 5.945.104.284 (1,44 %)
    ...
    ...
    Uncompressed Bytes: 421.322.787.031 (99,05 %)
    Uncompressed Bytes: 423.295.620.069 (99,53 %)
    Uncompressed Bytes: 425.229.008.315 (100,00 %)
    

    Needs some time (30-40 min) but it works! (without external libs)

    Speed: about 200 MB/s uncompressed data rate

    With a few changes, multithreading should be possible.
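
    A minimal sketch of that idea (DecompressAllParallel is a hypothetical helper, reusing the gzipLengths from step 1): because every stream's start offset is known up front, each stream can be decompressed independently on its own FileStream, e.g. with Parallel.For:

    // needs: System, System.IO, System.IO.Compression, System.Threading, System.Threading.Tasks
    static long DecompressAllParallel(string path, long[] gzipLengths)
    {
      // prefix-sum the compressed lengths into absolute start offsets
      var offsets = new long[gzipLengths.Length];
      for (int i = 1; i < offsets.Length; i++) offsets[i] = offsets[i - 1] + gzipLengths[i - 1];

      long total = 0;
      Parallel.For(0, offsets.Length,
        () => new byte[1048576], // one reusable buffer per worker thread
        (i, state, buffer) =>
        {
          using (var fs = File.OpenRead(path)) // separate file handle per gzip stream
          {
            fs.Position = offsets[i];
            using (var gz = new GZipStream(fs, CompressionMode.Decompress))
            {
              int read;
              while ((read = gz.Read(buffer, 0, buffer.Length)) > 0) Interlocked.Add(ref total, read);
            }
          }
          return buffer;
        },
        buffer => { }); // nothing to merge: total is updated via Interlocked

      return total;
    }

    The speedup is bounded by disk throughput, since all workers read from the same file.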