Tags: c# | gzip | out-of-memory | compression | gzipstream

GZIP decompression C# OutOfMemory


I have many large gzip files (approximately 10 MB to 200 MB) that I downloaded from FTP and need to decompress.

So I googled and found the following solution for gzip decompression:

    static byte[] Decompress(byte[] gzip)
    {
        using (GZipStream stream = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
        {
            const int size = 4096;
            byte[] buffer = new byte[size];
            using (MemoryStream memory = new MemoryStream())
            {
                int count = 0;
                do
                {
                    count = stream.Read(buffer, 0, size);
                    if (count > 0)
                    {
                        memory.Write(buffer, 0, count);
                    }
                }
                while (count > 0);
                return memory.ToArray();
            }
        }
    }

It works well for files below 50 MB, but once the input exceeds 50 MB I get a System.OutOfMemoryException. The last position and length of the memory stream before the exception was 134217728 (128 MB). I don't think it is related to my physical memory; I understand that I can't have an object larger than 2 GB since I'm running a 32-bit process.

I also need to process the data after decompressing the files. I'm not sure if a memory stream is the best approach here, but I don't really like writing to a file and then reading it back again.

My questions

  • Why did I get a System.OutOfMemoryException?
  • What is the best possible way to decompress gzip files and do some text processing afterwards?

Solution

  • The memory allocation strategy of MemoryStream is not friendly for huge amounts of data.

    Since the contract of MemoryStream is to keep a contiguous array as its underlying storage, it has to reallocate that array repeatedly as a large stream grows (roughly log2(size_of_stream) times). The side effects of such reallocation are:

    • long copy delays on each reallocation
    • the new array must fit into free address space that is already heavily fragmented by previous allocations
    • the new array ends up on the Large Object Heap (LOH), which has its own quirks (no compaction, collected only during Gen 2 collections).

    As a result, pushing a large (100 MB+) stream through a MemoryStream will likely cause an out-of-memory exception on x86 systems. In addition, the most common pattern for returning the data is to call ToArray, as you do, which requires roughly the same amount of space again as the last buffer used by the MemoryStream.

    Approaches to solve this:

    • The cheapest fix is to pre-grow the MemoryStream to approximately the size you need (preferably slightly larger). You can pre-compute the required size by reading the stream into a fake stream that does not store anything (a waste of CPU, but it lets you measure). Also consider returning a stream instead of a byte array (or returning the MemoryStream's buffer along with its length).
    • Another option, if you need the whole stream or byte array, is to use a temporary file stream instead of a MemoryStream to store the large amount of data.
    • A more complicated approach is to implement a stream that chunks the underlying data into smaller (e.g. 64 KB) blocks, avoiding allocations on the LOH and data copying when the stream needs to grow.
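A sketch of the first approach, pre-sizing the MemoryStream so it never has to reallocate its backing array (`DecompressPreSized` and the `expectedSize` parameter are illustrative names, not part of the original code):

```csharp
using System.IO;
using System.IO.Compression;

static class GzipHelper
{
    // Pre-grow approach: allocate the MemoryStream's backing buffer once up
    // front, so writing never triggers a reallocation (as long as
    // expectedSize was large enough).
    public static MemoryStream DecompressPreSized(byte[] gzip, int expectedSize)
    {
        using (var stream = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
        {
            var memory = new MemoryStream(expectedSize);
            var buffer = new byte[4096];
            int count;
            while ((count = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                memory.Write(buffer, 0, count);
            }
            memory.Position = 0;
            return memory; // hand back the stream itself instead of ToArray()
        }
    }
}
```

Callers that really need the raw bytes can use `memory.GetBuffer()` together with `memory.Length` instead of `ToArray()`, which avoids the second full-size copy.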
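The temporary-file option can be sketched like this (`DecompressToTempFile` and `gzipPath` are illustrative names; `Stream.CopyTo` requires .NET 4 or later):

```csharp
using System.IO;
using System.IO.Compression;

static class GzipToDisk
{
    // Temporary-file approach: decompress straight to disk instead of into a
    // MemoryStream, so the large payload never has to fit into the fragmented
    // x86 address space.
    public static string DecompressToTempFile(string gzipPath)
    {
        string tempPath = Path.GetTempFileName();
        using (var input = File.OpenRead(gzipPath))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var output = File.Create(tempPath))
        {
            gzip.CopyTo(output); // streams in small chunks; no large buffers
        }
        return tempPath;
    }
}
```

The text processing can then run over the temp file with a StreamReader, one line at a time, and the file can be deleted when done.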