c# archive unzip compression deflate

Unzip a file with a particular extension (not .zip)


How can I unzip downloaded compressed files that have the following characteristics:

  • Dynamic extension (for example .23U or .23M)
  • Compressed with Deflate method and ANSI encoded
  • Can be opened by 7-Zip as an archive, but the extracted content is still not human-readable when opened.

And the following technical points:

  • Using DeflateStream does not work, whichever way I try it
  • GZipStream is meant for .gz files
  • The DotNetZip library is often recommended, but it is too heavy to reference from the project (and not really documented)

In another C++ project (whose behavior I actually need to imitate in C#), a dunzip.dll library is used and produces readable characters. A dunzip32.dll library for C# can be found online, but there is no documentation on how to use it.

EDIT :

Here are the first 100 bytes (in decimal) of the byte array that I get from the compressed file:

80 75 3 4 20 0 8 0 8 0 67 75 79 76 0 0 0 0 0 0 0 0 0 0 0 0 12 0 0 0 50 50 67 67 48 48 48 49 46 50 51 85 204 189 117 88 85 251 215 238 77 9 2 210 221 157 210 221 221 221 139 142 69 119 119 119 119 119 119 119 135 32 8 136 10 38 2 10 38 138 138 29 32 234 59 231 210 189 55 107 242 187 246 251 158 115 158 231 60 239 255

And here is a hexadecimal dump of the first 100 bytes:

0000-0010:  50 4b 03 04-14 00 08 00-08 00 60 4b-47 4c 00 00  PK...... ..`KGL..
0000-0020:  00 00 00 00-00 00 00 00-00 00 0c 00-00 00 32 32  ........ ......22
0000-0030:  43 43 30 30-30 31 2e 32-33 55 cc bd-75 58 55 fb  CC0001.2 3U..uXU.
0000-0040:  d7 ee 4d 09-02 d2 dd 9d-d2 dd dd dd-8b 8e 45 77  ..M..... ......Ew
0000-0050:  77 77 77 77-77 77 87 20-08 88 0a 26-02 0a 26 8a  wwwwww.. ...&..&.
0000-0060:  8a 1d 20 ea-3b e7 d2 bd-37 6b f2 bb-f6 fb 9e 73  ....;... 7k.....s
0000-0064:  9e e7 3c ef                                      ..<.

The fact that it starts with 50 4b 03 04 means that it is a zip-based format (see: File signatures information). Then, treating it as a zip file, I tried to decompress the data with simple methods from the MSDN examples, using a MemoryStream in one case and a FileStream in the other.

public static string UnzipString3(byte[] byteArrayCompressedContent)
{
    using (var outputStream = new MemoryStream())
    {
        using (var compressStream = new MemoryStream(byteArrayCompressedContent))
        {
            using (var deflateStream = new DeflateStream(compressStream, CompressionMode.Decompress))
            {
                deflateStream.CopyTo(outputStream);
            }
        }
        return Encoding.UTF8.GetString(outputStream.ToArray());
    }
}


public void UnzipProperZipFile()
{
    try
    {
        using (var outputStream = new MemoryStream())
        {
            FileInfo fileInfo = new FileInfo("NormalZip.zip");
            FileStream fileStream = fileInfo.OpenRead();
            fileStream.Position = 2;
            using (var deflateStream = new DeflateStream(fileStream, CompressionMode.Decompress))
            {
                deflateStream.CopyTo(outputStream);
            }
            string res = Encoding.UTF8.GetString(outputStream.ToArray());
        }
    }
    catch (Exception e)
    {
        Console.WriteLine("errorlole");
    }
}

In both cases, it throws a "Block length does not match with its complement" error. However, this is the method recommended by Microsoft, and it is supposed to work this way. I realized that if I skip the first two bytes, the error goes away, but the result is an empty string...

EDIT :

I'm apparently facing the same issue with "proper" zip files (even though other people have succeeded with the same algorithms), so I'm going to try external unzipping libraries.

EDIT :

It works with the ZipArchive class suggested by Lasse Vågsæther Karlsen, and his code works perfectly for getting the data after decompression. What remains is to turn that data into something understandable. I don't know much about the data, except:

  • Notepad tells me it's ANSI encoded

After extracting the file and copying the data into a MemoryStream, I try to decode it with each of the common encodings:

entryStream.CopyTo(memoryStream);
string laChaineUTF8 = Encoding.UTF8.GetString(memoryStream.ToArray());
string laChaineDefault = Encoding.Default.GetString(memoryStream.ToArray());
string laChaineUnicode = Encoding.Unicode.GetString(memoryStream.ToArray());
string laChaineASCII = Encoding.ASCII.GetString(memoryStream.ToArray());
string laChaineBigEndianUnicode = Encoding.BigEndianUnicode.GetString(memoryStream.ToArray());
string laChaineUTF7 = Encoding.UTF7.GetString(memoryStream.ToArray());
string laChaineUTF32 = Encoding.UTF32.GetString(memoryStream.ToArray());

None of them gives a readable string.


Solution

  • The problem is that a .ZIP file is much more than simply deflated data. There are directory structures, checksums, file metadata, and so on.

    You need to use a class that knows about this structure. Unless the file is using some of the more advanced stuff, such as encryption and spanning archives, the .NET ZipArchive class probably does the trick.

    Here's a simple program that extracts the contents of a text file from the zip archive. You must adapt it to your needs:

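    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Text;

    // Open the file and let ZipArchive parse the ZIP structure.
    using (var file = File.Open(@"D:\Temp\Temp.zip", FileMode.Open))
    using (var archive = new ZipArchive(file))
    {
        // GetEntry returns null when the archive has no entry with this name.
        var entry = archive.GetEntry("ttt/README.md");
        if (entry == null)
            throw new FileNotFoundException("Entry not found in the archive.");

        using (var entryStream = entry.Open())
        using (var memory = new MemoryStream())
        {
            // Decompress the entry into memory, then decode the bytes as text.
            entryStream.CopyTo(memory);
            Console.WriteLine(Encoding.UTF8.GetString(memory.ToArray()));
        }
    }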
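
To illustrate why a bare DeflateStream fails on the whole file, here is a minimal sketch that reads the first local file header by hand and only then inflates the payload. It assumes the standard ZIP local file header layout (30 fixed bytes, followed by the file name and an optional extra field) and handles only the first entry, compressed with Deflate. The class and method names (ZipLocalHeaderSketch, ReadFirstLocalHeader, InflateFirstEntry) are made up for this example. It is not a replacement for ZipArchive, which also handles the central directory, checksums and multiple entries.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper for illustration only; prefer ZipArchive in real code.
public static class ZipLocalHeaderSketch
{
    // Size of the fixed part of a ZIP local file header.
    private const int FixedHeaderSize = 30;

    // ZIP stores multi-byte numbers in little-endian order.
    private static ushort ReadUInt16LE(byte[] data, int offset)
    {
        return (ushort)(data[offset] | (data[offset + 1] << 8));
    }

    // Parses the first local file header and returns where the compressed data starts.
    public static (int DataOffset, ushort Method, string Name) ReadFirstLocalHeader(byte[] zip)
    {
        // Signature "PK\x03\x04", i.e. the bytes 50 4b 03 04.
        if (zip.Length < FixedHeaderSize ||
            zip[0] != 0x50 || zip[1] != 0x4B || zip[2] != 0x03 || zip[3] != 0x04)
        {
            throw new InvalidDataException("Not a ZIP local file header.");
        }

        ushort method = ReadUInt16LE(zip, 8);        // 8 = Deflate, 0 = stored
        ushort nameLength = ReadUInt16LE(zip, 26);
        ushort extraLength = ReadUInt16LE(zip, 28);

        // Assumption: the entry name is plain ASCII (real archives may use CP437 or UTF-8).
        string name = Encoding.ASCII.GetString(zip, FixedHeaderSize, nameLength);

        return (FixedHeaderSize + nameLength + extraLength, method, name);
    }

    // Inflates the first entry by skipping its header before handing the bytes to DeflateStream.
    public static string InflateFirstEntry(byte[] zip, Encoding encoding)
    {
        var header = ReadFirstLocalHeader(zip);
        if (header.Method != 8)
        {
            throw new NotSupportedException("This sketch only handles Deflate.");
        }

        using (var input = new MemoryStream(zip, header.DataOffset, zip.Length - header.DataOffset))
        using (var deflate = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            deflate.CopyTo(output);
            return encoding.GetString(output.ToArray());
        }
    }
}
```

For the bytes posted in the question, the name length field is 0c 00 (12) and the extra field length is 0, so the compressed data would start at offset 30 + 12 + 0 = 42. Starting DeflateStream at offset 0 or 2 feeds it header bytes instead of deflate data, which is consistent with the error described.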