Tags: c#, character-encoding, zip, c#-ziparchive

ZipArchive and Encoding


I am trying to make sense of the following piece of documentation:

Reading the "Remarks" section, in particular this portion:

When you open a zip archive file for reading and entryNameEncoding is set to a value other than null, entry names are decoded according to the following rules:

When the language encoding flag is not set, the specified entryNameEncoding is used to decode the entry name.

When the language encoding flag is set, UTF-8 is used to decode the entry name.
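To see which of the two rules applies to a given archive, it can help to look at the flag itself. The following is only a minimal sketch, not part of the original test: `FirstEntryHasUtf8Flag` and `CreateArchive` are hypothetical helper names, and it assumes the archive starts with a local file header, where the general-purpose bit flag sits at byte offset 6 and the language encoding flag is bit 11 (0x0800).

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper: reports whether the first local file header in the
// stream has general-purpose bit 11 (the language encoding flag) set.
static bool FirstEntryHasUtf8Flag(MemoryStream zip)
{
    byte[] bytes = zip.ToArray();
    // Local file header layout: signature "PK\x03\x04" (4 bytes),
    // version needed to extract (2 bytes), general-purpose bit flag (2 bytes, little-endian).
    if (bytes.Length < 8 || bytes[0] != 0x50 || bytes[1] != 0x4B || bytes[2] != 0x03 || bytes[3] != 0x04)
        throw new InvalidDataException("Stream does not start with a local file header.");
    int flags = bytes[6] | (bytes[7] << 8);
    return (flags & 0x0800) != 0;
}

// Hypothetical helper: writes a single empty entry using the given entry name encoding.
static MemoryStream CreateArchive(string entryName, Encoding? encoding)
{
    var stream = new MemoryStream();
    using (var archive = new ZipArchive(stream, ZipArchiveMode.Create, leaveOpen: true, encoding))
    {
        archive.CreateEntry(entryName);
    }
    return stream;
}

// "m\u00FCnchen.pdf" is "münchen.pdf"; the escape avoids depending on the source file encoding.
Console.WriteLine(FirstEntryHasUtf8Flag(CreateArchive("m\u00FCnchen.pdf", Encoding.Latin1)));
Console.WriteLine(FirstEntryHasUtf8Flag(CreateArchive("m\u00FCnchen.pdf", null)));
```

Going by the quoted remarks, the first archive should be the case where the specified entryNameEncoding is used on read, and the second the case where UTF-8 is used regardless of entryNameEncoding.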

If I understand that section correctly, I should be able to use something like:

new ZipArchive(utf8Stream, ZipArchiveMode.Read, true, Encoding.Latin1)

to handle both modern UTF-8 ZIP files and legacy Latin1 files that may have been written by a legacy application.

However, here is what I observe on my system today. The following xUnit test prints:

language encoding flag is set, and entry names are encoded by using UTF-8.
language encoding flag is set, and entry names are encoded by using UTF-8.
language encoding flag is set, and entry names are encoded by using UTF-8.
ZipArchive.Latin1:
finalePräsentation.pdf
münchen.pdf
Übersicht.pdf
ZipArchive.Default:
finalePräsentation.pdf
münchen.pdf
Ãœbersicht.pdf

The code is simply:

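// Note: `output` is assumed to be an xUnit ITestOutputHelper field
// injected through the test class constructor (not shown here).
[Fact]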
public async Task ZipArchive_Encoding()
{
    string[] entryNames = {"finalePräsentation.pdf", "münchen.pdf", "Übersicht.pdf"};
    var latin1Stream = new MemoryStream();
    var utf8Stream = new MemoryStream();
    {
        using (var archiveOut = new ZipArchive(latin1Stream, ZipArchiveMode.Create, true, Encoding.Latin1))
        {
            foreach (var entryName in entryNames)
            {
                /*
                 * When you write to archive files and entryNameEncoding is set to a value other than null,
                 * the specified entryNameEncoding is used to encode the entry names into bytes.
                 * The language encoding flag (in the general-purpose bit flag of the local file header)
                 * is set only when the specified encoding is a UTF-8 encoding.
                 */
                var entry = archiveOut.CreateEntry(entryName);
                await using var writer = new StreamWriter(entry.Open());
                await writer.WriteAsync("Hello World!");
            }
        }

        latin1Stream.Position = 0;

        using (var archiveOut = new ZipArchive(utf8Stream, ZipArchiveMode.Create, true, null))
        {
            foreach (var entryName in entryNames)
            {
                var containsOnlyAscii = entryName.All(char.IsAscii);
                if (!containsOnlyAscii)
                    output.WriteLine("language encoding flag is set, and entry names are encoded by using UTF-8.");
                else
                    output.WriteLine(
                        "the language encoding flag is not set, and entry names are encoded by using the current system default code page");
                var entry = archiveOut.CreateEntry(entryName);
                await using var writer = new StreamWriter(entry.Open());
                await writer.WriteAsync("Hello World!");
            }
        }

        utf8Stream.Position = 0;
    }
    {
        output.WriteLine("ZipArchive.Latin1:");
        using (var archiveIn = new ZipArchive(latin1Stream, ZipArchiveMode.Read, true, Encoding.Latin1))
        {
            foreach (var entry in archiveIn.Entries)
                output.WriteLine(entry.FullName);
        }

        output.WriteLine("ZipArchive.Default:");
        // When you open a zip archive file for reading and entryNameEncoding is set to a value other than null,
        using (var archiveIn = new ZipArchive(utf8Stream, ZipArchiveMode.Read, true, Encoding.Latin1))
        {
            // When the language encoding flag is set, UTF-8 is used to decode the entry name.
            foreach (var entry in archiveIn.Entries)
                output.WriteLine(entry.FullName);
        }
    }
}

Solution

  • Apparently this is a known regression: