I am trying to make sense of the following piece of documentation:
Reading the "Remarks" section, in particular this portion:
When you open a zip archive file for reading and entryNameEncoding is set to a value other than null, entry names are decoded according to the following rules:
When the language encoding flag is not set, the specified entryNameEncoding is used to decode the entry name.
When the language encoding flag is set, UTF-8 is used to decode the entry name.
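As I read those two rules, the choice of decoder would look roughly like the sketch below. This is only my interpretation of the documented behavior, not the library's actual implementation: the `DecodeEntryName` helper and the `EntryNameDecoding` class are hypothetical names I made up for illustration, and I am assuming the language encoding flag is bit 11 (0x0800) of the general-purpose bit flag, as described in the ZIP specification.

```csharp
using System;
using System.Text;

static class EntryNameDecoding
{
    // Assumption: bit 11 (0x0800) of the general-purpose bit flag
    // is the language encoding flag.
    private const ushort LanguageEncodingFlag = 0x0800;

    // Hypothetical helper that mirrors the documented rules for reading
    // when entryNameEncoding is not null. Not part of System.IO.Compression.
    public static string DecodeEntryName(
        byte[] rawName, ushort generalPurposeFlags, Encoding entryNameEncoding)
    {
        bool flagSet = (generalPurposeFlags & LanguageEncodingFlag) != 0;

        // Flag set: UTF-8 is used. Flag not set: the caller's encoding is used.
        Encoding encoding = flagSet ? Encoding.UTF8 : entryNameEncoding;
        return encoding.GetString(rawName);
    }
}
```

Under that reading, a name stored as UTF-8 with the flag set should come back intact even when Latin1 is passed as entryNameEncoding.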
If I understand that section, I should be able to use something like:
new ZipArchive(utf8Stream, ZipArchiveMode.Read, true, Encoding.Latin1)
to handle both modern UTF-8 ZIP files and legacy Latin1 files that may have been written by a legacy application.
However, here is what I observe on my system today. The following xUnit test prints:
language encoding flag is set, and entry names are encoded by using UTF-8.
language encoding flag is set, and entry names are encoded by using UTF-8.
language encoding flag is set, and entry names are encoded by using UTF-8.
ZipArchive.Latin1:
finalePräsentation.pdf
münchen.pdf
Übersicht.pdf
ZipArchive.Default:
finalePräsentation.pdf
münchen.pdf
Ãœbersicht.pdf
The code is simply:
[Fact]
public async Task ZipArchive_Encoding()
{
    string[] entryNames = {"finalePräsentation.pdf", "münchen.pdf", "Übersicht.pdf"};
    var latin1Stream = new MemoryStream();
    var utf8Stream = new MemoryStream();
    {
        using (var archiveOut = new ZipArchive(latin1Stream, ZipArchiveMode.Create, true, Encoding.Latin1))
        {
            foreach (var entryName in entryNames)
            {
                /*
                 * When you write to archive files and entryNameEncoding is set to a value other than null,
                 * the specified entryNameEncoding is used to encode the entry names into bytes.
                 * The language encoding flag (in the general-purpose bit flag of the local file header)
                 * is set only when the specified encoding is a UTF-8 encoding.
                 */
                var entry = archiveOut.CreateEntry(entryName);
                await using var writer = new StreamWriter(entry.Open());
                await writer.WriteAsync("Hello World!");
            }
        }
        latin1Stream.Position = 0;
        using (var archiveOut = new ZipArchive(utf8Stream, ZipArchiveMode.Create, true, null))
        {
            foreach (var entryName in entryNames)
            {
                var containsOnlyAscii = entryName.All(char.IsAscii);
                if (!containsOnlyAscii)
                    output.WriteLine("language encoding flag is set, and entry names are encoded by using UTF-8.");
                else
                    output.WriteLine(
                        "the language encoding flag is not set, and entry names are encoded by using the current system default code page");
                var entry = archiveOut.CreateEntry(entryName);
                await using var writer = new StreamWriter(entry.Open());
                await writer.WriteAsync("Hello World!");
            }
        }
        utf8Stream.Position = 0;
    }
    {
        output.WriteLine("ZipArchive.Latin1:");
        using (var archiveIn = new ZipArchive(latin1Stream, ZipArchiveMode.Read, true, Encoding.Latin1))
        {
            foreach (var entry in archiveIn.Entries)
                output.WriteLine(entry.FullName);
        }
        output.WriteLine("ZipArchive.Default:");
        // When you open a zip archive file for reading and entryNameEncoding is set to a value other than null,
        using (var archiveIn = new ZipArchive(utf8Stream, ZipArchiveMode.Read, true, Encoding.Latin1))
        {
            // When the language encoding flag is set, UTF-8 is used to decode the entry name.
            foreach (var entry in archiveIn.Entries)
                output.WriteLine(entry.FullName);
        }
    }
}
Apparently this is a known regression: