I have a method that takes an IFormFile with a .zip extension and reads the files that are zipped inside it. It works well; however, there are cases in which the encoding of the symbols æ, å, ø is wrong and they are represented as question marks. This doesn't happen every time, so I will give some examples here.
This is what the code looks like:
var stream = file.OpenReadStream(); // This is the stream from the IFormFile
var archive = new ZipArchive(stream);
foreach (var entry in archive.Entries)
{
    // Do something with the entries
}
When the problem appears, the entry.Name can look something like this: Tilbud 2 - Det r�de hus.pdf.
I have tried passing arguments to the ZipArchive constructor, like so:
var archive = new ZipArchive(stream, ZipArchiveMode.Read, false, Encoding.ASCII);
When the encoding is ASCII, the file names look like this: Tilbud 2 - Det r?de hus.pdf.
And when I try Encoding.Unicode, the corrupted symbols come out as an escaped character code (I'm not sure the number in my example is the actual one): Tilbud 2 - Det r\u300de hus.pdf
The front-end is just an axios POST request. My thought is that some zips arrive at the back-end with a different encoding, which is why those symbols sometimes come through correctly.
I have also tried changing the encoding of the IFormFile stream to UTF-8 before calling the ZipArchive constructor, but this was not straightforward, so I first wanted to find out whether that is actually the problem.
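To test that theory, one diagnostic I have in mind is re-opening the archive with several candidate encodings and comparing the decoded names. This is a minimal sketch only; the code-page list is a guess, and the legacy code pages require the System.Text.Encoding.CodePages package on .NET 6:
using System.IO.Compression;
using System.Text;

// Needed on .NET 6 so the legacy code pages below can be resolved.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// Candidate encodings: UTF-8 plus DOS/Windows code pages that contain
// æ, å, ø (437 = original IBM PC, 850/865 = Western/Nordic DOS,
// 1252 = Windows Western). Guesses, not a definitive list.
var candidates = new[] { 65001, 437, 850, 865, 1252 };

foreach (var codePage in candidates)
{
    // Re-open the stream for each attempt; file is the IFormFile.
    using var stream = file.OpenReadStream();
    using var archive = new ZipArchive(stream, ZipArchiveMode.Read,
        leaveOpen: false, entryNameEncoding: Encoding.GetEncoding(codePage));

    // Note: entryNameEncoding only applies to entries whose UTF-8 flag
    // is not set in the zip header; flagged entries decode as UTF-8.
    foreach (var entry in archive.Entries)
        Console.WriteLine($"{codePage}: {entry.Name}");
}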
Feel free to ask questions and to give suggestions about this strange behaviour.
I have not worked with the .NET 6 ZipArchive class, so this is not a direct answer. But from my previous experience with zip archives, you can encounter literally any encoding for the filenames inside an archive, including very exotic ones. As far as I know, there is no standard place in the archive where the exact encoding is stored for later use during decompression. At the same time, to decompress successfully you must use exactly the same encoding that was used during compression.
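To illustrate why the exact encoding matters, here is a small demonstration. It assumes, purely as an example, that the names were written by a tool using the Nordic DOS code page 865 (the System.Text.Encoding.CodePages package is needed for that code page on .NET):
using System.Text;

Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

// "Det røde hus.pdf" as a code page 865 tool would store it (ø = 0x9B).
byte[] raw = Encoding.GetEncoding(865).GetBytes("Det røde hus.pdf");

// Only the original encoding round-trips; the wrong ones mangle ø
// exactly like in the question.
Console.WriteLine(Encoding.GetEncoding(865).GetString(raw)); // Det røde hus.pdf
Console.WriteLine(Encoding.ASCII.GetString(raw));            // Det r?de hus.pdf
Console.WriteLine(Encoding.UTF8.GetString(raw));             // Det r�de hus.pdf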
Personally, I ended up with heuristic encoding detection using chardetsharp (https://github.com/superstrom/chardetsharp). This significantly increased the success rate for unpacking, but did not solve the problem entirely (i.e. I still get some archives with decompression errors or incorrect filenames when the encoding is predicted incorrectly).
Sample usage with the SharpCompress library (I don't know how to integrate this into ZipArchive from the BCL):
var arch = ArchiveFactory.Open(path,
    new SharpCompress.Readers.ReaderOptions
    {
        LookForHeader = true,
        ArchiveEncoding = new SharpCompress.Common.ArchiveEncoding
        {
            CustomDecoder = CustomDecoder
        }
    });

arch.WriteToDirectory(targetDir, new SharpCompress.Common.ExtractionOptions
{
    ExtractFullPath = true
});
private static string CustomDecoder(byte[] arg1, int arg2, int arg3)
{
    var bytesToRead = arg1.Skip(arg2).Take(arg3).ToArray();
    if (bytesToRead.Length == 0)
    {
        return null;
    }

    // Let the Mozilla universal detector guess the charset of the raw name bytes.
    var det = new Mozilla.CharDet.UniversalDetector();
    det.HandleData(bytesToRead);
    det.DataEnd();

    // Fall back to UTF-8 when the detector gives no verdict.
    var enc = Encoding.GetEncoding(det.DetectedCharsetName ?? Encoding.UTF8.BodyName);
    return enc.GetString(bytesToRead);
}
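One caveat when running this on .NET 6: only the Unicode encodings are built in, so when the detector returns a legacy charset name such as windows-1252, Encoding.GetEncoding will throw unless the code-page provider from the System.Text.Encoding.CodePages package has been registered once at startup:
using System.Text;

// Register once (e.g. in Program.cs) so Encoding.GetEncoding can resolve
// legacy code pages like windows-1252 or IBM865 returned by the detector.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);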
It looks like the chardetsharp implementation I used does not have a NuGet package. This was not a problem for my particular task: I simply cloned the chardetsharp sources, added \src\CharDet\CharDet.csproj to the solution, and used it as a ProjectReference. Then I called chardetsharp inside my CustomDecoder as shown above.
For production, you can try another encoding detection library, for example NChardet (https://github.com/thinksea/NChardet), which is also a C# port of the original Mozilla library, so it should have a similar API (I have not used it, so I'm not sure), and it does have a NuGet package.
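If pulling in a detection library is too heavy, a cruder heuristic that may be good enough for Scandinavian filenames is to try strict UTF-8 first and fall back to a single legacy code page. This is a sketch only; the fallback code page 865 is an assumption you would tune for your data:
using System.Text;

// Strict UTF-8: throws on invalid byte sequences instead of emitting '�'.
private static readonly Encoding StrictUtf8 =
    new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

// Same Func<byte[], int, int, string> shape SharpCompress expects for CustomDecoder.
private static string FallbackDecoder(byte[] data, int index, int count)
{
    try
    {
        // Valid UTF-8 (which also covers plain ASCII names) decodes cleanly.
        return StrictUtf8.GetString(data, index, count);
    }
    catch (DecoderFallbackException)
    {
        // Not UTF-8: assume one known legacy code page instead. 865 (Nordic
        // DOS) is a guess for æ/ø/å names; requires the code-page provider
        // registration shown earlier.
        return Encoding.GetEncoding(865).GetString(data, index, count);
    }
}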