c# .net csv utf-8 iso-8859-15

Which double quote characters are automatically replaced when converting from UTF-8 to ISO-8859-15?


I have an input file that is UTF-8 encoded. I need to use some of its content and create an ISO-8859-15 encoded CSV file from it.

The problem is that Unicode has several double-quote-like characters that are automatically replaced with the character " (Quotation Mark, U+0022) when the CSV file is written to disk.

The ones we found are:

The conversion happens automatically when I write to the CSV file like this:

using (StreamWriter sw = new StreamWriter(workDir + "/files/vehicles.csv", append: false, encoding: Encoding.GetEncoding("ISO-8859-15")))
{
    foreach (ad vehicle in vehicles)
    {
        sw.WriteLine(convertVehicleToCsv(vehicle));
    }
}

The method convertVehicleToCsv escapes double quotes and other special characters in the data, but it does not escape the special Unicode double quote characters. Because those are replaced automatically when the file is written, the CSV no longer conforms to RFC 4180 and is therefore corrupt. Reading it with our CSV library fails.

So the question is:

Which other characters are automatically replaced with the "normal" " character when converting from UTF-8 to ISO-8859-15? Is this documented somewhere? Or am I doing something wrong here?


Solution

  • To answer your question, here is the list of Unicode code points that .NET maps to U+0022 (what you referred to as the "normal" double quote) when using a StreamWriter as you did:

    • U+0022
    • U+02BA
    • U+030E
    • U+201C
    • U+201D
    • U+201E
    • U+FF02
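
    One way to keep the CSV valid is to replace these code points yourself before the escaping step, so that the escaping sees them as ordinary double quotes. This is only a minimal sketch: the class and method names (QuoteNormalizer, NormalizeQuotes) are hypothetical, and it assumes the list above is complete for your data.

```csharp
using System;
using System.Text;

static class QuoteNormalizer
{
    // Code points from the list above that end up as U+0022 in the output.
    // U+0022 itself is omitted because it needs no replacement.
    private static readonly char[] QuoteLookalikes =
    {
        '\u02BA', '\u030E', '\u201C', '\u201D', '\u201E', '\uFF02'
    };

    // Returns a copy of value in which every look-alike is replaced by U+0022.
    public static string NormalizeQuotes(string value)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));

        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            sb.Append(Array.IndexOf(QuoteLookalikes, c) >= 0 ? '"' : c);
        }
        return sb.ToString();
    }
}
```

    In the question's code, each field value could be passed through NormalizeQuotes before convertVehicleToCsv applies its escaping, so the resulting U+0022 characters are escaped like any other quote.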

    Building on this answer, I quickly wrote something that creates a reverse mapping from each ISO-8859-15 (Latin-9) byte back to the Unicode code points that are converted to it.

    // Needs: using System; using System.Collections.Generic; using System.Text;
    Encoding utf8 = Encoding.UTF8;
    Encoding latin9 = Encoding.GetEncoding("ISO-8859-15");
    Encoding iso = Encoding.GetEncoding(1252); // note: code page 1252 is Windows-1252; ISO-8859-1 is code page 28591
    
    // key: Latin-9 byte as hex, value: Unicode code points (hex) that are converted to that byte
    var map = new Dictionary<string, List<string>>();
    
    // same code to get each line from the file as per the linked answer
    
    while (true)
    {
        string line = reader.ReadLine();
        if (line == null) break;
        string codePointHexAsString = line.Substring(0, line.IndexOf(";"));
        int codePoint = Convert.ToInt32(codePointHexAsString, 16);
    
        // skip Unicode surrogate area
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF)
            continue;
    
        string utf16String = char.ConvertFromUtf32(codePoint);
        byte[] utf8Bytes = utf8.GetBytes(utf16String);
        byte[] latin9Bytes = Encoding.Convert(utf8, latin9, utf8Bytes);
        string latin9String = latin9.GetString(latin9Bytes);
        byte[] isoBytes = Encoding.Convert(utf8, iso, utf8Bytes);
        string isoString = iso.GetString(isoBytes); // this is not always the same as latin9String!
    
        // first byte of the Latin-9 result, formatted as hex
        string latin9HexAsString = latin9Bytes[0].ToString("X");
    
        if (!map.ContainsKey(latin9HexAsString))
        {
            map[latin9HexAsString] = new List<string>();
        }
        map[latin9HexAsString].Add(codePointHexAsString);
    }
    

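    Interestingly, ISO-8859-15 replaces more characters than the second encoding in the comparison, which I didn't expect. Note, however, that Encoding.GetEncoding(1252) returns Windows-1252 rather than ISO-8859-1. Windows-1252 has its own bytes for U+201C, U+201D and U+201E (0x93, 0x94 and 0x84), so those characters are encoded directly there instead of being replaced.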
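
    As for whether this is documented: the substitution looks like .NET's best-fit fallback, which the Encoding documentation describes as the default for code page encodings obtained through Encoding.GetEncoding(string). If you would rather fail loudly than write a silently altered character, you can request the encoding with an explicit exception fallback. The following is a minimal sketch under that assumption; the class and method names (StrictLatin9, Create, CanEncode) are hypothetical.

```csharp
using System;
using System.Text;

static class StrictLatin9
{
    // Returns ISO-8859-15 with exception fallbacks instead of the default fallback.
    // Assumption: on .NET Core / .NET 5+ this encoding is usually only available after
    // calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) once at startup.
    public static Encoding Create()
    {
        return Encoding.GetEncoding(
            "ISO-8859-15",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
    }

    // True if every character in text has an exact representation in the given encoding.
    public static bool CanEncode(Encoding strict, string text)
    {
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }
}
```

    Passing the result of StrictLatin9.Create() as the encoding argument of the StreamWriter should then raise an EncoderFallbackException for characters such as U+201C instead of quietly writing U+0022, which makes unexpected input visible before the CSV is corrupted.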