C# - Visual Studio Code - encoding problem


I have a folder containing an XML file like this:

<?xml version="1.0" encoding="UTF-8"?>
<cities>
   <result>
      <city_id>-3870534</city_id>
      <country>mx</country>
      <name>Santa Bárbara</name>
      <nr_hotels>0</nr_hotels>
      <translations>
         <language>en-gb</language>
         <name>Santa Bárbara</name>
      </translations>
      <translations>
         <language>ru</language>
         <name>Санта-Барбара</name>
      </translations>
   </result>
</cities>
<!-- RUID: [UmFuZG9tSVYkc2RlIyh9YcxtmfhRwqry58sgWYNIgEV1AjdsVswrKUorBoUlR6ylFgiaj5XJ0w0DP0lL/htWqOKtE33w1EhBbLABKokIfEo=] -->

The file looks well-formed and is UTF-8 encoded; it contains Russian text and accented characters such as "á" in Santa Bárbara.
I need to read this file and create a record in a MySQL database (from C#), but I'm running into encoding problems.

PS: the DB table has a few columns (city id, country, and city translations), all text fields with the utf8_general_ci collation.

I'm using the following code to read the files (just one in this case) in a folder:

foreach (string file in Directory.EnumerateFiles(@"C:\xml_folder\" + sub_folder, "*.xml")) {
    Console.WriteLine(file);

    string response = File.ReadAllText(file, Encoding.GetEncoding("Windows-1252"));

    Console.WriteLine(response);

    var document = XDocument.Parse(response);

    foreach (var child in document.Root.Elements("result")) {
         //... code here
 
        String name_it = "";
        String name_en = "";
        String name_es = "";
        String name_fr = "";
        String name_de = "";
        String name_ru = "";

        foreach (var translationsChild in child.Elements("translations"))
        {
            switch (translationsChild.Element("language").Value)
            {
                case "it":
                    bytes = Encoding.Default.GetBytes(translationsChild.Element("name").Value);
                    name_it = Encoding.UTF8.GetString(bytes);
                    break;
                case "en-gb":
                    bytes = Encoding.Default.GetBytes(translationsChild.Element("name").Value);
                    name_en = Encoding.UTF8.GetString(bytes);
                    break;
                case "es":
                    bytes = Encoding.Default.GetBytes(translationsChild.Element("name").Value);
                    name_es = Encoding.UTF8.GetString(bytes);
                    break;
                case "fr":
                    bytes = Encoding.Default.GetBytes(translationsChild.Element("name").Value);
                    name_fr = Encoding.UTF8.GetString(bytes);
                    break;
                case "de":
                    bytes = Encoding.Default.GetBytes(translationsChild.Element("name").Value);
                    name_de = Encoding.UTF8.GetString(bytes);
                    break;
                case "ru":
                    bytes = Encoding.Default.GetBytes(translationsChild.Element("name").Value);
                    name_ru = Encoding.UTF8.GetString(bytes);
                    Console.WriteLine(name_ru);
                    break;
            }
        }

In short, I read the file, then parse it as XML to walk all the children and save them to the DB.

The problem seems related to the encoding I use when reading the string from the file. I tried reading it as Windows-1252:

string response = File.ReadAllText(file, Encoding.GetEncoding("Windows-1252"));

I also tried reading it as UTF-8:

string response = File.ReadAllText(file, System.Text.Encoding.UTF8);

but every time I get this (in the debug console and in the DB):

Santa Bárbara -> Santa B?rbara
Санта-Барбара -> ?????-??????

It looks like a problem with the way File.ReadAllText(...) works; the encoding argument seems to have no effect.
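A "?" in place of each non-ASCII letter is what .NET's encoders substitute by default when a character has no representation in the target encoding. So one possible reading of the symptom is that the text is decoded correctly but later passes through a narrower encoding on its way to the console or the DB. The sketch below only illustrates that effect; `Encoding.ASCII` is a stand-in chosen for the demo, and `LossyEncodingDemo` and `RoundTrip` are hypothetical names, not part of the code above.

```csharp
using System;
using System.Text;

static class LossyEncodingDemo
{
    // Encodes the text with the given encoding and decodes it again,
    // mimicking a sink that can only hold that encoding's characters.
    public static string RoundTrip(string text, Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes(text);
        return encoding.GetString(bytes);
    }

    static void Main()
    {
        // ASCII cannot represent 'á', so the encoder substitutes '?'.
        Console.WriteLine(RoundTrip("Santa Bárbara", Encoding.ASCII)); // Santa B?rbara

        // Every Cyrillic letter is replaced the same way; only the hyphen survives.
        Console.WriteLine(RoundTrip("Санта-Барбара", Encoding.ASCII));
    }
}
```

A round trip through `Encoding.UTF8` returns both strings unchanged, which suggests checking each hop after the file read (console output encoding, DB connection charset) rather than the read itself.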

PS: to store the data in the DB I use a DML statement like this:

cmd.CommandText = "INSERT INTO cities (city_id,country,name,nr_hotels,name_it,name_en,name_es,name_fr,name_de,name_ru,last_modified_date) VALUES(@city_id,@country,@name,@nr_hotels,@name_it,@name_en,@name_es,@name_fr,@name_de,@name_ru,@last_modified_date) on duplicate key update city_id=@city_id,country=@country,name=@name,nr_hotels=@nr_hotels,name_it=@name_it,name_en=@name_en,name_es=@name_es,name_fr=@name_fr,name_de=@name_de,name_ru=@name_ru,last_modified_date=@last_modified_date";
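Since the "?" characters also show up in the DB, the connection's character set may be worth checking alongside the column collation. The fragment below is only a sketch and assumes the MySql.Data (Connector/NET) driver, which accepts a `CharSet` option in the connection string. The server, database, and credentials are placeholders, not values from this question.

```csharp
// Hypothetical connection settings: server, database, user, and password are placeholders.
// CharSet=utf8 asks the MySql.Data connector to exchange text with the server as UTF-8,
// which matches the utf8_general_ci columns described above.
string connectionString =
    "Server=localhost;Database=my_db;Uid=my_user;Pwd=my_password;CharSet=utf8;";
```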

Can you please help me?
Thanks in advance.


Solution

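  • I don't see any point in converting to a byte array and back. Once File.ReadAllText has decoded the file as UTF-8, the string is already correct and needs no further conversion. This works properly for me: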

        string response = File.ReadAllText(file, Encoding.UTF8);
        var document = XDocument.Parse(response);
    
        foreach (var child in document.Root.Elements("result"))
        {
            //... code here
    
            String name_en = "";
            String name_ru = "";
    
    
            foreach (var translationsChild in child.Elements("translations"))
            {
                var name = translationsChild.Element("name").Value;
                Console.WriteLine(name);
                switch (translationsChild.Element("language").Value)
                {
                    case "en-gb":
                        name_en = name;
                        break;
    
                    case "ru":
                        name_ru = name;
                        break;
                }
            }
        }
    

    Output:

    Santa Bárbara
    Санта-Барбара
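As a variation on the code above, here is a minimal sketch that lets the XML parser pick the encoding from the XML declaration via `XDocument.Load`, and sets `Console.OutputEncoding` so the console does not substitute "?" when printing. The console setting is an assumption about where the "?" may come from, not something confirmed in the question. `CityXmlDemo`, `TranslationFor`, and `WriteSampleFile` are hypothetical helpers; the sample file is written to the temp folder only so the sketch runs on its own.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

static class CityXmlDemo
{
    // Returns the <name> of the first <translations> entry whose <language>
    // matches, or an empty string when there is none.
    public static string TranslationFor(XElement result, string language)
    {
        return result.Elements("translations")
            .Where(t => (string)t.Element("language") == language)
            .Select(t => (string)t.Element("name"))
            .FirstOrDefault() ?? "";
    }

    // Writes a small UTF-8 sample file (without BOM) and returns its path.
    public static string WriteSampleFile()
    {
        string path = Path.Combine(Path.GetTempPath(), "cities_sample.xml");
        string xml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
            "<cities><result>" +
            "<translations><language>en-gb</language><name>Santa Bárbara</name></translations>" +
            "<translations><language>ru</language><name>Санта-Барбара</name></translations>" +
            "</result></cities>";
        File.WriteAllText(path, xml, new UTF8Encoding(false));
        return path;
    }

    static void Main()
    {
        // Ask the console to emit UTF-8; otherwise it may print '?' for
        // characters outside its default code page.
        Console.OutputEncoding = Encoding.UTF8;

        // XDocument.Load reads the bytes itself and honours the encoding
        // named in the XML declaration, so no manual decoding is needed.
        XDocument document = XDocument.Load(WriteSampleFile());

        foreach (XElement result in document.Root.Elements("result"))
        {
            Console.WriteLine(TranslationFor(result, "en-gb"));
            Console.WriteLine(TranslationFor(result, "ru"));
        }
    }
}
```

The strings returned by `TranslationFor` can be passed to the command parameters as they are; no `GetBytes`/`GetString` step is involved.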