c# decode encode non-ascii-characters

C#: decoding non-ASCII characters?


I am reading the meta description from a couple of sites using HtmlAgilityPack.

I noticed that when the text is not in English letters (Japanese characters, for example), the special characters are not decoded.

I am using Encoding.UTF8. Should I be using something else?

byte[] bytes = Encoding.Default.GetBytes(item.Attributes["content"].Value);
return Encoding.UTF8.GetString(bytes);

Solution

  • Per your comment, it seems your website uses Shift_JIS encoding, not UTF-8. I've added two samples, one for UTF-8 and one for Shift_JIS.

            // Requires: using System.Net; using System.Text; using HtmlAgilityPack;
            using (var client = new WebClient())
            {
                // UTF-8
                var content = client.DownloadString("http://www3.nhk.or.jp/news/");
                var doc = new HtmlDocument();
                doc.LoadHtml(content);
                var metaDescNode = doc.DocumentNode.SelectSingleNode("//meta[@name=\"description\"]");
                // Turn the attribute text back into bytes, then decode those bytes as UTF-8.
                var bytes = Encoding.Default.GetBytes(metaDescNode.Attributes["content"].Value);
                var decodedMetaDesc = Encoding.UTF8.GetString(bytes); // Correctly decoded text

                // Shift_JIS (code page 932)
                var japaneseEncoding = Encoding.GetEncoding(932);
                var content2 = client.DownloadString("http://www.toronto-electricians.com/");
                var doc2 = new HtmlDocument();
                doc2.LoadHtml(content2);
                var metaDescNode2 = doc2.DocumentNode.SelectSingleNode("//meta[@name=\"description\"]");
                // Same round trip, but the bytes are decoded as Shift_JIS this time.
                var bytes2 = Encoding.Default.GetBytes(metaDescNode2.Attributes["content"].Value);
                var decodedMetaDesc2 = japaneseEncoding.GetString(bytes2); // Correctly decoded text
            }
    

    Screenshot #1 from debugger.


    Screenshot #2 from debugger.

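The samples above recover the bytes by re-encoding a string that was already decoded once. A minimal sketch of an alternative is to decode the raw bytes directly with the page's charset, so no round trip is needed. The class name `MetaDecoding` and the helper `Decode` are hypothetical names introduced for illustration, and the sketch assumes .NET Core 3.0+ or .NET 5+, where Shift_JIS is only available after registering `CodePagesEncodingProvider`. The demo works offline on an in-memory byte array rather than a real download.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of HtmlAgilityPack or WebClient.
public static class MetaDecoding
{
    static MetaDecoding()
    {
        // Assumption: running on .NET Core / .NET 5+, where code-page encodings
        // such as Shift_JIS must be registered before GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes raw bytes using the named charset, e.g. "utf-8" or "shift_jis".
    public static string Decode(byte[] rawBytes, string charsetName)
    {
        return Encoding.GetEncoding(charsetName).GetString(rawBytes);
    }

    public static void Main()
    {
        // "\u65E5\u672C\u8A9E" is the word for "Japanese language" (three kanji).
        const string original = "\u65E5\u672C\u8A9E";

        // Simulate what a Shift_JIS page sends over the wire.
        byte[] sjisBytes = Encoding.GetEncoding("shift_jis").GetBytes(original);

        // Decoding with the wrong charset garbles the text.
        string wrong = Encoding.UTF8.GetString(sjisBytes);

        // Decoding with the charset the bytes were written in restores it.
        string right = Decode(sjisBytes, "shift_jis");

        Console.WriteLine(right == original); // True
        Console.WriteLine(wrong == original); // False
    }
}
```

With WebClient, `client.DownloadData(url)` returns the raw response bytes, which could be passed to a helper like `Decode` and the result handed to `doc.LoadHtml`. Which charset name to pass still has to come from the page itself (its Content-Type header or meta charset tag); this sketch does not detect it.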