c# encoding iso-8859-1

C# Encoding: Getting special characters from their codes


I am using a C# WinForms app to scrape some data from a webpage that uses charset ISO-8859-1. It works well for many special characters, but not all.

(Note: below I end each character code with a colon instead of a semicolon, so that you see the code as I see it rather than the character it renders to.)

I looked at the Page Source and I noticed that for the ones that won't display correctly, the actual code (e.g. &#363:) is in the Page Source, instead of the value. For example, in the Page Source I see Ry&#363: Murakami, but I expect to see Ryū Murakami. Also, there are many other codes that appear as codes, such as &#350: &#333: &#353: &#269: &#259: &#537: and many more.

I have tried using WebClient.DownloadString and WebClient.DownloadData.

Try #1 Code:

using (WebClient wc = new WebClient())
{
    wc.Encoding = Encoding.GetEncoding("ISO-8859-1");
    string WebPageText = wc.DownloadString("http://www.[removed].htm");
    // Scrape WebPageText here
}

Try #2 Code:

Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;
using (WebClient wc = new WebClient())
{
    wc.Encoding = iso;
    byte[] AllData = wc.DownloadData("http://www.[removed].htm");
    byte[] utfBytes = Encoding.Convert(iso, utf8, AllData);
    string WebPageText = utf8.GetString(utfBytes);
    // Scrape WebPageText here
}

I want to keep the special characters, so please don't suggest any RemoveDiacritics examples. Am I missing something?


Solution

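  • Consider decoding your HTML input. The codes you see in the Page Source are HTML numeric character references, not an encoding problem: ISO-8859-1 has no byte for characters such as ū, so the page writes them as references. Both of your tries already read the bytes correctly; the downloaded text still needs to be HTML-decoded (for example with System.Net.WebUtility.HtmlDecode) before you scrape it.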
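
A minimal sketch of that decoding step, assuming the text has already been downloaded as in Try #1. The class name HtmlEntityDemo, the Decode helper, and the sample string are hypothetical stand-ins for illustration; WebUtility.HtmlDecode is the framework method in System.Net (HttpUtility.HtmlDecode in System.Web should behave the same way if you already reference that assembly).

```csharp
using System;
using System.Net;
using System.Text;

static class HtmlEntityDemo
{
    // Hypothetical helper: replaces HTML character references
    // (numeric or named) with the characters they stand for.
    public static string Decode(string html)
    {
        return WebUtility.HtmlDecode(html);
    }

    static void Main()
    {
        // Stand-in for the string returned by wc.DownloadString(...) in the question.
        string webPageText = "Ry&#363; Murakami";

        string decoded = Decode(webPageText);

        // Ask the console for UTF-8 output so non-Latin-1 characters can be written;
        // whether they display also depends on the console font.
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(decoded);
    }
}
```

In a scraper it is usually safer to decode the individual values after extracting them rather than the whole page, since decoding the full markup also turns escaped angle brackets and ampersands back into literal characters that can confuse later parsing.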