There are several posts about encoding problems with HtmlAgilityPack, but none of them addresses this issue.
The website I am trying to parse contains non-ASCII characters such as €, ä, and ü, so I tried to set the encoding to Unicode:
public class WebpageDeserializer
{
    public WebpageDeserializer() {}

    /*
     * Example address: https://www.dslr-forum.de/showthread.php?t=1930368
     */
    public static void Deserialize(string address)
    {
        var web = new HtmlWeb();
        web.OverrideEncoding = Encoding.Unicode;
        var htmlDoc = web.Load(address);
        // Further processing fails: the text decoded as Unicode is not proper HTML (it looks more like Chinese).
    }
}
But now htmlDoc.DocumentNode.InnerHtml looks like this:
ℼ佄呃偙⁅瑨汭倠䉕䥌⁃ⴢ⼯㍗⽃䐯䑔堠呈䱍ㄠ〮吠慲獮瑩潩慮⽬䔯≎...
If I use UTF-8 or iso-8859-1 instead, the € symbol is converted to � (as are ä, ö, and ü). How can I fix this?
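The site is misconfigured: the actual encoding of the page is cp1252 (Windows-1252).
Encoding.Unicode in .NET means UTF-16 (little-endian), so it reads every two single-byte characters of the page as one 16-bit code unit, which is why the output looks like Chinese.
The code below should work: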
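// Download the raw bytes so that no automatic charset detection is applied.
var client = new HttpClient();
var buf = await client.GetByteArrayAsync("https://www.dslr-forum.de/showthread.php?t=1930368");
// Decode explicitly as Windows-1252 (code page 1252), then parse the resulting string.
var html = Encoding.GetEncoding(1252).GetString(buf);
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);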
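For completeness, here is a minimal self-contained sketch of the same idea. It assumes .NET Core or .NET 5+ and the HtmlAgilityPack NuGet package. On those runtimes code page 1252 is, as far as I know, not available until the code-pages provider has been registered, so the sketch registers it first. The class name Cp1252PageLoader and the helpers DecodeCp1252 and LoadAsync are hypothetical names made up for this illustration. Main works on a few hard-coded bytes instead of the real page, so it can be run offline to compare the decodings.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;

public static class Cp1252PageLoader
{
    // Hypothetical helper: decodes raw bytes as Windows-1252 (cp1252).
    public static string DecodeCp1252(byte[] bytes)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // must be registered before Encoding.GetEncoding(1252) can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252).GetString(bytes);
    }

    // Hypothetical helper: downloads the page as bytes and parses it as cp1252.
    public static async Task<HtmlDocument> LoadAsync(HttpClient client, string address)
    {
        byte[] buf = await client.GetByteArrayAsync(address);
        var doc = new HtmlDocument();
        doc.LoadHtml(DecodeCp1252(buf));
        return doc;
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;

        // The cp1252 bytes for the characters from the question: € ä ö ü
        byte[] sample = { 0x80, 0xE4, 0xF6, 0xFC };

        // Decoded with the right code page: prints "€äöü"
        Console.WriteLine(DecodeCp1252(sample));

        // Decoded as UTF-8: these bytes are not valid UTF-8, so replacement
        // characters (U+FFFD) are printed instead.
        Console.WriteLine(Encoding.UTF8.GetString(sample));

        // ASCII markup decoded as UTF-16: every two bytes merge into one
        // character, reproducing the start of the garbled output in the question.
        byte[] markup = Encoding.ASCII.GetBytes("<!DOCTYPE html");
        Console.WriteLine(Encoding.Unicode.GetString(markup));
    }
}
```

For the real page, something like `var doc = await Cp1252PageLoader.LoadAsync(new HttpClient(), address);` would replace the hard-coded bytes. On .NET Framework the registration step may be unnecessary, since code page 1252 is typically available there out of the box.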