Tags: c#, asp.net, html-agility-pack

ASP.NET Core HtmlAgilityPack Encoding errors


There are several posts about encoding problems with HtmlAgilityPack, but none of them address this issue:

Because the website I am trying to parse contains non-ASCII characters such as ä and ü, I tried setting the encoding to Unicode:

using System.Text;
using HtmlAgilityPack;

public class WebpageDeserializer
{
    public WebpageDeserializer() {}

    /*
     * Example address: https://www.dslr-forum.de/showthread.php?t=1930368
     */
    public static void Deserialize(string address)
    {
        var web = new HtmlWeb();
        // Force HtmlWeb to decode the response as Encoding.Unicode (UTF-16).
        web.OverrideEncoding = Encoding.Unicode;
        var htmlDoc = web.Load(address);
        // Further parsing fails: the text decoded as Unicode is not proper HTML (it looks more like Chinese).
    }
}

But now

htmlDoc.DocumentNode.InnerHtml

looks like this:

ℼ佄呃偙⁅瑨汭倠䉕䥌⁃ⴢ⼯㍗⽃䐯䑔堠呈䱍ㄠ〮吠慲獮瑩潩慮⽬䔯≎...

If I use UTF-8 or ISO-8859-1 instead, the special characters (including ä, ö, ü) come out garbled. How can I fix this?


Solution

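  • The site is misconfigured, and the real encoding is cp1252 (Windows-1252). In .NET, Encoding.Unicode means UTF-16, so every two bytes of the page were decoded as a single character, which is why the output looks like Chinese.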

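    The code below should work. On .NET Core, code page 1252 is only available after CodePagesEncodingProvider has been registered; without that call, Encoding.GetEncoding(1252) throws:

    // Make legacy code pages such as 1252 available on .NET Core.
    Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

    var client = new HttpClient();
    // Download the raw bytes so nothing is decoded before the encoding is chosen.
    var buf = await client.GetByteArrayAsync("https://www.dslr-forum.de/showthread.php?t=1930368");
    var html = Encoding.GetEncoding(1252).GetString(buf);
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);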
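
To see why the two attempts in the question fail while code page 1252 works, here is a minimal sketch that decodes the same bytes in three ways. The class name EncodingDemo and its helper methods are made up for illustration, and the sample bytes are hand-written stand-ins, not data fetched from the site.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
public static class EncodingDemo
{
    // Returns the Windows-1252 encoding. On .NET Core the provider must be
    // registered first; registering it more than once is harmless.
    public static Encoding Cp1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // Decodes the same raw bytes with whichever encoding is passed in.
    public static string Decode(byte[] raw, Encoding encoding)
    {
        return encoding.GetString(raw);
    }

    public static void Main()
    {
        // Assumed sample: the first ten bytes of a typical HTML page.
        byte[] head = Encoding.ASCII.GetBytes("<!DOCTYPE ");

        // UTF-16 consumes two bytes per character, so 0x3C 0x21 ("<!")
        // turns into the single character U+213C, and so on.
        Console.WriteLine(Decode(head, Encoding.Unicode));

        // Assumed sample: the letters ä, ö, ü as single cp1252 bytes.
        byte[] umlauts = { 0xE4, 0xF6, 0xFC };

        // Decoded as cp1252, each byte maps back to the intended letter.
        Console.WriteLine(Decode(umlauts, Cp1252()));

        // Decoded as UTF-8, the bytes are invalid sequences and are
        // replaced with U+FFFD, the replacement character.
        string asUtf8 = Decode(umlauts, Encoding.UTF8);
        Console.WriteLine(asUtf8.IndexOf('\uFFFD') >= 0);
    }
}
```

The first line of output starts with the same characters as the garbled InnerHtml in the question, which suggests the page bytes were fine and only the chosen decoder was wrong. Console output of non-ASCII characters also depends on the terminal's own encoding, so comparing the strings in code is more reliable than reading them off the screen.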