Tags: c#, html, dom, web-scraping, html-agility-pack

Having trouble displaying the node's content with HtmlAgilityPack


I'm having trouble scraping data from this web address: http://patorjk.com/software/taag/#p=display&f=Graffiti&t=Type%20Something%20.

The problem: I've written code that is supposed to grab the contents of a certain node and display them on the console. However, the node and its contents seem to be unreachable. I know they exist, because I added a check to my code that tells me whether nodes within a certain body are being found; they are found, but for some reason their content is not displayed:

        private static void getTextArt(string font, string word)
        {
            HtmlWeb web = new HtmlWeb();
            //cureHtml method is just meant to return the http address
            HtmlDocument htmlDoc = web.Load(cureHtml(font, word));
            if(web.Load(cureHtml(font, word)) != null)
                Console.WriteLine("Connection Established");
            else
                Console.WriteLine("Connection Failed!");

            var nodes = htmlDoc.DocumentNode.SelectSingleNode(nodeXpath).ChildNodes;

            foreach(HtmlNode node in nodes)
            {
                if(node != null)
                    Console.WriteLine("Node Found.");
                else
                    Console.WriteLine("Node not found!");

                Console.WriteLine(node.OuterHtml);
            }
        }

        private const string nodeXpath = "//div[@id='maincontent']";
}

The HTML displayed by the website looks like this:

[Image: the HTML code within the website. Arrows point at the node I'm trying to reach and at the content within it that I'm trying to display on the console.]

When I run my code to check for the node and its contents and print the OuterHtml string of the node matched by the XPath, this is what the console displays:

[Image: console window display]

I hope some of you can explain why it behaves this way. I've searched Google for two days trying to figure out the problem, to no avail. Thank you all in advance.


Solution

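  • The content you want is loaded dynamically, so it is not part of the HTML that HtmlWeb.Load() downloads.

    Use the HtmlWeb.LoadFromBrowser() method instead. Also, check htmlDoc for null rather than calling web.Load() a second time: the second call is a separate request, so its result tells you nothing about the document you actually use.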

            HtmlDocument htmlDoc = web.LoadFromBrowser(cureHtml(font, word));
            if (htmlDoc != null)
                Console.WriteLine("Connection Established");
            else
                Console.WriteLine("Connection Failed!");
    

    Also, you'll need to HTML-decode the result (WebUtility lives in the System.Net namespace).

                Console.WriteLine(WebUtility.HtmlDecode(node.OuterHtml));
    

    If this doesn't work, then either your cureHtml() method is broken or you're targeting .NET Core, where, as far as I know, LoadFromBrowser() is not available :)
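
To make the decoding step concrete, here is a minimal sketch that uses only the base class library, so it can be tried without HtmlAgilityPack. The class and method names (TextArtDecoding, DecodeMarkup) are made up for illustration and are not part of any library; the only real API call is WebUtility.HtmlDecode.

```csharp
using System.Net;

// Hypothetical helper for illustration; only WebUtility.HtmlDecode is a real API.
public static class TextArtDecoding
{
    // Turns HTML entities such as &lt;, &gt; and &amp; back into the
    // characters they stand for, so scraped markup prints as readable text.
    public static string DecodeMarkup(string encodedHtml)
    {
        return WebUtility.HtmlDecode(encodedHtml);
    }
}
```

Inside the loop from the question, this would be called as Console.WriteLine(TextArtDecoding.DecodeMarkup(node.OuterHtml)); an input such as "&lt;pre&gt;" comes back as "<pre>".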