I'm having trouble scraping data from this web address: http://patorjk.com/software/taag/#p=display&f=Graffiti&t=Type%20Something%20.
The problem: I've written code that is supposed to grab the contents of a certain node and print them to the console. However, the node and its contents seem to be unreachable. I know the node exists, because I added a check to my code that reports whether child nodes are found under the selected element. They are indeed found, but their content is not displayed for some reason:
    private static void getTextArt(string font, string word)
    {
        HtmlWeb web = new HtmlWeb();
        // cureHtml method is just meant to return the http address
        HtmlDocument htmlDoc = web.Load(cureHtml(font, word));
        if (web.Load(cureHtml(font, word)) != null)
            Console.WriteLine("Connection Established");
        else
            Console.WriteLine("Connection Failed!");
        var nodes = htmlDoc.DocumentNode.SelectSingleNode(nodeXpath).ChildNodes;
        foreach (HtmlNode node in nodes)
        {
            if (node != null)
                Console.WriteLine("Node Found.");
            else
                Console.WriteLine("Node not found!");
            Console.WriteLine(node.OuterHtml);
        }
    }
    private const string nodeXpath = "//div[@id='maincontent']";
}
The HTML displayed by the website looks like this:
When I run my code to check for the node and its contents and print the OuterHtml of the node matched by the XPath, this is what the console displays:
I hope some of you can explain why it is behaving this way. I've spent two days searching Google trying to figure out the problem, to no avail. Thank you all in advance.
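The content you want is loaded dynamically by JavaScript, so it is not in the HTML that HtmlWeb.Load() retrieves.
Use the HtmlWeb.LoadFromBrowser() method instead. Also, check htmlDoc for null instead of calling web.Load() a second time. Your current logic tests the result of a separate request, so it doesn't guarantee anything about the document you actually use.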
HtmlDocument htmlDoc = web.LoadFromBrowser(cureHtml(font, word));
if (htmlDoc != null)
    Console.WriteLine("Connection Established");
else
    Console.WriteLine("Connection Failed!");
Also, you'll need to decode the result.
Console.WriteLine(WebUtility.HtmlDecode(node.OuterHtml));
If this doesn't work, then your cureHtml() method is broken, or you're targeting .NET Core :)
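Putting the suggestions above together, a minimal sketch might look like the following. It assumes the HtmlAgilityPack package is referenced and that the project targets .NET Framework, as noted above. The class name TextArtScraper and the url parameter are hypothetical stand-ins, since the real cureHtml() method isn't shown. The extra null check on the selected node is my own addition, because SelectSingleNode() returns null when nothing matches the XPath.

```csharp
using System;
using System.Net;
using HtmlAgilityPack;

// Hypothetical wrapper class; the original class name is not shown in the question.
internal static class TextArtScraper
{
    private const string NodeXpath = "//div[@id='maincontent']";

    // 'url' stands in for the value returned by the question's cureHtml(font, word).
    public static void PrintTextArt(string url)
    {
        var web = new HtmlWeb();

        // Load through a browser so script-generated content is present in the document.
        HtmlDocument htmlDoc = web.LoadFromBrowser(url);
        if (htmlDoc == null)
        {
            Console.WriteLine("Connection Failed!");
            return;
        }
        Console.WriteLine("Connection Established");

        // SelectSingleNode returns null when the XPath matches nothing.
        HtmlNode container = htmlDoc.DocumentNode.SelectSingleNode(NodeXpath);
        if (container == null)
        {
            Console.WriteLine("Node not found!");
            return;
        }

        foreach (HtmlNode node in container.ChildNodes)
        {
            // Decode HTML entities before printing, as suggested above.
            Console.WriteLine(WebUtility.HtmlDecode(node.OuterHtml));
        }
    }
}
```

You would call it with the address from the question, for example TextArtScraper.PrintTextArt("http://patorjk.com/software/taag/#p=display&f=Graffiti&t=Type%20Something%20").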