Tags: c#, asp.net, ajax, winforms, html-agility-pack

HTMLAgilityPack load AJAX content for scraping


I'm trying to scrape a webpage using HTMLAgilityPack in a C# Web Forms project.

All the solutions I've seen for doing this use a WebBrowser control. However, as far as I can tell, that control is only available in WinForms projects.

At present I'm calling the required page via this code:

// Requires: using HtmlAgilityPack;
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load(inputUri);
// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[@class=\"nav\"]");

Here is an example I've seen of the suggested WebBrowser control approach:

if (this.webBrowser1.Document.GetElementsByTagName("html")[0] != null)
    _htmlAgilityPackDocument.LoadHtml(this.webBrowser1.Document.GetElementsByTagName("html")[0].OuterHtml);

How can I grab the page once the AJAX content has finished loading?


Solution

  • HTMLAgilityPack only parses the HTML that the server returns for the request; it does not execute JavaScript. Anything the page loads afterwards via AJAX is therefore not visible to HTMLAgilityPack.

    Perhaps the easiest option, where feasible, is to use a browser-based tool such as Firebug to find the URL that the AJAX call fetches its data from, and then request and process that source data directly. An added advantage of this might be the ability to scrape a larger dataset.
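
The idea of going to the AJAX source directly can be sketched roughly as follows. This is a minimal sketch under several assumptions, not a definitive implementation: the class name AjaxScrape, both method names, and the ajaxUri parameter are hypothetical, and it assumes the endpoint returns an HTML fragment rather than JSON. The X-Requested-With header is included only because some servers check for it; many do not need it.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical helper class; the names here are illustrative only.
public static class AjaxScrape
{
    // Reuse a single HttpClient for the lifetime of the application.
    private static readonly HttpClient Client = new HttpClient();

    // Requests the URL that the page's JavaScript would normally call.
    // ajaxUri is assumed to be the address found in the browser's network panel.
    public static async Task<string> FetchAjaxResponseAsync(string ajaxUri)
    {
        using (var request = new HttpRequestMessage(HttpMethod.Get, ajaxUri))
        {
            // Some servers only answer requests that look like XMLHttpRequest calls.
            request.Headers.Add("X-Requested-With", "XMLHttpRequest");

            using (var response = await Client.SendAsync(request).ConfigureAwait(false))
            {
                // Throws HttpRequestException on a non-success status code.
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync().ConfigureAwait(false);
            }
        }
    }

    // Parses an HTML fragment and returns the trimmed text of every
    // div whose class attribute is exactly "nav".
    public static List<string> ExtractNavText(string htmlFragment)
    {
        var document = new HtmlDocument();
        document.LoadHtml(htmlFragment);

        var results = new List<string>();
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//div[@class=\"nav\"]");

        // SelectNodes returns null when nothing matches.
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            results.Add(node.InnerText.Trim());
        }
        return results;
    }
}
```

To use it, pass the endpoint URL to FetchAjaxResponseAsync and hand the returned string to ExtractNavText. If the endpoint turns out to return JSON instead of HTML, the fetch step stays the same but the response should go to a JSON parser rather than to HTMLAgilityPack.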