c# xpath web-scraping html-parsing

Locate XPath content of HTML in C#


I am working in C# on .NET Core.

  • I have HTML files
  • For each file I have an XPath which points to part of the page

Which library or NuGet package can I use in C# to extract my data?

I want:

extractedData = xpathLib.Extract(htmlContent, xpath)

I do not want to use a technique that launches a browser process (like a Selenium driver opening Chrome), since I have to extract data from about 10,000 web pages per day.

Regards. PS: I have seen that Microsoft provides an XPath library, but it only targets XML.


Solution

  • You can use HTML Agility Pack

    This NuGet package supports XPath and LINQ queries, and it is easy to use.

    Here is an example from HTML Agility Pack:

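    using HtmlAgilityPack;

    // Download and parse the page, then select every <input> inside a <td>.
    var url = "http://html-agility-pack.net/";
    var web = new HtmlWeb();
    var doc = web.Load(url);
    // SelectNodes returns null when no node matches the XPath.
    var value = doc.DocumentNode.SelectNodes("//td/input");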
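
    The example above downloads a page, whereas the question starts from HTML that is already available as a string. A minimal sketch of the `Extract(htmlContent, xpath)` call the question asks for could look like the following. The class name `XPathExtractor`, the method name `Extract`, and the choice to return the trimmed inner text of each match are assumptions made for illustration; they are not part of HTML Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class XPathExtractor
{
    // Hypothetical helper (not part of HTML Agility Pack): returns the
    // trimmed inner text of every node that matches the XPath expression.
    public static IReadOnlyList<string> Extract(string htmlContent, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlContent); // parses the string in memory; no browser, no network

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return Array.Empty<string>();
        }

        return nodes.Select(node => node.InnerText.Trim()).ToList();
    }
}
```

    For example, `XPathExtractor.Extract("<html><body><div id='price'>42</div></body></html>", "//div[@id='price']")` yields a single entry, "42". For files on disk, `File.ReadAllText(path)` can supply `htmlContent`, or `HtmlDocument.Load(path)` can read the file directly. Because parsing happens in process, this approach should suit batch jobs of the size described in the question.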