HTML Agility Pack - parsing tables


I want to use the HTML Agility Pack to parse tables from complex web pages, but I am lost in its object model.

I looked at the link example, but did not find any table data that way. Can I use XPath to get the tables? Once the data is loaded, I do not know how to get at the tables. I have done this in Perl before (with HTML::TableParser); it was a bit clumsy, but it worked.

I would also be happy if someone could just shed some light on the right order of objects to walk when parsing.


Solution

  • How about something like this, using HTML Agility Pack:

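    using System;
    using HtmlAgilityPack;

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(@"<html><body><p><table id=""foo""><tr><th>hello</th></tr><tr><td>world</td></tr></table></body></html>");
    // "//table" matches every table in the document, at any depth.
    // Note: SelectNodes returns null (not an empty collection) when nothing matches.
    foreach (HtmlNode table in doc.DocumentNode.SelectNodes("//table")) {
        Console.WriteLine("Found: " + table.Id);
        // "tr" is relative to the table and matches its direct child rows only.
        foreach (HtmlNode row in table.SelectNodes("tr")) {
            Console.WriteLine("row");
            // "th|td" picks up both header and data cells, in document order.
            foreach (HtmlNode cell in row.SelectNodes("th|td")) {
                Console.WriteLine("cell: " + cell.InnerText);
            }
        }
    }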
    

    Note that you can make it prettier with LINQ-to-Objects if you want:

    var query = from table in doc.DocumentNode.SelectNodes("//table").Cast<HtmlNode>()
                from row in table.SelectNodes("tr").Cast<HtmlNode>()
                from cell in row.SelectNodes("th|td").Cast<HtmlNode>()
                select new {Table = table.Id, CellText = cell.InnerText};
    
    foreach(var cell in query) {
        Console.WriteLine("{0}: {1}", cell.Table, cell.CellText);
    }
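
On real-world pages two things tend to trip up the loops above: SelectNodes returns null when an XPath expression matches nothing, and rows are often wrapped in <tbody>, which the relative "tr" expression does not reach. The following is only a sketch of one way to guard against both. The TableReader class and its ReadRows method are hypothetical names made up for this example, not part of HTML Agility Pack. It assumes a reasonably recent version of the library (where HtmlNodeCollection can be used with LINQ directly) and that tables are not nested inside each other.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HTML Agility Pack.
static class TableReader
{
    // Returns one list of cell texts per row, for every table in the HTML.
    public static List<List<string>> ReadRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<List<string>>();

        // Descendants() yields an empty sequence (never null) when there is no match.
        foreach (HtmlNode table in doc.DocumentNode.Descendants("table"))
        {
            // ".//tr" also finds rows inside <thead>, <tbody> and <tfoot>.
            // Caveat: it descends into nested tables too, so nested rows
            // would be reported more than once.
            HtmlNodeCollection trs = table.SelectNodes(".//tr");
            if (trs == null)
            {
                continue; // table without rows: SelectNodes returned null
            }

            foreach (HtmlNode tr in trs)
            {
                // Keep header and data cells in document order and decode
                // entities such as &amp; in the cell text.
                List<string> cells = tr.ChildNodes
                    .Where(n => n.Name == "th" || n.Name == "td")
                    .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
                    .ToList();
                rows.Add(cells);
            }
        }

        return rows;
    }
}

static class Program
{
    static void Main()
    {
        string html = "<table><tbody><tr><th>hello</th></tr>"
                    + "<tr><td>world</td></tr></tbody></table>";

        // Print each row as a tab-separated line.
        foreach (List<string> row in TableReader.ReadRows(html))
        {
            Console.WriteLine(string.Join("\t", row));
        }
    }
}
```

Returning plain lists keeps the parsing separate from whatever you do with the data afterwards. If you need to tell tables apart, you could carry table.Id along with each row in the same way the LINQ query above does.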