c# linq html-agility-pack

HtmlAgilityPack: parse table to DataTable or array


I have these tables:

<table>
<tbody>
<tr><th>Header 1</th></tr>
</tbody>
</table>

<table>
<tbody>
<tr>
<th>Header 1</th>
<th>Header 2</th>
<th>Header 3</th>
<th>Header 4</th>
<th>Header 5</th>
</tr>
<tr>
<td>text 1</td>
<td>text 2</td>
<td>text 3</td>
<td>text 4</td>
<td>text 5</td>
</tr>
</tbody>
</table>

I am trying to transform them into an array or List using this code:

var query = from table in doc.DocumentNode.SelectNodes("//table").Cast<HtmlNode>()
                         from row in table.SelectNodes("tr").Cast<HtmlNode>()
                         from header in row.SelectNodes("th").Cast<HtmlNode>()
                         from cell in row.SelectNodes("td").Cast<HtmlNode>()
                         select new { 
                             Table = table.Id, 
                             Row = row.InnerText, 
                             Header = header.InnerText,
                             CellText = cell.InnerText
                         };

But it doesn't work. What is wrong?


Solution

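  • Some notes:

    • You do not need the casts: SelectNodes already returns a collection of HtmlNode.
    • You are assuming that every row has header cells. Header rows contain only th and data rows contain only td, so selecting both from the same row cannot work.
    • SelectNodes takes an XPath expression evaluated relative to the node. A bare name such as "tr" matches only direct children, and here the rows sit inside <tbody>, so nothing matches and SelectNodes returns null.

    If I were you, I would use a foreach loop and model the data, which gives more control and efficiency. If you still want to do it your way, this is how it should look: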

    var query = from table in doc.DocumentNode.SelectNodes("//table")
                where table.Descendants("tr").Count() > 1 // make sure there are rows other than the header row
                from row in table.SelectNodes(".//tr[position()>1]") // skip the header row
                from cell in row.SelectNodes("./td")
                from header in table.SelectNodes(".//tr[1]/th") // select the header cells, which are in the first tr
                select new
                {
                    Table = table.Id,
                    Row = row.InnerText,
                    Header = header.InnerText,
                    CellText = cell.InnerText
                };
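
The foreach approach mentioned above could look roughly like the sketch below. It is only an illustration under a few assumptions: the first tr of each table holds the th headers, tables are not nested, and C# 9 or later is available for the record syntax. The names TableCell and TableParser are hypothetical and not part of HtmlAgilityPack. Unlike the LINQ query, which combines every cell with every header, this version pairs each cell with the header at the same column index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical model used only for this illustration.
public record TableCell(int TableIndex, int RowIndex, string Header, string Text);

public static class TableParser
{
    public static List<TableCell> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<TableCell>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var tables = doc.DocumentNode.SelectNodes("//table");
        if (tables == null) return result;

        for (int t = 0; t < tables.Count; t++)
        {
            // ".//tr" also finds rows wrapped in <tbody>.
            // Assumption: no nested tables, otherwise their rows would be included too.
            var rows = tables[t].SelectNodes(".//tr");
            if (rows == null || rows.Count < 2) continue; // header-only or empty table

            // Assumption: the first row holds the <th> header cells.
            var headerNodes = rows[0].SelectNodes("./th");
            if (headerNodes == null) continue;
            var headers = headerNodes.Select(h => h.InnerText.Trim()).ToList();

            for (int r = 1; r < rows.Count; r++)
            {
                var cells = rows[r].SelectNodes("./td");
                if (cells == null) continue; // row without data cells

                // Pair each cell with the header at the same column index.
                for (int c = 0; c < cells.Count && c < headers.Count; c++)
                {
                    result.Add(new TableCell(t, r, headers[c], cells[c].InnerText.Trim()));
                }
            }
        }

        return result;
    }
}
```

Calling TableParser.Parse with the HTML from the question should skip the first table, since it has only a header row, and return five entries for the second one, pairing "Header 1" with "text 1" and so on. From there the list can be grouped by TableIndex and RowIndex, or copied into a DataTable if that shape is preferred.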