c# html-agility-pack

Parsing HTML and selecting the second span in the first table


Hi, I am parsing an HTML page with HtmlAgilityPack. My app worked well before, but it recently started failing because the page's HTML changed.

The problem is that I want to parse only one specific table (the first one), but after parsing it the code seems to continue into another table and throws an ArgumentOutOfRangeException.

This is the HTML content:

<table>
<tbody>
<tr>
<td class="a1">
<a href="/subtitles/joker-2019/farsi_persian/2110062">
<span class="l r positive-icon">
Farsi/Persian
</span>
<span>
Joker.2019.WEBRip.XviD.MP3-SHITBOX
</span>
</a>
</td>
<td class="a3">
</td>
<td class="a40">
&nbsp;
</td>
<td class="a5">
<a href="/u/695804">
meisam_t72
</a>
</td>
<td class="a6">
<div>
►► زیرنویس از میثم ططری - ویرایش شده ◄◄ - meisam_t72 کانال تلگرام&nbsp; </div>
</td>
</tr>
</tbody>
</table>

<table>
<tr>
<td class="a1">
<a href="/subtitles/joker-2019/farsi_persian/2087508">
<span class="l r bad-icon">
Farsi/Persian
</span>
<span>
Joker.2019.1080p.HC.HDRip.1400MB.DD2.0.x264-GalaxyRG
</span>
</a>
</td>
<td class="a3">
</td>
<td class="a40">
&nbsp;
</td>
<td class="a5">
<a href="/u/546114">
filmb.in
</a>
</td>
<td class="a6">
<div>
filmbin.Cloud | با نسخه HC-HDRip هماهنگ شد&nbsp; </div>
</td>
</tr>
</table>

And this is how I parse the page:

HtmlDocument doc = await web.LoadFromWebAsync(url);
HtmlNode table = doc.DocumentNode.SelectSingleNode("//table[1]//tbody");
if (table != null)
{
    foreach ((HtmlNode cell, int index) in table.SelectNodes(".//tr/td").WithIndex())
    {
        // I get the error on this line
        string Name = cell.SelectNodes("//span[2]")[index].InnerText;
    }
}

The point is that all the items in the first table are parsed correctly, but when the code reaches the second table (which it should not) I get the error.


Solution

  • A couple of things I noticed in your code that could be causing the issue:

    1. In your foreach loop, you use // to search for span[2]. Note that // searches the entire document, not just cell. So the same document-wide set of spans is selected on every iteration of the loop (5 iterations in this case).

    2. The use of index does not seem valid here. The number of matching spans is fixed, but you index into them with the position of the td in the loop, which is unrelated to it. As soon as index exceeds the number of matching spans, you get the ArgumentOutOfRangeException.

    3. You seem to be aware of this, but to reiterate: // searches the entire document, while .// searches only under the node you are currently on, which is what you want inside a loop.

    4. The following code produces the output you are looking for within //table//tbody. Anything under //table/tr is ignored because that tr is not under a tbody, so the second table in your example is skipped.

        HtmlNode table = doc.DocumentNode.SelectSingleNode("//table//tbody");
        if (table != null)
        {
            foreach (HtmlNode cell in table.SelectNodes(".//tr/td"))
            {
                var nodeWithSpanTag = cell.SelectSingleNode(".//span[2]");
                if (nodeWithSpanTag != null)
                    Console.WriteLine(nodeWithSpanTag.InnerText.Trim());
            }
        }
    

    Output

        Joker.2019.WEBRip.XviD.MP3-SHITBOX
    

    If I use //span[2] (instead of .//span[2]), I get the above line printed 5 times.

    This one-liner produces the same result as above:

     Console.WriteLine(doc.DocumentNode.SelectSingleNode("//table//tbody//tr/td//span[2]")?.InnerText.Trim());
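
To see the difference between // and .// in isolation, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced, and the inline HTML is a hypothetical, trimmed-down stand-in for the page in the question, not the real markup. It also guards against SelectNodes returning null, which HtmlAgilityPack does when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

class XPathScopeDemo
{
    static void Main()
    {
        // Hypothetical, shortened version of the two tables from the question.
        const string html = @"
<table><tbody><tr>
  <td class=""a1""><a><span>Farsi/Persian</span><span>Joker.2019.WEBRip</span></a></td>
  <td class=""a5""><a>meisam_t72</a></td>
</tr></tbody></table>
<table><tr>
  <td class=""a1""><a><span>Farsi/Persian</span><span>Joker.2019.HDRip</span></a></td>
</tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // First td inside the only tbody in the document (the first table).
        HtmlNode firstCell = doc.DocumentNode.SelectSingleNode("//table//tbody//td");

        // Absolute path: evaluated from the document root, even though it is called on a cell.
        HtmlNodeCollection absolute = firstCell.SelectNodes("//span[2]");

        // Relative path: evaluated only below the cell.
        HtmlNodeCollection relative = firstCell.SelectNodes(".//span[2]");

        // SelectNodes returns null (not an empty collection) when nothing matches,
        // so the counts are read with a null guard.
        Console.WriteLine("Matches with //span[2]:  " + (absolute?.Count ?? 0));
        Console.WriteLine("Matches with .//span[2]: " + (relative?.Count ?? 0));

        // A cell without a second span: guard before iterating or indexing.
        HtmlNode secondCell = doc.DocumentNode.SelectSingleNode("//table//tbody//td[2]");
        HtmlNodeCollection none = secondCell.SelectNodes(".//span[2]");
        if (none == null)
            Console.WriteLine("No second span in this cell.");
    }
}
```

The absolute query should pick up spans from both tables, while the relative one should stay inside the cell it is called on. The same null guard applies to table.SelectNodes(".//tr/td") in the loop above if the table might have no rows.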