Tags: c#, html, dom, html-agility-pack

How to get a specific span tag from an HTML node collection


I'm trying to get the second span tag inside each div from an HTML node collection, but for some reason I'm only getting the first span tag. I suspect the problem is in my XPath expression, but I'm not certain.

Program.cs

using System;
using HtmlAgilityPack;

class Program {
    static void Main(string[] args) {
        var doc = new HtmlDocument();
        doc.Load("test.html");

        // Select each inner <div> nested three levels below <body>.
        var htmlNodes = doc.DocumentNode.SelectNodes("//body/div/div/div");
        foreach (var node in htmlNodes) {
            Console.WriteLine(node.ChildNodes[1].InnerText);
        }
    }
}

HTML file

<doctype! html>

<html lang='pt-br'>
    <head>
        <title>Teste</title>
        <meta charset='utf-8'/>

        <!-- Bootstrap -->
        <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.0/css/bootstrap.min.css"
        integrity="sha384-9aIt2nRpC12Uk9gS9baDl411NQApFmC26EwAOH8WgZl5MYYxFfc+NcPb1dKGj7Sk" crossorigin="anonymous">
        <script src="https://code.jquery.com/jquery-3.5.1.slim.min.js" integrity="sha384-DfXdz2htPH0lsSSs5nCTpuj/zy4C+OGpamoFVy38MVBnE+IbbVYUew+OrCXaRkfj" crossorigin="anonymous"></script>
        <script src="https://cdn.jsdelivr.net/npm/popper.js@1.16.0/dist/umd/popper.min.js" integrity="sha384-Q6E9RHvbIyZFJoft+2mJbHaEWldlvI9IOYy5n3zV9zzTtmI3UksdQRVvoxMfooAo" crossorigin="anonymous"></script>
        <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js" integrity="sha384-OgVRvuATP1z7JjHLkuOU7Xw704+h835Lr+6QL9UvYjZE3Ipu6Tp75j7Bh/kR0JKI" crossorigin="anonymous"></script>
        
        <!-- Custom CSS -->
        <link rel="stylesheet" type="text/css" href="./styles.css"/>
    </head>

    <body>
        <div class="container-fluid">
            <h1 class="title">Relatório</h1>

            <div id="infoField" class="container">
                <div>
                    <span>Matricula: </span>
                    <span>1111</span> <!-- Supposed to be this span tag -->
                </div>

                <div>
                    <span>Nome: </span>
                    <span>any</span> <!-- Supposed to be this span tag -->
                </div>

                <div>
                    <span>Sobrenome: </span>
                    <span>any</span> <!-- Supposed to be this span tag -->
                </div>

                <div>
                    <span>Porto: </span>
                    <span>2</span> <!-- Supposed to be this span tag -->
                </div> 
            </div>
        </div>
    </body>
</html>

Returned values

Matricula:
Nome:
Sobrenome:
Porto:

Solution

  • My hunch is that HtmlAgilityPack is reading a text node between each inner <div> and its first <span>.

    That text node would be Node 0, making Node 1 (node.ChildNodes[1]) your first <span>.

    This happens because HTML parsers generally treat anything that isn't a tag as text, including whitespace, and your HTML has whitespace (a line break plus indentation) between the <div> and the <span>.

    The only way to NOT have white space, and therefore a text node, would be to write the tags up against each other, like this:

    <div><span>Matricula:</span><span>1111</span></div>
    

    If you include the text node between <div> and <span>, and the one that would be between the two <span> tags, your second <span> would be Node 3. So, this line will probably work:

    Console.WriteLine(node.ChildNodes[3].InnerText);
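
    To check this rather than take my word for it, here is a minimal sketch that prints every direct child of the first inner <div> with its index and node type. The ChildNodeDump class and its Describe helper are hypothetical names for illustration, and the HTML is a trimmed copy of yours loaded with LoadHtml instead of from test.html:

```csharp
using System;
using HtmlAgilityPack;

static class ChildNodeDump
{
    // Hypothetical helper: one line per direct child of the first inner <div>.
    public static string[] Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // First <div> directly under the element with id="infoField".
        var div = doc.DocumentNode.SelectSingleNode("//div[@id='infoField']/div");

        var lines = new string[div.ChildNodes.Count];
        for (int i = 0; i < div.ChildNodes.Count; i++)
        {
            var child = div.ChildNodes[i];
            // NodeType distinguishes Element nodes from Text (and Comment) nodes.
            lines[i] = $"[{i}] {child.NodeType} {child.Name}";
        }
        return lines;
    }

    static void Main()
    {
        // Trimmed copy of the markup from the question (an assumption for this sketch).
        const string html = @"<div id='infoField'>
    <div>
        <span>Matricula: </span>
        <span>1111</span>
    </div>
</div>";

        foreach (var line in Describe(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

    If the hunch is right, the listing should show text nodes at the even indexes and the two <span> elements at indexes 1 and 3.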
    

    But you probably don't want to have to reckon with text nodes and space in the HTML. You just want the <span> tags!

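    Having refreshed my memory of HtmlAgilityPack, I think this will serve you better. Elements("span") returns only the <span> child elements as an IEnumerable<HtmlNode>, so take the second one with LINQ's ElementAt (this requires using System.Linq;):

    foreach (var node in htmlNodes) {
        // Second <span> child element; text nodes are never in this sequence.
        Console.WriteLine(node.Elements("span").ElementAt(1).InnerText);
    }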
    

    Reference: https://html-agility-pack.net/elements
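
    Since you suspected the XPath expression, another option is to let XPath pick the second <span> directly. The sketch below is an illustration under assumptions, not the only way to do it: the SecondSpan class and its method names are hypothetical, the XPath uses the infoField id from your HTML, and the markup is an inline trimmed copy rather than test.html. It shows the XPath variant next to the Elements variant so you can compare them:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class SecondSpan
{
    // Variant 1: span[2] selects the second <span> child of each inner <div>.
    public static List<string> ViaXPath(HtmlDocument doc)
    {
        var result = new List<string>();
        var spans = doc.DocumentNode.SelectNodes("//div[@id='infoField']/div/span[2]");
        // SelectNodes returns null when nothing matches, so guard against it.
        if (spans == null) return result;
        foreach (var span in spans) result.Add(span.InnerText.Trim());
        return result;
    }

    // Variant 2: filter child elements by name, then take the second one.
    public static List<string> ViaElements(HtmlDocument doc)
    {
        var result = new List<string>();
        var divs = doc.DocumentNode.SelectNodes("//div[@id='infoField']/div");
        if (divs == null) return result;
        foreach (var div in divs)
        {
            // ElementAtOrDefault yields null instead of throwing if a <div> has fewer than two spans.
            var second = div.Elements("span").ElementAtOrDefault(1);
            if (second != null) result.Add(second.InnerText.Trim());
        }
        return result;
    }

    static void Main()
    {
        // Trimmed copy of the markup from the question (an assumption for this sketch).
        const string html = @"<div id='infoField'>
    <div><span>Matricula: </span> <span>1111</span></div>
    <div><span>Nome: </span> <span>any</span></div>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var text in ViaXPath(doc)) Console.WriteLine(text);
        foreach (var text in ViaElements(doc)) Console.WriteLine(text);
    }
}
```

    XPath positions are 1-based and count only the nodes matched by the step (here, <span> elements), so whitespace text nodes don't shift the index the way they do with ChildNodes.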