I'm trying to get the second span tag inside each div from an HTML node collection, but for some reason I'm only getting the first span tag. I suspect the problem is in my XPath expression, but I'm not sure.
Program.cs
static void Main(string[] args) {
    var doc = new HtmlDocument();
    doc.Load("test.html");
    var htmlNodes = doc.DocumentNode.SelectNodes("//body/div/div/div");
    foreach (var node in htmlNodes) {
        Console.WriteLine(node.ChildNodes[1].InnerText);
    }
}
HTML file
<!DOCTYPE html>
<html lang='pt-br'>
<head>
<title>Teste</title>
<meta charset='utf-8'/>
<!-- Bootstrap -->
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.5.0/css/bootstrap.min.css"
integrity="sha384-9aIt2nRpC12Uk9gS9baDl411NQApFmC26EwAOH8WgZl5MYYxFfc+NcPb1dKGj7Sk" crossorigin="anonymous">
<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js" integrity="sha384-DfXdz2htPH0lsSSs5nCTpuj/zy4C+OGpamoFVy38MVBnE+IbbVYUew+OrCXaRkfj" crossorigin="anonymous"></script>
<script src="https://cdn.jsdelivr.net/npm/popper.js@1.16.0/dist/umd/popper.min.js" integrity="sha384-Q6E9RHvbIyZFJoft+2mJbHaEWldlvI9IOYy5n3zV9zzTtmI3UksdQRVvoxMfooAo" crossorigin="anonymous"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.5.0/js/bootstrap.min.js" integrity="sha384-OgVRvuATP1z7JjHLkuOU7Xw704+h835Lr+6QL9UvYjZE3Ipu6Tp75j7Bh/kR0JKI" crossorigin="anonymous"></script>
<!-- Custom CSS -->
<link rel="stylesheet" type="text/css" href="./styles.css"/>
</head>
<body>
<div class="container-fluid">
<h1 class="title">Relatório</h1>
<div id="infoField" class="container">
<div>
<span>Matricula: </span>
<span>1111</span> <!-- Supposed to be this span tag -->
</div>
<div>
<span>Nome: </span>
<span>any</span> <!-- Supposed to be this span tag -->
</div>
<div>
<span>Sobrenome: </span>
<span>any</span> <!-- Supposed to be this span tag -->
</div>
<div>
<span>Porto: </span>
<span>2</span> <!-- Supposed to be this span tag -->
</div>
</div>
</div>
</body>
</html>
Returned values
Matricula:
Nome:
Sobrenome:
Porto:
I've got a hunch that HtmlAgilityPack is reading a text node between your inner <div> and the first <span>.
That text node would be Node 0, making Node 1 (node.ChildNodes[1]) your first <span>.
This happens because some (most?) HTML parsers read anything that isn't a tag as text, including white space. And you have white space in your HTML between the <div> and the <span>.
The only way to NOT have white space, and therefore a text node, would be to write the tags up against each other, like this:
<div><span>Matricula:</span><span>1111</span></div>
If you count the text node between <div> and <span>, and the one between the two <span> tags, your second <span> would be Node 3. So, this line will probably work:
Console.WriteLine(node.ChildNodes[3].InnerText);
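If you want to check this hunch for yourself, a quick sketch like the following lists every child node of one inner <div> with its index, type, and name. The class name ChildNodeDump, the helper Describe, and the inline HTML string are made up for illustration (your real code loads test.html), and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class ChildNodeDump
{
    // Hypothetical helper: returns one line per child of the first <div>,
    // formatted as "index: NodeType name".
    public static List<string> Describe(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var div = doc.DocumentNode.SelectSingleNode("//div");
        var lines = new List<string>();
        for (int i = 0; i < div.ChildNodes.Count; i++)
        {
            var child = div.ChildNodes[i];
            lines.Add($"{i}: {child.NodeType} {child.Name}");
        }
        return lines;
    }

    static void Main()
    {
        // Same shape as one inner <div> from the question, line breaks included.
        string html = "<div>\n  <span>Matricula: </span>\n  <span>1111</span>\n</div>";
        foreach (var line in Describe(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

If the explanation above holds, the two <span> elements should show up at indices 1 and 3, with #text entries around them.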
But you probably don't want to have to reckon with text nodes and white space in the HTML. You just want the <span> tags!
Having refreshed my memory of HtmlAgilityPack, I think this will serve you better:
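foreach (var node in htmlNodes) {
    // Elements("span") returns an IEnumerable<HtmlNode>, which has no indexer,
    // so use ElementAt(1) (requires "using System.Linq;") to get the second <span>.
    Console.WriteLine(node.Elements("span").ElementAt(1).InnerText);
}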
Reference: https://html-agility-pack.net/elements
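Since you suspected the XPath expression, another option is to let XPath pick the second <span> directly. In XPath, span[2] counts only <span> element children, so white-space text nodes don't shift the position. This is a minimal sketch, not the only way to do it: the class name SecondSpanXPath, the helper SecondSpanTexts, and the trimmed-down inline HTML are made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class SecondSpanXPath
{
    // Hypothetical helper: returns the text of the second <span>
    // inside each <div> under the element with id "infoField".
    public static List<string> SecondSpanTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        // span[2] selects the second <span> child of each matched <div>.
        var spans = doc.DocumentNode.SelectNodes("//div[@id='infoField']/div/span[2]");
        if (spans == null)
        {
            return result; // SelectNodes returns null when nothing matches
        }
        foreach (var span in spans)
        {
            result.Add(span.InnerText);
        }
        return result;
    }

    static void Main()
    {
        // Trimmed-down version of the HTML from the question.
        string html =
            "<div id='infoField'>\n" +
            "  <div>\n    <span>Matricula: </span>\n    <span>1111</span>\n  </div>\n" +
            "  <div>\n    <span>Nome: </span>\n    <span>any</span>\n  </div>\n" +
            "</div>";
        foreach (var text in SecondSpanTexts(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

With your real file you would keep doc.Load("test.html") and only swap the XPath string, which also removes the need to index into ChildNodes at all.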