I am trying to parse an autogenerated HTML file. It comes from a HAT (help authoring tool), and I have no influence over the generated HTML.
<!DOCTYPE html>
<html lang="de">
<head>
<!-- Header bla bla -->
</head>
<body class="md-nav-expanded">
<!-- Some HTML elements that don't matter -->
<div id="main">
<article>
<div id="topic-content" class="container-fluid">
<!-- Uninteresting div -->
<a id="main-content"></a>
<h2>Steuerelemente</h2>
<div class="main-content">
<h6 class="rvps5"><a name="MyAnchor1"></a><span class="rvts0"><span class="rvts13">Title 1</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 1 with inner HTML elements.</span></p>
<h6 class="rvps5"><a name="MyAnchor2"></a><span class="rvts0"><span class="rvts13">Title 2</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 2.</span></p>
<p class="rvps3"><span class="rvts8">Some more text for Title 2.</span></p>
<h6 class="rvps5"><a name="MyAnchor3"></a><span class="rvts0"><span class="rvts13">Title 3</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 3</span></p>
<p class="rvps3"><span class="rvts8"><br/></span></p>
<p class="rvps2" style="clear: both;">
<span class="rvts6">Autogenerated Text</span>
<!-- This anchor should be ignored, because it has no name attribute -->
<a class="rvts7" href="https://www.anywhere.com">Anywhere</a>
</p>
</div>
<!-- The rest of the HTML doesn't matter -->
</div> <!-- /#topic-content -->
</article>
</div> <!-- /#main -->
</body>
</html>
I want to extract the HTML from MyAnchor1 (including its parent h6, which could be any other element) up to MyAnchor2, then from MyAnchor2 up to MyAnchor3, and finally from MyAnchor3 to the end.
First, I load the file into an HtmlDocument:
htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.Load(refFile);
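Then I find the div with the class 'main-content':
// SelectSingleNode returns null if nothing matches, whereas SelectNodes(...).FirstOrDefault() would throw in that case.
var mainContentDiv = htmlDoc.DocumentNode.SelectSingleNode("//div[contains(@class, 'main-content')]");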
Now I am struggling with how to get the HTML between the anchors. I tried Substring, but the positions reported by the nodes (StartIndex and InnerLength) do not seem to match the string values.
Another approach was to select the anchors themselves, but then I don't know how to get the elements up to the next anchor (or the end).
One approach that doesn't work:
var anchors = mainContentDiv.SelectNodes(".//a[@name]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        var anchorName = anchor.GetAttributeValue<string>("name", null);
        var followingNodes = mainContentDiv.SelectNodes(".//*[preceding::a and following::a[@name = '" + anchorName + "']]");
    }
}
Can anyone please help me? Thanks.
Update:
I want to get three HTML parts:
1.
<h6 class="rvps5"><a name="MyAnchor1"></a><span class="rvts0"><span class="rvts13">Title 1</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 1 with inner HTML elements.</span></p>
2.
<h6 class="rvps5"><a name="MyAnchor2"></a><span class="rvts0"><span class="rvts13">Title 2</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 2.</span></p>
<p class="rvps3"><span class="rvts8">Some more text for Title 2.</span></p>
3.
<h6 class="rvps5"><a name="MyAnchor3"></a><span class="rvts0"><span class="rvts13">Title 3</span></span></h6>
<p class="rvps3"><span class="rvts8">Some text for Title 3</span></p>
<p class="rvps3"><span class="rvts8"><br/></span></p>
<p class="rvps2" style="clear: both;">
<span class="rvts6">Autogenerated Text</span>
<!-- This anchor should be ignored, because it has no name attribute -->
<a class="rvts7" href="https://www.anywhere.com">Anywhere</a>
</p>
Working solution: I finally have a working solution that takes the unclear structure of the generated HTML into account.
var mainContentDiv = htmlDoc.DocumentNode.SelectSingleNode("//div[contains(@class, 'main-content')]");
var childNodes = mainContentDiv.ChildNodes;
var snippets = new Dictionary<string, string>();
// The empty key holds the complete content of the div.
snippets.Add("", mainContentDiv.InnerHtml);
var anchors = mainContentDiv.SelectNodes(".//a[@name]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        var sb = new StringBuilder();
        var anchorName = anchor.GetAttributeValue<string>("name", null);
        var node = anchor;
        // Climb up while the parent is not the main-content div and contains only this one named anchor.
        while (node.ParentNode.GetAttributeValue<string>("class", null) != "main-content" && node.ParentNode.SelectNodes(".//a[@name]").Count == 1)
        {
            node = node.ParentNode;
        }
        sb.Append(node.OuterHtml);
        // Append the following siblings until one of them contains the next named anchor.
        while (node.NextSibling != null)
        {
            var nodeCollection = node.NextSibling.SelectNodes(".//a[@name]");
            if (nodeCollection != null)
                break;
            node = node.NextSibling;
            sb.Append(node.OuterHtml);
        }
        snippets.Add(anchorName, sb.ToString());
    }
}
htmlSnippes.Add(helpContextId, snippets);
Thanks all for helping.
You can try using the following code:
List<string> htmlParts = new List<string>();
var anchors = mainContentDiv.SelectNodes(".//a[@name]");
if (anchors != null)
{
    foreach (var anchor in anchors)
    {
        var node = anchor.ParentNode;
        StringBuilder sb = new StringBuilder(node.OuterHtml);
        while ((node = node.NextSibling) != null)
        {
            if (node.SelectSingleNode(".//a[@name]") != null)
                break;
            else
                sb.Append(node.OuterHtml);
        }
        htmlParts.Add(sb.ToString());
    }
}
The code assumes that each anchor element always has a parent element that is a direct child of the main-content div (the h6 in the sample). You will have to adjust it if this is not always true.
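If the anchor can sit deeper than one level below the main-content div, one possible adjustment is to climb from the anchor until the node is a direct child of the div, and only then collect siblings. The following is a minimal sketch, not a definitive implementation. The class name AnchorSplitter, the method Split, and the shortened sample HTML are hypothetical names introduced for illustration. It assumes that each direct child of the div contains at most one named anchor and that anchor names are unique.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical helper, not part of HtmlAgilityPack.
public static class AnchorSplitter
{
    // Returns a map from anchor name to the HTML of the part that starts at that anchor.
    public static Dictionary<string, string> Split(string html)
    {
        var parts = new Dictionary<string, string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode and SelectNodes both return null when nothing matches.
        var container = doc.DocumentNode.SelectSingleNode("//div[contains(@class, 'main-content')]");
        if (container == null) return parts;
        var anchors = container.SelectNodes(".//a[@name]");
        if (anchors == null) return parts;

        foreach (var anchor in anchors)
        {
            // Climb until the node is a direct child of the container,
            // so the nesting depth of the anchor does not matter.
            var node = anchor;
            while (node.ParentNode != null && node.ParentNode != container)
                node = node.ParentNode;

            var sb = new StringBuilder(node.OuterHtml);
            // Append following siblings until one is, or contains, the next named anchor.
            for (var next = node.NextSibling; next != null; next = next.NextSibling)
            {
                if (next.SelectSingleNode("descendant-or-self::a[@name]") != null)
                    break;
                sb.Append(next.OuterHtml);
            }
            // Assumption: anchor names are unique; a repeated name overwrites the earlier part.
            parts[anchor.GetAttributeValue("name", "")] = sb.ToString();
        }
        return parts;
    }

    public static void Main()
    {
        // Shortened, made-up sample in the shape of the HTML from the question.
        const string html =
            "<div class=\"main-content\">" +
            "<h6><a name=\"MyAnchor1\"></a><span>Title 1</span></h6><p>Text 1</p>" +
            "<h6><a name=\"MyAnchor2\"></a><span>Title 2</span></h6><p>Text 2</p><p>More text 2</p>" +
            "</div>";

        foreach (var part in Split(html))
        {
            Console.WriteLine("--- " + part.Key + " ---");
            Console.WriteLine(part.Value);
        }
    }
}
```

Unlike the loop in the question's working solution, this sketch does not stop climbing when a parent contains several named anchors, so two anchors inside the same direct child would both yield the same part.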