We have an absolutely massive help document created in Word, which was used to generate an even more massive and unwieldy HTM document. Using C# and this library, I want to grab and display only one section of this file at any point in my application. Sections are split up like this:
<!--logical section starts here -->
<div>
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section A</a></h1>
</div>
<div> Lots of unnecessary markup for simple formatting... </div>
.....
<!--logical section ends here -->
<div>
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section B</a></h1>
</div>
Logically speaking, there is an h1 with a section name in an a tag. I want to select everything from the outer containing div until I encounter another h1, and exclude that div.
The section name is in an <a> tag under an h1, which has multiple children (about 6 each).
My attempt:
var startNode = helpDocument.DocumentNode.SelectSingleNode("//h1/a[contains(., '" + sectionName + "')]");
// go up one level from the a node to the h1 element
startNode = startNode.ParentNode;
// get the start index as the index of the div containing the h1 element
int startNodeIndex = startNode.ParentNode.ChildNodes.IndexOf(startNode);
// here I am not sure how to get the endNode location.
var endNode = ?;
int endNodeIndex = endNode.ParentNode.ChildNodes.IndexOf(endNode);
// select everything from the start index to the end index
var nodes = startNode.ParentNode.ChildNodes.Where((n, index) => index >= startNodeIndex && index <= endNodeIndex).Select(n => n);
Since I haven't been able to find documentation on this, I don't know how to get from my start node to the next h1 element. Any suggestions would be appreciated.
I think this will do it, though it assumes that h1 tags appear only in section headings. If that's not the case, you can add a Where on the descendants to apply further filters to any h1 nodes it finds. Note that this collects all siblings of the div it finds until it reaches the next div with a section name.
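using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

private List<HtmlNode> GetSection(HtmlDocument helpDocument, string sectionName)
{
    // Find the div whose text is exactly the section name. Trim() is needed because
    // InnerText keeps the line breaks around the nested h1 in the sample markup.
    HtmlNode startNode = helpDocument.DocumentNode.Descendants("div")
        .FirstOrDefault(d => d.InnerText.Trim().Equals(sectionName, StringComparison.InvariantCultureIgnoreCase));
    if (startNode == null)
        return null; // section not found

    // Collect the following siblings until one contains an h1 (the next section heading).
    // The heading div itself is not added to the result.
    List<HtmlNode> section = new List<HtmlNode>();
    HtmlNode sibling = startNode.NextSibling;
    while (sibling != null && !sibling.Descendants("h1").Any())
    {
        section.Add(sibling);
        sibling = sibling.NextSibling;
    }
    return section;
}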
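If you would rather start from the XPath in the question and keep the heading div in the result, the same sibling walk can be sketched as below. This is only a sketch under a few assumptions: the h1 is a direct child of its containing div (as in the sample markup), h1 tags appear only in section headings, and the section name contains no single quote (it is concatenated into the XPath). The names HelpSections and GetSectionWithHeading, and the tiny inline document in Main, are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class HelpSections
{
    // Returns the heading div for sectionName plus every following sibling,
    // stopping before the next sibling that contains an h1.
    public static List<HtmlNode> GetSectionWithHeading(HtmlDocument doc, string sectionName)
    {
        // Same XPath as in the question; a name containing a single quote would break it.
        HtmlNode anchor = doc.DocumentNode.SelectSingleNode(
            "//h1/a[contains(., '" + sectionName + "')]");
        if (anchor == null)
            return new List<HtmlNode>(); // section not found

        // a -> h1 -> containing div (assumes the h1 is a direct child of that div).
        HtmlNode headingDiv = anchor.ParentNode.ParentNode;

        var section = new List<HtmlNode> { headingDiv };
        for (HtmlNode sibling = headingDiv.NextSibling;
             sibling != null && !sibling.Descendants("h1").Any();
             sibling = sibling.NextSibling)
        {
            section.Add(sibling);
        }
        return section;
    }

    static void Main()
    {
        // Hypothetical miniature of the help file, just to exercise the method.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<div><h1><a name=\"a\">Section A</a></h1></div>" +
            "<div>Body of A</div>" +
            "<div><h1><a name=\"b\">Section B</a></h1></div>" +
            "<div>Body of B</div>");

        foreach (HtmlNode node in GetSectionWithHeading(doc, "Section A"))
            Console.WriteLine(node.OuterHtml);
    }
}
```

Returning an empty list instead of null is a design choice here, so callers can loop over the result without a null check. Note also that contains() matches substrings, so "Section A" would also match a heading such as "Section AB" if that heading comes first in the document.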