Getting content between two HTML tags using Html Agility Pack


We have an absolutely massive help document created in Word, which was used to generate an even more massive and unwieldy HTM document. Using C# and Html Agility Pack, I want to grab and display only one section of this file at any point in my application. Sections are split up like this:

<!--logical section starts here -->
<div>
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section A</a></h1>
</div>
 <div> Lots of unnecessary markup for simple formatting... </div>
 .....
<!--logical section ends here -->

<div>
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section B</a></h1>
</div>

Logically speaking, there is an H1 with a section name in an a tag. I want to select everything from the outer containing div until I encounter another h1 and exclude that div.

  • Each section name is located in an <a> tag under an h1, which has multiple children (about 6 each)
  • The logical section is marked with comments
  • These comments do not exist in the actual document

My attempt:

var startNode = helpDocument.DocumentNode.SelectSingleNode("//h1/a[contains(., '"+sectionName+"')]");
//go up one level from the a node to the h1 element
startNode=startNode.ParentNode;

//get the start index as the index of the div containing the h1 element
int startNodeIndex = startNode.ParentNode.ChildNodes.IndexOf(startNode);

//here I am not sure how to get the endNode location. 
var endNode =?;

int endNodeIndex = endNode.ParentNode.ChildNodes.IndexOf(endNode);

//select everything from the start index to the end index
var nodes = startNode.ParentNode.ChildNodes.Where((n, index) => index >= startNodeIndex && index <= endNodeIndex).Select(n => n);

Since I haven't been able to find documentation on this, I don't know how to get from my start node to the next h1 element. Any suggestions would be appreciated.


Solution

  • I think this should do it, though it assumes that h1 tags appear only in section headings. If that's not the case, you can add a Where on the descendants to apply further filters to any h1 nodes it finds. Note that the result contains all siblings that follow the heading div it finds, up to the next div that contains a section heading; the heading div itself is not part of the result.

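    using System;
    using System.Collections.Generic;
    using System.Linq;
    using HtmlAgilityPack;

    private List<HtmlNode> GetSection(HtmlDocument helpDocument, string SectionName)
    {
        // Find the div whose text is the section name. Trim() is needed because
        // InnerText includes the whitespace around the nested <h1>.
        HtmlNode startNode = helpDocument.DocumentNode.Descendants("div")
            .FirstOrDefault(d => d.InnerText.Trim().Equals(SectionName, StringComparison.InvariantCultureIgnoreCase));
        if (startNode == null)
            return null; // section not found

        // Collect the following siblings until one contains the next section's <h1>.
        List<HtmlNode> section = new List<HtmlNode>();
        HtmlNode sibling = startNode.NextSibling;
        while (sibling != null && !sibling.Descendants("h1").Any())
        {
            section.Add(sibling);
            sibling = sibling.NextSibling;
        }

        return section;
    }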
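
If h1 tags can also appear outside section headings, or if the heading div itself should be part of the result (as the question asks), the method above could be adapted roughly as follows. This is only a sketch under assumptions: the names `HelpSections`, `IsSectionHead` and `GetSectionWithHeading` are made up for illustration, and it assumes a section heading is any node containing an h1 with a named anchor (`<a name="...">`), as in the question's markup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HelpSections
{
    // Hypothetical helper: true when the node contains an <h1> that wraps a
    // named anchor (<a name="...">), which is how the question's markup
    // marks a section heading.
    private static bool IsSectionHead(HtmlNode node)
    {
        return node.Descendants("h1")
                   .Any(h1 => h1.Descendants("a")
                                .Any(a => a.GetAttributeValue("name", "") != ""));
    }

    // Returns the heading div for sectionName followed by its siblings, up to
    // (but excluding) the next section heading. Returns an empty list when
    // the section is not found.
    public static List<HtmlNode> GetSectionWithHeading(HtmlDocument doc, string sectionName)
    {
        var section = new List<HtmlNode>();

        HtmlNode start = doc.DocumentNode.Descendants("div")
            .FirstOrDefault(d => IsSectionHead(d) &&
                HtmlEntity.DeEntitize(d.InnerText).Trim()
                    .Equals(sectionName, StringComparison.OrdinalIgnoreCase));
        if (start == null)
            return section;

        section.Add(start); // include the heading div itself
        for (HtmlNode sibling = start.NextSibling;
             sibling != null && !IsSectionHead(sibling);
             sibling = sibling.NextSibling)
        {
            section.Add(sibling);
        }
        return section;
    }

    public static void Main()
    {
        // Small stand-in for the real help file.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<body>" +
            "<div><h1><a name=\"a\">Section A</a></h1></div>" +
            "<div>Body of A</div>" +
            "<div><h1><a name=\"b\">Section B</a></h1></div>" +
            "<div>Body of B</div>" +
            "</body>");

        foreach (HtmlNode node in GetSectionWithHeading(doc, "Section A"))
            Console.WriteLine(node.OuterHtml);
    }
}
```

Returning an empty list instead of null saves callers a null check. Whether the heading div belongs in the result depends on how the section is displayed, so drop the `section.Add(start)` line if only the body is wanted.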