c# asp.net html-parsing html-agility-pack

Html Agility Pack: retrieving paragraph (p) text until another anchor is reached


I am using Html Agility Pack, and so far I have this code:

        var linkWeb = new HtmlWeb();
        var linkDoc = linkWeb.Load(link);
        int i = 1; // running counter for the "text #" label
        foreach (HtmlNode l in linkDoc.DocumentNode.SelectNodes("//p"))
        {
            Console.WriteLine("text #" + i++ + " " + l.InnerText);
        }

This reads the text of each paragraph just fine. However, I want it to combine the text of all paragraphs until the next anchor (a) tag is reached, or to use a better method if you can think of one. The HTML looks like this:

<p>
<a href="1.shtml#Top" target="_top">PART 1</a>
CONTENT1;
CONTENT2;
</p>
<p>CONTENT3.</p>

<p>
<a href="2.shtml#Top" target="_top">PART 2</a>
CONTENT1&nbsp;
CONTENT2&nbsp;
CONTENT3&nbsp;
CONTENT4
</p>
<p>CONTENT5.</p>
<p>CONTENT6.</p>
<p>CONTENT8.</p>

<p>
<a href="3.shtml#Top" target="_top">PART 3</a>
CONTENT1&nbsp;
CONTENT2&nbsp;
CONTENT3&nbsp;
CONTENT4.
</p>

Right now, my code reads the text of each paragraph separately:

TEXT #1 is CONTENT1 CONTENT2

TEXT #2 is CONTENT3.

I want TEXT #1 to be CONTENT1 CONTENT2 CONTENT3.

The page is dynamic, and the number of paragraphs changes.

I need some kind of check so that, until the next anchor is hit, every paragraph's InnerText is read and treated as part of the same text number.


Solution

  • You could implement it like this:

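        // Needs "using System.Linq;" for Any(); i is an int counter declared beforehand.
        foreach (HtmlNode l in linkDoc.DocumentNode.SelectNodes("//p"))
        {
            // A paragraph that directly contains an anchor starts a new text block.
            if (l.ChildNodes.Any(node => node.Name == "a"))
            {
                Console.WriteLine();
                Console.Write("text #" + i++);
            }
            // Every paragraph's text is appended to the current block.
            Console.Write(l.InnerText + " ");
        }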
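
If you would rather collect the combined text than print it as you go, the same idea can be sketched as a helper that returns one string per group. This is only a minimal sketch under a few assumptions: the class and method names (AnchorGrouping, GroupParagraphs) are made up for illustration and are not part of Html Agility Pack, a group is assumed to start at any p that has an a element as a direct child, and entities such as &nbsp; are decoded and whitespace collapsed, which the original code does not do. As in the loop above, the anchor's own text (for example "PART 1") stays in the result because InnerText includes it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using HtmlAgilityPack;

public static class AnchorGrouping
{
    // Hypothetical helper (not an Html Agility Pack API): returns one combined
    // string per group, where a <p> with a direct <a> child starts a new group.
    public static List<string> GroupParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var groups = new List<string>();

        // SelectNodes returns null when nothing matches, so guard against that.
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
        {
            return groups;
        }

        StringBuilder current = null;
        foreach (HtmlNode p in paragraphs)
        {
            bool startsGroup = p.Element("a") != null;

            // Open a new group at each anchor paragraph; also open one for
            // leading paragraphs that appear before the first anchor.
            if (startsGroup || current == null)
            {
                if (current != null)
                {
                    groups.Add(current.ToString().Trim());
                }
                current = new StringBuilder();
            }

            // Decode entities such as &nbsp;, then collapse runs of whitespace.
            string text = HtmlEntity.DeEntitize(p.InnerText);
            string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            current.Append(string.Join(" ", words)).Append(' ');
        }

        // Flush the last open group.
        if (current != null)
        {
            groups.Add(current.ToString().Trim());
        }
        return groups;
    }
}
```

With a document loaded through HtmlWeb, you could pass linkDoc.DocumentNode.OuterHtml to the helper, or adapt it to take the HtmlDocument directly. For the sample markup in the question, this should give one entry per PART anchor, with the following anchor-less paragraphs folded into the entry before them.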