c# html-agility-pack

HTMLAgilityPack get InnerText but keep <a> elements


My original text is something like this:

<p>
Lorem <span><i></i>ipsum</span> <a href="">dolor</a> sit <a href="link.html">amet</a>, consectetur adipiscing <span>elit</span>.
</p>

I am trying to keep only the text and the <a> elements, so the output should look like this:

Lorem ipsum <a href="">dolor</a> sit <a href="link.html">amet</a>, consectetur adipiscing elit.

Neither

htmlDoc.DocumentNode.SelectSingleNode("//p").InnerText;

nor

htmlDoc.DocumentNode.SelectSingleNode("//p").InnerHtml;

works for this case: InnerText drops the <a> elements along with every other tag, and InnerHtml keeps all the tags. How can I achieve that?


Solution

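  • I solved this with a regex that removes every tag except opening and closing <a> tags (here input is the HTML string, for example the InnerHtml of the <p> node). I hope it helps someone in the future: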

    var output = Regex.Replace(input, @"<(?!\/?a(?=>|\s.*>))\/?.*?>", string.Empty);
    

    Don't forget to add this using directive at the top of the file:

    using System.Text.RegularExpressions;
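
For anyone who wants to try this end to end, here is a minimal sketch that wraps the regex above in a small helper. The class and method names (AnchorTagFilter, StripAllButAnchors) are made up for illustration and are not part of HtmlAgilityPack. The sketch assumes the input is an HTML fragment such as the InnerHtml of the <p> node.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class AnchorTagFilter
{
    // Matches any tag except an opening or closing <a> tag.
    // The negative lookahead rejects "<a>", "<a ...>" and "</a>",
    // while tags that merely start with "a" (e.g. <abbr>) still match.
    private static readonly Regex NonAnchorTag =
        new Regex(@"<(?!\/?a(?=>|\s.*>))\/?.*?>");

    // Returns the fragment with every non-<a> tag removed.
    // Text content and <a> elements are left untouched.
    public static string StripAllButAnchors(string html)
    {
        if (html == null)
        {
            throw new ArgumentNullException(nameof(html));
        }
        return NonAnchorTag.Replace(html, string.Empty);
    }
}
```

Called with the paragraph content from the question, for example AnchorTagFilter.StripAllButAnchors(htmlDoc.DocumentNode.SelectSingleNode("//p").InnerHtml), this should give back the text with only the two <a> elements left in place. Keep in mind that `.` does not match a newline by default, so a tag that is split across several lines is not removed. Regex-based stripping can also misfire on unusual markup, such as a `>` inside an attribute value.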