How can I replace any run of "empty p tags", "p tags containing a non-breaking space" or "p tags containing a line break" with a "single p tag containing a line break"? I assume something like HTML Agility Pack is a better solution than Regex, but I am open to suggestions.
For example the following HTML:
<p>Test</p><p>&nbsp;</p><p>&nbsp;</p><p></p><p></p><p>&nbsp;</p><p>Test 2</p>
Or the following more complex example:
<p>Test</p><p>&nbsp;</p><p><br/></p><p><p></p><br data-mce-bogus="1"></p><p></p><p>Test 2</p>
Would get replaced with the following:
<p>Test</p><p><br></p><p>Test 2</p>
So effectively anything that could cause multiple line breaks in the HTML code would get replaced with just a single line break.
The HTML can be added and edited from multiple sources (e.g. web application, iOS app, Android app) and with multiple rich text editor types, so the way line breaks are added is not necessarily consistent. Hence the need to find the different types of line break and replace each run of them with a single one.
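For comparison with the HTML Agility Pack route, the collapsing described above could be sketched with a plain regular expression. This is only a rough sketch, not a robust HTML parser: the class name `ParagraphCollapser`, the method `Collapse` and the pattern itself are illustrative assumptions, and it assumes a blank paragraph contains nothing but whitespace, `&nbsp;`, `<br>` tags or empty nested `<p></p>` pairs.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class ParagraphCollapser
{
    // One blank paragraph is a <p> whose content is only whitespace, &nbsp;,
    // <br> tags (with or without attributes) or empty nested <p></p> pairs.
    // The outer (...)+ matches a whole run of such paragraphs at once.
    private static readonly Regex BlankRun = new Regex(
        @"(?:<p\b[^>]*>(?:\s|&nbsp;|<br\b[^>]*>|<p\b[^>]*>\s*</p>)*</p>\s*)+",
        RegexOptions.IgnoreCase);

    // Replaces every run of blank paragraphs with a single <p><br></p>.
    public static string Collapse(string html)
    {
        return BlankRun.Replace(html, "<p><br></p>");
    }
}
```

Called on either of the two example inputs above, `ParagraphCollapser.Collapse(html)` should return `<p>Test</p><p><br></p><p>Test 2</p>`. Paragraphs that contain any other markup, such as an empty `<span>`, are left untouched, which is one reason a real parser may still be the safer choice.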
With a little bit of help from ChatGPT I have come up with the following code:
// Needs: using System; using System.Linq; using HtmlAgilityPack;

// Load the HTML document
var doc = new HtmlDocument();
doc.LoadHtml(value);

// Select all the p tags
var pTags = doc.DocumentNode.SelectNodes("//p");

// If no p tags found then return the value
if (pTags == null || pTags.Count <= 0)
    return value;

// Iterate p tags
for (int i = 0; i < pTags.Count; i++)
{
    // Check if current p tag
    if (pTags[i].InnerHtml.Trim() == "&nbsp;" || // Contains only a &nbsp;
        String.IsNullOrWhiteSpace(pTags[i].InnerHtml) || // Or whitespace
        (pTags[i].ChildNodes.Any(x => x.Name == "br") &&
         pTags[i].ChildNodes.Where(x => x.Name != "br")
                 .All(x => x.InnerHtml.Trim() == "&nbsp;" || String.IsNullOrWhiteSpace(x.InnerHtml)))) // Or contains only a "br" (and possibly whitespace either side)
    {
        // Change to a break
        pTags[i].InnerHtml = "<br>";
    }
    else
        continue;

    // If this is not the first p tag
    if (i > 0)
    {
        // Check if current tag and previous tag both contain a line break and if so then remove current tag
        if (pTags[i].InnerHtml == "<br>" && pTags[i - 1].InnerHtml == "<br>")
            pTags[i].Remove(); // Detach from its actual parent; DocumentNode.RemoveChild throws if the p tag is not a direct child of the document node
    }
}

// Return the modified html
return doc.DocumentNode.OuterHtml;