c# html html-agility-pack line-breaks

Replace multiple p tags containing line breaks or non-breaking spaces with a single line break using HTML Agility Pack


How can I remove multiple consecutive p tags that are empty, contain only a non-breaking space, or contain only a line break, and replace them with a single p tag containing a line break? I assume something like HTML Agility Pack is a better solution than regex, but I am open to suggestions.

For example the following HTML:

<p>Test</p><p>&nbsp;</p><p>&nbsp;</p><p></p><p></p><p>&nbsp;</p><p>Test 2</p>

Or the following more complex example:

<p>Test</p><p>&nbsp;</p><p><br/></p><p><p></p><br data-mce-bogus="1"></p><p></p><p>Test 2</p>

Would get replaced with the following:

<p>Test</p><p><br></p><p>Test 2</p>

So effectively anything that could cause multiple line breaks in the HTML code would get replaced with just a single line break.

The HTML can be added and edited from multiple sources (e.g. a web application, an iOS app, an Android app) and with several different rich text editors, so line breaks are not added consistently. Hence the need to find the different kinds of line break and replace them with a single one.



Solution

  • With a little bit of help from ChatGPT I have come up with the following code:

    // Requires: using System.Linq; and using HtmlAgilityPack;
    // Load the HTML document
    var doc = new HtmlDocument();
    doc.LoadHtml(value);
    
    // Select all the p tags
    var pTags = doc.DocumentNode.SelectNodes("//p");
    
    // If no p tags found then return the value
    if (pTags == null || pTags.Count <= 0)
        return value;
    
    // Iterate p tags
    for (int i = 0; i < pTags.Count; i++)
    {
        // Check whether the current p tag is effectively empty
        if (pTags[i].InnerHtml.Trim() == "&nbsp;" || // Contains only a &nbsp;
            String.IsNullOrWhiteSpace(pTags[i].InnerHtml) || // Or is empty or whitespace
            (pTags[i].ChildNodes.Any(x => x.Name == "br") &&
             pTags[i].ChildNodes.Where(x => x.Name != "br")
                 .All(x => x.InnerHtml.Trim() == "&nbsp;" || String.IsNullOrWhiteSpace(x.InnerHtml)))) // Or contains only "br" tags (possibly with whitespace or &nbsp; around them)
        {
            // Change to a break
            pTags[i].InnerHtml = "<br>";
        }
        else
            continue;
    
        // If this is not the first p tag
        if (i > 0)
        {
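            // If the current tag and the previous p tag both contain a line break, remove the current tag.
            // Remove() detaches the node from its own parent; RemoveChild on DocumentNode
            // only suits p tags that are direct children of the document root.
            if (pTags[i].InnerHtml == "<br>" && pTags[i - 1].InnerHtml == "<br>")
                pTags[i].Remove();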
        }
    }
    
    // Return the modified html
    return doc.DocumentNode.OuterHtml;
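
One limitation of the loop above is that `pTags[i - 1]` is the previous p tag in document order, which is not necessarily the adjacent sibling. The following is a minimal sketch of a variant that only collapses blank paragraphs that directly follow one another, assuming the HtmlAgilityPack NuGet package is referenced. The class and method names (`ParagraphCleaner`, `IsBlank`, `PreviousMeaningfulSibling`, `CollapseBlankParagraphs`) are made up for illustration and are not part of the library.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical helper class; the names are illustrative, not part of the library.
public static class ParagraphCleaner
{
    // True when a <p> holds only whitespace, &nbsp; entities, <br> tags or nested <p> tags.
    private static bool IsBlank(HtmlNode p)
    {
        // DeEntitize turns "&nbsp;" into U+00A0, which IsNullOrWhiteSpace counts as whitespace.
        bool noText = string.IsNullOrWhiteSpace(HtmlEntity.DeEntitize(p.InnerText));
        bool onlyBreaks = p.Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Element)
            .All(n => n.Name == "br" || n.Name == "p");
        return noText && onlyBreaks;
    }

    // Nearest preceding sibling that is not a whitespace-only text node, or null.
    private static HtmlNode PreviousMeaningfulSibling(HtmlNode node)
    {
        var prev = node.PreviousSibling;
        while (prev != null
               && prev.NodeType == HtmlNodeType.Text
               && string.IsNullOrWhiteSpace(prev.InnerText))
        {
            prev = prev.PreviousSibling;
        }
        return prev;
    }

    public static string CollapseBlankParagraphs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
            return html; // No p tags at all

        var keptBlanks = new HashSet<HtmlNode>(); // Blank paragraphs already normalised to <p><br></p>
        bool changed = false;

        // Iterate over a copy, because nodes are removed from the tree along the way
        foreach (var p in paragraphs.ToList())
        {
            // Skip paragraphs that are no longer attached to the tree
            if (p.ParentNode == null || !p.ParentNode.ChildNodes.Contains(p))
                continue;
            if (!IsBlank(p))
                continue;

            var prev = PreviousMeaningfulSibling(p);
            if (prev != null && keptBlanks.Contains(prev))
            {
                // Directly follows a blank paragraph that was kept: drop this one
                p.Remove();
            }
            else
            {
                // First blank paragraph of a run: normalise it to a single line break
                p.InnerHtml = "<br>";
                keptBlanks.Add(p);
            }
            changed = true;
        }

        // Return the input untouched when nothing needed collapsing
        return changed ? doc.DocumentNode.OuterHtml : html;
    }
}
```

It would be called as `var cleaned = ParagraphCleaner.CollapseBlankParagraphs(value);`. The nested p tags in the second example may be parsed differently from what a browser does, so it is worth testing this against real input from each editor before relying on it.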