c# html xml regex xhtml

Removing unclosed opening <p> tags from an XHTML document


I have a large XHTML document with lots of tags. A few unclosed opening paragraph tags are repeated unnecessarily, and I want to remove them or replace them with blank space. I just want code that identifies the unclosed paragraph tags and deletes them.

Here's a small sample to show what I mean:

<p><strong>Company Registration No.1</strong> </p>
<p><strong>Company Registration No.2</strong></p>

<p>      <!-- extra tag -->
<p>      <!-- extra tag -->

<hr/>     

<p><strong> HALL WOOD (LEEDS) LIMITED</strong><br/></p>
<p><strong>REPORT AND FINANCIAL STATEMENTS </strong></p>

Could someone please give me code for a console application that removes these unclosed paragraph tags?


Solution

  • This should work:

    using System;
    using System.Text;

    public static class XHTMLCleanerUpperThingy
    {
        private const string p = "<p>";
        private const string closingp = "</p>";

        public static string CleanUpXHTML(string xhtml)
        {
            // Edits go into the builder; all searches run on the unmodified
            // input string, so the indices stay valid throughout.
            StringBuilder builder = new StringBuilder(xhtml);

            // Visit each <p> tag exactly once.
            int current = xhtml.IndexOf(p, StringComparison.Ordinal);
            while (current != -1)
            {
                int idxofnext = xhtml.IndexOf(p, current + p.Length, StringComparison.Ordinal);
                int idxofclose = xhtml.IndexOf(closingp, current + p.Length, StringComparison.Ordinal);

                // The tag is unclosed if no </p> follows it at all, or if
                // another <p> opens before the next </p>.
                bool unclosed = idxofclose < 0
                    || (idxofnext >= 0 && idxofnext < idxofclose);

                if (unclosed)
                {
                    // Blank out the tag with spaces so the length is unchanged.
                    for (int j = 0; j < p.Length; j++)
                    {
                        builder[current + j] = ' ';
                    }
                }

                current = idxofnext;
            }

            return builder.ToString();
        }
    }
    

    I have tested it with your sample and it works. It is not a great algorithm, but it should give you a starting point. Note that it only looks for the literal text <p>, so paragraph tags that carry attributes are not detected.
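
    Since the question asks for a console application and is tagged regex, here is a rough sketch of a regex-based variant. It is a different technique from the string search above, not a drop-in for it. The names ParagraphTagCleaner, RemoveUnclosedParagraphTags and Program, and the two command-line arguments (input path, output path), are made up for illustration. It assumes paragraphs are never nested, treats an opening tag as unclosed when another opening tag (or the end of the document) comes before the next </p>, and does not try to skip tags that appear inside comments or CDATA sections.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public static class ParagraphTagCleaner
{
    // Matches <p>, <p attr="...">, and </p>. The "close" group succeeds
    // only for closing tags. <pre>, <param> etc. do not match because the
    // "p" must be followed by whitespace or ">".
    private static readonly Regex PTag =
        new Regex(@"<(?<close>/)?p(?:\s[^<>]*)?>", RegexOptions.IgnoreCase);

    public static string RemoveUnclosedParagraphTags(string xhtml)
    {
        StringBuilder builder = new StringBuilder(xhtml);
        Match pending = null; // last opening tag still waiting for its </p>

        foreach (Match m in PTag.Matches(xhtml))
        {
            if (m.Groups["close"].Success)
            {
                pending = null; // the pending paragraph is properly closed
            }
            else if (!m.Value.EndsWith("/>", StringComparison.Ordinal))
            {
                // A new opening tag while one is pending: the earlier one
                // was never closed. (Self-closing <p /> is skipped above.)
                if (pending != null) Blank(builder, pending);
                pending = m;
            }
        }

        // An opening tag left over at the end of the document is unclosed too.
        if (pending != null) Blank(builder, pending);
        return builder.ToString();
    }

    // Overwrites the matched tag with spaces, keeping all offsets unchanged.
    private static void Blank(StringBuilder builder, Match m)
    {
        for (int i = 0; i < m.Length; i++)
        {
            builder[m.Index + i] = ' ';
        }
    }
}

public static class Program
{
    public static int Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: <input.xhtml> <output.xhtml>");
            return 1;
        }

        string cleaned = ParagraphTagCleaner.RemoveUnclosedParagraphTags(
            File.ReadAllText(args[0]));
        File.WriteAllText(args[1], cleaned);
        return 0;
    }
}
```

    Blanking the tags with spaces, as in the answer above, keeps every other character at its original offset. If you would rather delete the tags outright, you could collect the marked matches and remove them from the end of the string backwards.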