Regular expression to find unclosed tag


I am trying to find an orphan ' that sits between < and >, whether the closing > is on the same line or on a later line.

I am a bit new to this. I tried a lazy search like <.*?'.*>, but I can't get it to work.

Alternatively, the search could find lines with an odd number of ' between < and >.

So in grepWin or Notepad++ it should match lines like:

<p class="quote" style=' ; dir='ltr'>

But not:

<p class="quote" style='indent' ; dir='ltr'>


Solution

  • You could use this regex to match those tags:

    <(?!(?:[^'">]|'[^']*'|"[^"]*")+>)[^>]*>
    

    It matches:

    • < : literal <
    • (?! : a negative lookahead for
    • (?:[^'">]|'[^']*'|"[^"]*")+> : one or more of the following, then a literal >
      • [^'">] : a character which is not a single or double quote or a >
      • '[^']*' : a single quoted string
      • "[^"]*" : a double quoted string
    • [^>]*> : zero or more characters other than >, followed by a >

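    The negative lookahead rejects any < that starts a properly formed tag, i.e. one where every quote is balanced up to the closing >. For the tags that remain, the last part of the regex matches from the < to the next >, which should cover the malformed tag. Because a negated character class such as [^>] also matches line breaks in most regex flavors, the match can continue onto following lines.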
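As a quick check outside grepWin or Notepad++, the pattern can be run against the two lines from the question. This is only a sketch in Python's `re` module; the helper name `find_unbalanced_tags` and the multi-line sample are made up for illustration and are not part of the original answer.

```python
import re

# The regex from the answer, copied verbatim into a raw triple-quoted string
# so that both quote characters can appear unescaped.
UNBALANCED_TAG = re.compile(r"""<(?!(?:[^'">]|'[^']*'|"[^"]*")+>)[^>]*>""")


def find_unbalanced_tags(text: str) -> list[str]:
    """Return every <...> span whose quotes do not pair up (hypothetical helper)."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return UNBALANCED_TAG.findall(text)


# The two sample lines from the question.
bad = """<p class="quote" style=' ; dir='ltr'>"""
good = """<p class="quote" style='indent' ; dir='ltr'>"""

# An invented sample where the closing > sits on the next line.
multiline = "<p style='\n dir='ltr'>"

print(find_unbalanced_tags(bad))        # the whole tag is reported
print(find_unbalanced_tags(good))       # nothing is reported
print(find_unbalanced_tags(multiline))  # the match spans the line break
```

The same pattern string should paste directly into a regex search field; only the Python quoting around it is specific to this sketch.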

    Limitations:

    1. If a malformed tag contains a > inside a properly balanced pair of quotes, the match stops at that > instead of at the tag's real closing >.
    2. If there is a < inside a pair of quotes, the regex may start a match from that <, even when the surrounding tag is well formed.
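Both limitations can be reproduced with small inputs. The sketch below again uses Python's `re` module, and the two sample tags are invented for illustration rather than taken from the question.

```python
import re

# The regex from the answer, unchanged.
UNBALANCED_TAG = re.compile(r"""<(?!(?:[^'">]|'[^']*'|"[^"]*")+>)[^>]*>""")

# Limitation 1: the tag is malformed (href has an unclosed '), but the
# balanced title attribute contains a >, so the match ends there.
truncated = UNBALANCED_TAG.findall("""<a title="x > y" href='oops>""")

# Limitation 2: the tag is well formed, but the < inside the quoted value
# is treated as the start of a tag, giving a false positive.
false_positive = UNBALANCED_TAG.findall("""<a title="a < b">""")

print(truncated)       # only the part up to the quoted > is matched
print(false_positive)  # a fragment starting at the quoted < is matched
```

If attribute values in the files being searched can contain < or >, a real HTML parser is likely a safer way to confirm each hit.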

    Regex demo on regex101