Tags: html, regex, parsing, regular-language

Regex: Negate capture group with logical or


I'm trying to use a regex to filter forbidden HTML tags out of a given string. Yes, I know I'm supposed to use a parser instead, but for this specific problem a regex is faster.

The idea is to whitelist every allowed tag (e.g. <span>, <b>, </br>) and match the forbidden ones. So far I have come up with the following expression: <\/?(?!(span|b|br)).\>

It works for single-character tags like <a>, but longer tags like <label> are not matched. I'd really appreciate some help. Thanks in advance.


Solution

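  • This regex matches every tag except the opening and closing span, br, and b tags.

    It also ignores whitelisted tags that carry attributes. The negative lookahead rejects a whitelisted name only when it is followed by optional attributes and the closing >, so tags that merely start with a whitelisted name (e.g. <body>) are still matched.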

    <\/?(?!(?:span|br|b)(?: [^>]*)?>)[^>\/]*>
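
To try the pattern out, here is a minimal sketch in Python (the language is my choice; the question does not name one). The helper names find_forbidden_tags and strip_forbidden_tags and the sample string are made up for illustration. The asker's original expression is included only to show why it stops at one-letter tag names.

```python
import re

# Pattern copied verbatim from the answer: match any tag whose name is
# not span, br, or b (whitelisted tags may carry attributes).
FORBIDDEN_TAG = re.compile(r"<\/?(?!(?:span|br|b)(?: [^>]*)?>)[^>\/]*>")

# The asker's original attempt, kept for comparison. The single `.`
# consumes exactly one character, so only one-letter names can match.
ORIGINAL_ATTEMPT = re.compile(r"<\/?(?!(span|b|br)).\>")


def find_forbidden_tags(html: str) -> list[str]:
    """Return every substring the answer's pattern flags as forbidden."""
    return FORBIDDEN_TAG.findall(html)


def strip_forbidden_tags(html: str) -> str:
    """Remove flagged tags but keep the text between them."""
    return FORBIDDEN_TAG.sub("", html)


# Hypothetical sample input mixing allowed and forbidden tags.
sample = '<span class="x">ok</span> <b>bold</b><br> <label>name</label> <a>y</a>'

print(find_forbidden_tags(sample))
# ['<label>', '</label>', '<a>', '</a>']

print(ORIGINAL_ATTEMPT.search("<label>"))
# None
```

One limitation worth knowing: because the final character class excludes `/`, a forbidden tag with a slash inside it, such as `<a href="http://example.com">` or a self-closing `<img />`, is not matched by this pattern. The match is also case-sensitive, so `<SPAN>` would be flagged unless you compile with `re.IGNORECASE`.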