javascript html regex ecmascript-5

Remove all HTML tags from an HTML body except <a>, <br>, <b> and <img>


When reading the HTML body of an email, I often find lots of HTML tags that I don't want to keep.

How can I remove, from a string in JavaScript, all HTML tags like:

<anything ...>

or

</anything>

except for the forms <x ...>, </x> and <x ... />, where x is one of:

  • a
  • br
  • b
  • img

I thought about something like:

s.replace(/<[^a].*>/g, '');

but I'm not sure how to do it.

Example:

<div id="hello">Hello</div><a href="test">Youhou</a>

should become

Hello<a href="test">Youhou</a>

Note: I'm looking for a solution of a few lines of code that works about 90% of the time (the email bodies come from my own emails, so they contain nothing malicious), not a complete solution that would require a third-party tool or library.


Solution

  • Try replacing

    <\/?(?!(a|br|b|img)\b)\w+[^>]*>
    

    with nothing.

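    <\/? Match the opening <, optionally followed by a / (so closing tags are covered too).

    (?!(a|br|b|img)\b) Negative look-ahead ensuring the tag name is not a, br, b or img. The \b word boundary requires the name to end there, so tags that merely start with one of these names, such as <body> or <article>, are still removed.

    \w+[^>]*> Match the tag name, then everything up to and including the closing >.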

    Here at regex101.
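
As a minimal sketch of how the pattern above might be applied with String.prototype.replace: the helper name stripTagsExcept is hypothetical, and the i flag is an addition not present in the original answer, on the assumption that upper-case tags such as <DIV> should be handled as well.

```javascript
// Hypothetical helper; the pattern itself is the one given in the answer.
function stripTagsExcept(html) {
  // g: replace every match, not just the first one.
  // i: assumption, so that <DIV> is removed and <BR> is kept too.
  return html.replace(/<\/?(?!(a|br|b|img)\b)\w+[^>]*>/gi, '');
}

var input = '<div id="hello">Hello</div><a href="test">Youhou</a>';
console.log(stripTagsExcept(input));
// prints: Hello<a href="test">Youhou</a>
```

In keeping with the "works most of the time" goal of the question, this sketch has known gaps: an attribute value containing > (for example title="a>b") ends the match early, and constructs that do not start with a word character after <, such as comments (<!-- ... -->), are left in place.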