Removing html tags with regex in Java

First off, I understand that it is not ideal to parse html with regex. However, I'm close to the solution I need, and I just can't quite get it right.

Say you have html input in a string and you do:

content = content.replaceAll("<[^\\P{Graph}>]+>", "");

This removes any tag made up entirely of visible ASCII characters. Tags that contain a space, tab, newline, or any other non-printable or control character are left alone.
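To make that concrete, here is a minimal sketch of how the pattern behaves on a couple of inputs. The class name `OriginalPatternDemo`, the `strip` helper, and the sample strings are made up for illustration.

```java
import java.util.regex.Pattern;

class OriginalPatternDemo {
    // [^\P{Graph}>] is a double negative: it matches any visible ASCII
    // character (\p{Graph}) other than '>'.
    static final Pattern TAG = Pattern.compile("<[^\\P{Graph}>]+>");

    // Hypothetical helper: same effect as content.replaceAll(pattern, "").
    static String strip(String content) {
        return TAG.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        // Both tags contain only visible characters, so both are removed.
        System.out.println(strip("<b>bold</b>"));            // bold
        // The opening tag contains a space, so only </a> is removed.
        System.out.println(strip("<a href=\"x\">link</a>")); // <a href="x">link
    }
}
```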

This is fine, except for the space character. I need to remove tags that look like:

<ht ml> (space somewhere in the middle)

but keep those that look like:

< html> (because this one contains a space as the FIRST character).

How can I adjust my regular expression for replaceAll() to accomplish this? Thanks for any input/suggestions.

Solution

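This should do the trick. After your negated class, add an optional, repeatable group that matches a space followed by more of the same allowed characters. The leading class still requires at least one visible character right after the opening bracket, so < html> is not matched.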

content = content.replaceAll("<[^\\P{Graph}>]+(?: [^\\P{Graph}>]*)*>", "");

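Since you're already requiring a visible character right after the opening bracket, this may suffice as well. Note that it is looser: once the first space has been seen, it accepts anything except >, including tabs and newlines.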

content = content.replaceAll("<[^\\P{Graph}>]+(?: [^>]*)?>", "");
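A quick way to compare the two patterns is to run them against a few sample tags. This is only a sketch: the class name `TagStripDemo`, the helper names `stripStrict` and `stripLoose`, and the sample inputs are assumptions for illustration, not part of the original question.

```java
import java.util.regex.Pattern;

class TagStripDemo {
    // First pattern: after each space, only visible characters are allowed.
    static final Pattern STRICT =
            Pattern.compile("<[^\\P{Graph}>]+(?: [^\\P{Graph}>]*)*>");
    // Second pattern: after the first space, anything except '>' is allowed.
    static final Pattern LOOSE =
            Pattern.compile("<[^\\P{Graph}>]+(?: [^>]*)?>");

    static String stripStrict(String content) {
        return STRICT.matcher(content).replaceAll("");
    }

    static String stripLoose(String content) {
        return LOOSE.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        String[] samples = {
            "<ht ml>x",    // space in the middle: removed by both
            "< html>x",    // space as first character: kept by both
            "<a b\tc>x"    // tab after the first space: only LOOSE removes it
        };
        for (String s : samples) {
            System.out.println("strict: [" + stripStrict(s) + "]  loose: [" + stripLoose(s) + "]");
        }
    }
}
```

The two only differ on input such as the third sample, where a tab (or newline, or other non-visible character) shows up after the first space inside the tag. If such tags should stay untouched, the first pattern is the safer choice.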