First off, I understand that it is not ideal to parse html with regex. However, I'm close to the solution I need, and I just can't quite get it right.
Say you have html input in a string and you do:
content = content.replaceAll("<[^\\P{Graph}>]+>", "");
This removes every tag made up only of visible (graph) characters. Tags that contain a space, tab, newline, or control character are left alone.
That behaviour is fine, except for the space character. I also need to remove tags that look like:
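To make that behaviour concrete, here is a minimal sketch. The class name OriginalPatternDemo, the helper strip(), and the sample inputs are made up for illustration; only the pattern comes from the question.

```java
public class OriginalPatternDemo {
    // Pattern from the question: '<', one or more visible (graph)
    // characters other than '>', then '>'.
    static final String ORIGINAL = "<[^\\P{Graph}>]+>";

    // Hypothetical helper: removes every match of ORIGINAL from the input.
    static String strip(String content) {
        return content.replaceAll(ORIGINAL, "");
    }

    public static void main(String[] args) {
        // Tags with only visible characters are removed.
        System.out.println(strip("<html>text</html>")); // prints: text
        // A space anywhere inside the tag stops the match, so these stay.
        System.out.println(strip("<ht ml>text"));       // prints: <ht ml>text
        System.out.println(strip("< html>text"));       // prints: < html>text
    }
}
```

The second case is the one that should be removed but currently is not.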
<ht ml> (space somewhere in the middle)
but keep those that look like:
< html> (because this one contains a space as the FIRST character).
How can I adjust my regular expression for replaceAll() to accomplish this? Thanks for any input/suggestions.
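This should do the trick. After your negated class, add a repeatable group that matches a space followed by more visible characters. The class itself must still match at least once right after the opening bracket, so a tag that starts with a space is left alone.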
content = content.replaceAll("<[^\\P{Graph}>]+(?: [^\\P{Graph}>]*)*>", "");
Since the characters right after the opening bracket are already checked, it may be enough to allow anything except > once the first space has been seen:
content = content.replaceAll("<[^\\P{Graph}>]+(?: [^>]*)?>", "");
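The two patterns differ in what they accept after the first space, which a small comparison may make clearer. This is only a sketch: the class name TagPatternComparison, the constants STRICT and LOOSE, the helper methods, and the sample inputs are assumptions made for illustration.

```java
public class TagPatternComparison {
    // First suggestion: only visible characters and single spaces inside the tag.
    static final String STRICT = "<[^\\P{Graph}>]+(?: [^\\P{Graph}>]*)*>";
    // Second suggestion: after the first space, anything except '>'.
    static final String LOOSE = "<[^\\P{Graph}>]+(?: [^>]*)?>";

    static String stripStrict(String s) {
        return s.replaceAll(STRICT, "");
    }

    static String stripLoose(String s) {
        return s.replaceAll(LOOSE, "");
    }

    public static void main(String[] args) {
        // Space in the middle: both patterns remove the tag.
        System.out.println(stripStrict("<ht ml>text")); // prints: text
        System.out.println(stripLoose("<ht ml>text"));  // prints: text

        // Space as the first character: both patterns keep the tag.
        System.out.println(stripStrict("< html>text")); // prints: < html>text
        System.out.println(stripLoose("< html>text"));  // prints: < html>text

        // A tab after the first space: only the looser pattern removes the tag.
        String tabbed = "<a href=\"x\"\tid=\"y\">text";
        System.out.println(stripStrict(tabbed).equals(tabbed)); // prints: true
        System.out.println(stripLoose(tabbed));                 // prints: text
    }
}
```

Which one fits depends on whether tags containing tabs, newlines, or control characters after the first space should still be kept.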