java regex string parsing replaceall

Removing html tags with regex (Java)


Say I've read html input into a string and then do:

    content = content.replaceAll("<[^>]*[^\\s>][^>]*>", "");

Right now, this removes all html tags except those that look like:

    <>

and

    < (any amount of white space) >

but I'd also like to add tags that contain non-printable characters to that list of exceptions. Is there any way I can modify the replaceAll regular expression to accomplish that? If so, how? Thanks for any input/suggestions.


Solution

  • You can use this pattern:

    <[^\\P{Graph}>]+>
    

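    \\P{Graph} is the negation of \\p{Graph}: it matches every character that is not a visible ASCII character, which covers whitespace and control characters (and, unless UNICODE_CHARACTER_CLASS is set, non-ASCII characters as well). The negated class [^\\P{Graph}>] therefore matches any visible character other than >. Note that, as written, every character between the angle brackets must be visible, so a tag that contains a space, such as <a href="x">, is left in place. To keep the behaviour of the original expression, substitute the class for [^\\s>] instead: <[^>]*[^\\P{Graph}>][^>]*>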
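A minimal sketch for trying this out, assuming a hypothetical helper class TagStripper (not part of the original answer). It compares the pattern from the answer with an assumed variant that swaps [^\\s>] in the question's expression for [^\\P{Graph}>], so that a tag is removed only if it contains at least one visible character. The sample input is made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of the original answer.
final class TagStripper {
    // Pattern from the answer: '<', one or more visible ASCII characters
    // other than '>', then '>'.
    static final Pattern ANSWER = Pattern.compile("<[^\\P{Graph}>]+>");

    // Assumed variant: the question's expression with its middle class
    // replaced, so a tag needs at least one visible character to match.
    static final Pattern VARIANT = Pattern.compile("<[^>]*[^\\P{Graph}>][^>]*>");

    static String stripWithAnswer(String content) {
        return ANSWER.matcher(content).replaceAll("");
    }

    static String stripWithVariant(String content) {
        return VARIANT.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        // Ordinary tags plus the three kinds of "tag" that should be kept:
        // empty, whitespace-only, and control-character-only.
        String html = "<p>Hi <a href=\"x\">there</a> <> <  > <\u0001></p>";
        System.out.println(stripWithAnswer(html));
        System.out.println(stripWithVariant(html));
    }
}
```

Both patterns leave <>, the whitespace-only tag and the control-character tag alone. They differ on tags that contain a space, such as the anchor tag above, which only the variant removes.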