Tags: java, html, parsing, tags, replaceall

How to remove certain html tags from a String with replaceAll?


I have a string containing various HTML tags.

I want to remove all <a> and </a> tags.

I tried:

string.replaceAll("<a>", "");
string.replaceAll("</a>", "");

But it doesn't work. Those tags still remain in the string. Why?


Solution

  • Those tags still remain in the string. Why?

    Because replaceAll doesn't modify the string in place (it can't; Java strings are immutable). Instead, it returns a new string with the replacements applied, so you need to assign the result:

    string = string.replaceAll("<a>", "");
    string = string.replaceAll("</a>", "");
    


    Or, as a single chained call:

    string = string.replaceAll("<a>", "").replaceAll("</a>", "");
    

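    Note that replaceAll takes a string defining a regular expression as its first argument. "<a>" and "</a>" are both fine as patterns, because none of their characters has special meaning in a regular expression. Still, unless you actually need a regular expression, use replace(CharSequence, CharSequence) instead: it treats both arguments literally and also replaces every occurrence. If you do use replaceAll, be aware of the characters with special meaning in regular expressions (and of $ and \ in the replacement string).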

    In fact, since replaceAll takes a regular expression anyway, you can do it with a single call:

    string = string.replaceAll("</?a>", "");
    

    The ? after the / makes the / optional, so the pattern matches both "<a>" and "</a>".
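To see both points together (the discarded return value and the single-pattern fix), here is a minimal runnable sketch. The class name StripAnchorTags, the helper stripAnchorTags, and the sample string are made up for illustration; they are not from the question.

```java
public class StripAnchorTags {

    // Hypothetical helper: removes bare <a> and </a> tags in one pass.
    static String stripAnchorTags(String html) {
        // "</?a>" matches "<", then an optional "/", then "a>".
        return html.replaceAll("</?a>", "");
    }

    public static void main(String[] args) {
        String string = "<p>See <a>this page</a> for details.</p>";

        // The return value is discarded here, so 'string' is unchanged.
        string.replaceAll("</?a>", "");
        System.out.println(string); // tags are still present

        // Assigning the result is what actually "updates" the variable.
        string = stripAnchorTags(string);
        System.out.println(string); // <p>See this page for details.</p>
    }
}
```

Keep in mind that this pattern only matches tags written exactly as <a> or </a>; an opening tag that carries attributes would not match it.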

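On the earlier note about preferring replace when no regular expression is needed, the difference is easiest to see with a character that is special in a regex. The sketch below is illustrative only; the class name LiteralReplaceDemo, its helper, and the sample inputs are assumptions, not part of the original answer.

```java
import java.util.regex.Pattern;

public class LiteralReplaceDemo {

    // Hypothetical helper: literal removal of <a> and </a>, no regex involved.
    static String stripAnchorTagsLiteral(String html) {
        return html.replace("<a>", "").replace("</a>", "");
    }

    public static void main(String[] args) {
        String numbers = "1.5 + 2.5";

        // In a regex, "." matches any character, so everything is removed.
        System.out.println("[" + numbers.replaceAll(".", "") + "]"); // []

        // replace treats "." literally and still replaces every occurrence.
        System.out.println(numbers.replace(".", "")); // 15 + 25

        // If replaceAll is required, Pattern.quote makes the pattern literal.
        System.out.println(numbers.replaceAll(Pattern.quote("."), "")); // 15 + 25

        System.out.println(stripAnchorTagsLiteral("<a>link</a>")); // link
    }
}
```

For the tags in the question either method gives the same result; replace simply avoids having to think about regex metacharacters.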