regex apex-code

Regex pattern to remove HTML tags from a given text string


I am trying to extract the text from the HTML snippet below. I need help with a regex pattern that will remove all the HTML tags and leave only the content.

I tried to remove the <span ...> tags using the expression below, but that didn't do the trick.

 String x = '<span style="font-size:11pt;"><span style="line-height:107%;"><span style="font-family:Calibri, sans-serif;"><strong><font color="#000000">Some normal text here...</font></strong></span></span></span>';
 String y = x.replaceAll('[<span*\b>]','');
 system.debug(y);

This prints out:

  tyle="fot-ize:11t;" tyle="lie-height:107%;" tyle="fot-fmily:Clibri, -erif;"trogfot color="#000000"Some normal text here.../fot/trog///

So it basically removed each of those characters individually, rather than the whole <span ... > tag.

Need Help


Solution

  • The second line of code should be:

    String y = x.replaceAll('<span[^>]*>','');
    

    The meaning of this statement is: for every occurrence of '<span' followed by zero or more occurrences (*) of any character except '>' ([^>]), followed by a single '>', replace the match with nothing.

    By the way, this pattern does not match the closing tag </span>, so those will be left in the string. I mention it just for your information, because you didn't ask about it.
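If the closing tags should go too, the same idea can be extended. The following is only a sketch, assuming the goal is plain text and the markup is as simple as in the question; the shortened sample string and the variable names (charClass, noSpans, textOnly) are illustrative and not from the original post. It also shows why the original attempt behaved as it did: square brackets define a character class, so each listed character is removed on its own.

```apex
// Square brackets match any ONE of the listed characters,
// so every '<', 's', 'p', 'a', 'n' and '>' is removed individually.
String charClass = '<span>snap</span>'.replaceAll('[<span>]', '');
System.debug(charClass); // only the '/' survives

// Shortened version of the string from the question (illustrative).
String x = '<span style="font-size:11pt;"><strong>Some normal text here...</strong></span>';

// The optional '/' (written '/?') makes the pattern match </span> as well as <span ...>.
String noSpans = x.replaceAll('</?span[^>]*>', '');
System.debug(noSpans); // <strong>Some normal text here...</strong>

// Dropping the tag name removes every tag: '<', anything but '>', then '>'.
String textOnly = x.replaceAll('<[^>]*>', '');
System.debug(textOnly); // Some normal text here...
```

These patterns assume that no attribute value contains a '>' character, and they do not decode entities such as &amp;. Apex also provides String.stripHtmlTags(), which may be a simpler option when all markup has to be removed.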