Removing sections of a large Java String (it contains HTML source)

In my application I'm loading the HTML source of pages into a String. Within this HTML, I want to remove certain pieces of content that are between specific HTML comments.

For example:

//the entire String will be HTML source like this, of the entire page
<div id="someid">
    <a href="#">Some text</a>
    <!-- this_tag_start 123 -->
    <p> This text between the tags to be removed </p>
    <!-- this_tag_end 123 -->
    <a href="#">Some text</a>
</div>

The this_tag_start 123 comment and its corresponding this_tag_end comment are generated by our server. The number (123 here) varies.

In my program I have a String containing the entire HTML source. I want to remove the text between those two comment tags (it doesn't matter whether the comment tags themselves remain). These pairs of HTML comments can appear multiple times throughout the HTML source.

Right now I'm using this regex to remove the content:

htmlString = htmlString.replaceAll(
    "<!-- this_tag_start(.*?)<!-- this_tag_end[\\s\\d]+-->",""
    );

This works and correctly removes these comment tags and the content between the start and end tags. However, it doesn't feel like it's an elegant solution. There should be a better/faster way to do it, right?

If it matters, the String is generated by WebDriver's getPageSource() method.

Solution

1. Elegance

However, it doesn't feel like it's an elegant solution.

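Here are two variations of the original regex. Both start with the (?s) flag (DOTALL) so that . also matches line breaks, which the content between the two comments will usually contain.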

Variation 1

(?s)\s*<!-- this_tag_start([\s\d]+)-->.+?<!-- this_tag_end\1-->\s*

This variation uses a backreference (\1) for the id, so the end comment must carry exactly the same id as the start comment. One drawback I see is that it allows an id consisting only of whitespace. As long as you control how the comments are generated, this is not a concern.
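As a rough illustration of Variation 1, the sketch below applies it to a few inputs. The class name Variation1Demo and the helper strip are hypothetical names chosen for this example. Only the regex comes from the answer.

```java
import java.util.regex.Pattern;

public class Variation1Demo {
    // Variation 1: group 1 captures the run of whitespace and digits after
    // this_tag_start, and \1 requires the identical run after this_tag_end.
    static final Pattern VARIATION_1 = Pattern.compile(
        "(?s)\\s*<!-- this_tag_start([\\s\\d]+)-->.+?<!-- this_tag_end\\1-->\\s*");

    // Hypothetical helper: removes every tagged region from the given HTML.
    static String strip(String html) {
        return VARIATION_1.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Matching ids: the comments and everything between them are removed.
        System.out.println(strip(
            "<a>1</a>\n  <!-- this_tag_start 123 -->\n  <p>x</p>\n  <!-- this_tag_end 123 -->\n  <a>2</a>"));
        // prints: <a>1</a><a>2</a>

        // Mismatched ids: the backreference fails, so nothing is removed.
        String mismatched = "<!-- this_tag_start 1 -->a<!-- this_tag_end 2 -->";
        System.out.println(strip(mismatched).equals(mismatched)); // prints: true

        // The drawback mentioned above: an "id" made of a single space still matches.
        System.out.println(strip("a<!-- this_tag_start -->x<!-- this_tag_end -->b")); // prints: ab
    }
}
```

Note that the leading and trailing \s* also swallow the line breaks and indentation around the removed region, which is why the two links end up on the same line in the first case.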

Variation 2

(?s)\s*<!-- this_tag_start\s+(\d+)\s*-->.+?<!-- this_tag_end\s+\1\s*-->\s*

This variation again uses a backreference for the id. However, it is more explicit about the expected format: one or more whitespace characters, then one or more digits, then zero or more whitespace characters.
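To show how Variation 2 differs in practice, here is a small sketch. The class name Variation2Demo and the helper strip are hypothetical; the regex is the one given above. Because only the digits are captured, the spacing around the id may differ between the start and end comments, and a marker without digits is not treated as a match.

```java
import java.util.regex.Pattern;

public class Variation2Demo {
    // Variation 2: group 1 captures only the digits of the id.
    static final Pattern VARIATION_2 = Pattern.compile(
        "(?s)\\s*<!-- this_tag_start\\s+(\\d+)\\s*-->.+?<!-- this_tag_end\\s+\\1\\s*-->\\s*");

    // Hypothetical helper: removes every tagged region from the given HTML.
    static String strip(String html) {
        return VARIATION_2.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Uneven spacing around the id is tolerated.
        System.out.println(strip("a<!-- this_tag_start  7-->x<!-- this_tag_end 7 -->b")); // prints: ab

        // A marker without digits is left untouched.
        String noId = "a<!-- this_tag_start -->x<!-- this_tag_end -->b";
        System.out.println(strip(noId).equals(noId)); // prints: true

        // Several regions with different ids are removed independently.
        System.out.println(strip(
            "<!-- this_tag_start 1 -->a<!-- this_tag_end 1 -->keep"
          + "<!-- this_tag_start 2 -->b<!-- this_tag_end 2 -->")); // prints: keep
    }
}
```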

2. Speed

There should be a better/faster way to do it, right?

Internally, String#replaceAll calls Pattern#compile on every invocation, and pattern compilation is comparatively expensive.

I would cache the result of the compilation for faster replacements. Here is how to do it:

import java.util.regex.Pattern;

public class MyCrawler {
   // Compile once, run multiple times. A Pattern is immutable and safe to share.
   private static final Pattern COMMENT_REMOVER = Pattern.compile("the regex here...");

   public void doCrawl() {
      String htmlString = loadHtmlSource();

      // A Matcher is not thread-safe, so create one per call instead of sharing it.
      htmlString = COMMENT_REMOVER.matcher(htmlString).replaceAll("");
   }

   ...
}
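For completeness, here is a self-contained sketch of the caching idea that can be run as-is. It assumes Variation 2 as the regex, and the names CachedPatternDemo and stripTaggedRegions are hypothetical. It caches the compiled Pattern and creates a Matcher per call, since Pattern instances are immutable and safe to share between threads while Matcher instances are not.

```java
import java.util.regex.Pattern;

public class CachedPatternDemo {
    // Compiled once when the class is loaded, then reused for every page.
    // Assumption: Variation 2 is the regex you settled on.
    private static final Pattern TAGGED_REGION = Pattern.compile(
        "(?s)\\s*<!-- this_tag_start\\s+(\\d+)\\s*-->.+?<!-- this_tag_end\\s+\\1\\s*-->\\s*");

    // Hypothetical helper: same result as html.replaceAll(regex, ""),
    // but without recompiling the regex on each call.
    static String stripTaggedRegions(String html) {
        return TAGGED_REGION.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Stand-ins for page sources, e.g. as returned by getPageSource().
        String[] pages = {
            "<p>a</p><!-- this_tag_start 5 -->x<!-- this_tag_end 5 --><p>b</p>",
            "<div>no tagged region here</div>"
        };
        for (String page : pages) {
            System.out.println(stripTaggedRegions(page));
        }
    }
}
```

Whether the saving is noticeable depends on how many pages are processed. For a handful of calls, the plain String#replaceAll version is likely fine.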