In my application I'm loading the HTML source of pages into a String. Within this HTML, I want to remove certain pieces of content that are between specific HTML comments.
For example:
//the entire String will be HTML source like this, of the entire page
<div id="someid">
<a href="#">Some text</a>
<!-- this_tag_start 123 -->
<p> This text between the tags to be removed </p>
<!-- this_tag_end 123 -->
<a href="#">Some text</a>
</div>
The this_tag_start 123 comment and the corresponding "end" one are generated by our server. The 123 number will vary.
In my program I have a String containing the entire HTML source. I want to remove the text between those two comment tags (it doesn't matter whether the comment tags themselves remain or not). These HTML comment pairs can appear multiple times throughout the HTML source.
Right now I'm using this regex to remove the content:
htmlString = htmlString.replaceAll(
"<!-- this_tag_start(.*?)<!-- this_tag_end[\\s\\d]+-->",""
);
This works and correctly removes these comment tags and the content between the start and end tags. However, it doesn't feel like it's an elegant solution. There should be a better/faster way to do it, right?
If it matters, the String is generated by WebDriver's getPageSource() method.
"However, it doesn't feel like it's an elegant solution."
Here are two variations of the original regex:
(?s)\s*<!-- this_tag_start([\s\d]+)-->.+?<!-- this_tag_end\1-->\s*
This variation uses a backreference for the id, so the end comment must carry the same id as the start comment. One drawback I see is that it allows an id to consist of whitespace only. As long as you control the comments, this is not a concern.
(?s)\s*<!-- this_tag_start\s+(\d+)\s*-->.+?<!-- this_tag_end\s+\1\s*-->\s*
This variation again uses a backreference for the id. However, it is more explicit about the expected form of the id: one or more whitespace characters, then one or more digits, followed by zero or more whitespace characters.
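As a rough sketch of how the second variation might be used from Java, here it is escaped as a string literal (every backslash doubled). The class name CommentBlockDemo and the method strip are made up for illustration and are not part of the original code:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class CommentBlockDemo {
    // Second variation from above, escaped for a Java string literal.
    static final String REGEX =
        "(?s)\\s*<!-- this_tag_start\\s+(\\d+)\\s*-->.+?<!-- this_tag_end\\s+\\1\\s*-->\\s*";

    // Removes every start/end comment pair with matching ids, plus the content
    // between them and any whitespace directly around the pair.
    static String strip(String html) {
        return html.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        String html = "<div id=\"someid\">\n"
            + "<a href=\"#\">Some text</a>\n"
            + "<!-- this_tag_start 123 -->\n"
            + "<p> This text between the tags to be removed </p>\n"
            + "<!-- this_tag_end 123 -->\n"
            + "<a href=\"#\">Some text</a>\n"
            + "</div>";
        System.out.println(strip(html));
    }
}
```

Because of the backreference, a start comment whose id differs from the id of the next end comment is left untouched. Note also that the leading and trailing \s* swallow the line breaks around the removed block.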
"There should be a better/faster way to do it, right?"
Internally, the String#replaceAll method calls Pattern#compile on every invocation. Pattern compilation is known to be relatively slow.
I would cache the result of the compilation for faster replacements. Here is how to do it:
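import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MyCrawler {
    // Compile once, run multiple times.
    // Note: a Matcher is not thread-safe. If doCrawl() can run concurrently,
    // cache the Pattern instead and call matcher(htmlString) on each use.
    private static final Matcher COMMENT_REMOVER = Pattern.compile("the regex here...").matcher("");

    public void doCrawl() {
        String htmlString = loadHtmlSource();
        // reset() points the cached Matcher at the new input
        htmlString = COMMENT_REMOVER.reset(htmlString).replaceAll("");
    }
    ...
}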
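A shared Matcher holds mutable state, so reusing one across threads is not safe. If the crawler might run concurrently, one possible alternative is to cache the compiled Pattern (which is immutable) and create a Matcher per call. The sketch below assumes the second regex variation; the names CachedCommentRemover and removeBlocks are hypothetical:

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative only.
class CachedCommentRemover {
    // Compiled once at class load; Pattern instances are immutable and thread-safe.
    private static final Pattern COMMENT_BLOCK = Pattern.compile(
        "(?s)\\s*<!-- this_tag_start\\s+(\\d+)\\s*-->.+?<!-- this_tag_end\\s+\\1\\s*-->\\s*");

    // A fresh Matcher is created on each call, so no mutable state is shared.
    static String removeBlocks(String html) {
        return COMMENT_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<a>1</a><!-- this_tag_start 7 --><p>x</p><!-- this_tag_end 7 --><a>2</a>";
        System.out.println(removeBlocks(html));
    }
}
```

Creating a Matcher is cheap compared with compiling the Pattern, so this keeps most of the benefit of caching while avoiding shared mutable state.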