java, regex, string, stack-overflow

Regex patterns causing StackOverflowError


I'm working on a project in Java 8 where I'd like to get an HTML file from either a directory or a link, remove all style and script tags from the file, and return what is left. This is performed iteratively on a very large number of files.

Right now these are the two different regex patterns I'm using to remove the specified tags.

//remove style tags and style tag content
update = update.replaceAll("<style\\b[^<]*(?:(?!</style>)<[^<]*)*</style>", "");

//remove script tags and script tag content
update = update.replaceAll("<script[\\s\\S]*?>[\\s\\S]*?</script>", "");

This works for a while, but occasionally I hit a java.lang.StackOverflowError.

I believe this happens when the file is too large. From some research, I found that it can happen when a pattern uses "|", because that operator is implemented with recursion, which can be memory intensive depending on how many levels are traversed.

I've managed to run these patterns iteratively on different test files thousands of times.

My question is: does anyone see anything in these patterns that would use recursion, or anything else suggesting that the pattern itself is what's causing the overflow?

If not, perhaps there's a way for me to reduce the string to a size that wouldn't cause this overflow.

Judging by print statements, the overflow seems to happen when trying to match this pattern:

"<script[\\s\\S]*?>[\\s\\S]*?</script>"

Additionally, I was told I could use this instead:

"<script[\\s\\S]+?>[\\s\\S]+?</script>"

because it doesn't look ahead as far. This pattern works in Regexr but did not give the same output once implemented in the Java application.
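One possible reason for the differing output is that `+?` requires at least one character where `*?` allows none, so a script element with an empty body (such as an external `src` include) can no longer close at its own `</script>`. The sketch below illustrates this on a made-up input; the class name `ScriptQuantifierDemo`, the helper `strip`, and the sample HTML are assumptions for illustration and are not taken from the original application.

```java
import java.util.regex.Pattern;

class ScriptQuantifierDemo {
    // Pattern from the question: zero or more characters, matched lazily.
    static final Pattern STAR =
            Pattern.compile("<script[\\s\\S]*?>[\\s\\S]*?</script>");
    // Suggested variant: one or more characters, matched lazily.
    static final Pattern PLUS =
            Pattern.compile("<script[\\s\\S]+?>[\\s\\S]+?</script>");

    // Removes every match of the given pattern from the input.
    static String strip(Pattern p, String html) {
        return p.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical input: the first script element has an empty body.
        String html =
                "<script src=\"a.js\"></script><p>keep</p><script>x()</script>";

        // With *?, each script element is removed on its own.
        System.out.println("[" + strip(STAR, html) + "]"); // prints [<p>keep</p>]

        // With +?, the empty body cannot match, so the first match runs on
        // to the second </script> and swallows the paragraph in between.
        System.out.println("[" + strip(PLUS, html) + "]"); // prints []
    }
}
```

If the real files contain empty script elements, this would account for content going missing with the `+?` variant.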

Here is the stack trace I receive:

Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$Curly.match0(Pattern.java:4252)
at java.util.regex.Pattern$Curly.match(Pattern.java:4236)
at java.util.regex.Pattern$BmpCharProperty.match(Pattern.java:3800)
at java.util.regex.Pattern$Neg.match(Pattern.java:5099)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4660)
at java.util.regex.Pattern$Loop.match(Pattern.java:4787)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4719)
at java.util.regex.Pattern$Curly.match0(Pattern.java:4274)

I'm open to any and all advice. Thank you in advance.


Solution

  • I ended up using a combination of both answers, from VGR and MatthewGreen. Re2j solved my regex problem and improved the performance of the matching. Ultimately, though, I decided to depend less on regex for this: I now use JSoup for parsing, and regex only to extract what I want from the document after the unwanted elements have been removed.
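
For readers who cannot add a dependency, the same idea of leaning less on regex can be sketched with plain index scanning. This is not the Re2j or JSoup approach described above, just a minimal stand-in: it uses a loop instead of the regex engine, so the call stack does not grow with the size of the input. The names `TagStripper`, `stripElement`, and `indexOfIgnoreCase` are hypothetical, and the sketch deliberately ignores HTML details such as comments, CDATA sections, and closing tags written with whitespace (`</script >`).

```java
class TagStripper {
    // Case-insensitive indexOf that never changes the string's length,
    // so the returned index is valid for the original input.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        int last = s.length() - needle.length();
        for (int i = Math.max(from, 0); i <= last; i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Removes every <tag ...>...</tag> block, including its content.
    static String stripElement(String html, String tag) {
        String open = "<" + tag;
        String close = "</" + tag + ">";
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (true) {
            int start = indexOfIgnoreCase(html, open, pos);
            if (start < 0) {
                break; // no further opening tag
            }
            int after = start + open.length();
            // Skip longer names that merely begin with the tag, e.g. <styles>.
            if (after < html.length()) {
                char c = html.charAt(after);
                if (Character.isLetterOrDigit(c) || c == '-') {
                    out.append(html, pos, after);
                    pos = after;
                    continue;
                }
            }
            int end = indexOfIgnoreCase(html, close, after);
            if (end < 0) {
                break; // unclosed element: leave the remainder untouched
            }
            out.append(html, pos, start); // keep the text before the element
            pos = end + close.length();   // resume after the closing tag
        }
        out.append(html, pos, html.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>a</p><SCRIPT type=\"x\">var s = 1 < 2;</script><p>b</p>";
        String update = stripElement(stripElement(html, "style"), "script");
        System.out.println(update); // prints <p>a</p><p>b</p>
    }
}
```

A real HTML parser remains the more robust choice for messy input; the scan above only covers the simple open/close case that the two original patterns were aiming at.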