In trying to match all multiline comments in a Java source file I run into a StackOverflow() error. It happens when the matched comment is pretty large. I've managed to more or less pinpoint the limit to 2500 characters, but this might be specific to my environment.
I'm using the following expression to match the comments:
/<comment:((\/\*([^*]|[\r\n]|(\*+([^*\/]|[\r\n])))*\*+\/))+>/mi
Is there some limit to the size of the match I should be aware of, or is there a flaw in my regex?
My stacktrace is:
|project://Sevo1/src/Volume.rsc|(985,32,<53,12>,<53,44>): StackOverflow()
at countLines(|project://Sevo1/src/Volume.rsc|(985,33,<53,12>,<53,45>))
at $root$(|prompt:///|(0,73,<1,0>,<1,73>))
Your regex is not optimal as it contains a *-quantified capturing group whose alternatives can match at the same locations inside the string. [^*] matches any char but *, so it already matches line breaks, and then you have [\r\n] that also matches line breaks. Note that the chunks of text you match are mostly 1 char long (except for the * chunks matched with (\*+([^*\/]|[\r\n]))), so the group repeats roughly once per character of the comment, and the regex engine just does not seem to cope with that task well here.
Nested quantifiers are only good when each iteration matches a longer chunk in one go. Rewrite the pattern as
/<comment:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/>/
and it will be more efficient.
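As a rough illustration of the rewritten pattern on a large comment, here is a minimal sketch using Python's re module rather than Rascal. The Rascal <comment:...> binding is swapped for a Python named group, and the names COMMENT, find_comments, big, and source are made up for this example. It only shows that the pattern matches what it should; it does not reproduce Rascal's StackOverflow().

```python
import re

# Unrolled pattern from the answer. The Rascal <comment:...> binding is
# replaced by a Python named group (an assumption for this sketch).
COMMENT = re.compile(r"(?P<comment>/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)")


def find_comments(source):
    """Return every /* ... */ comment found in source, in order."""
    return [m.group("comment") for m in COMMENT.finditer(source)]


# A comment well beyond the ~2500 characters reported in the question.
big = "/*" + " * line of text\n" * 500 + " */"
source = "int a; /* short **/ int b;\n" + big + "\nint c; // not matched"

comments = find_comments(source)
```

Each pass through the (?:...)* group now consumes a whole run of text up to the next asterisk run instead of a single character, so the number of iterations grows with the number of * runs in the comment, not with its length.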
Details
\/\* - a /* substring
[^*]*\*+ - 0+ characters other than * followed with one or more literal *
(?:[^\/*][^*]*\*+)* - 0+ sequences of:
    [^\/*][^*]*\*+ - not a / or * (matched with [^\/*]) followed with 0+ non-asterisk characters ([^*]*) followed with one or more asterisks (\*+)
\/ - closing /
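The breakdown above can also be written as a commented pattern. This is a sketch in Python's re.VERBOSE mode, not Rascal syntax; the names COMMENT_VERBOSE, matching, and non_matching are hypothetical and only serve the example.

```python
import re

# Same pattern as in the answer, one component per line (Python verbose mode).
COMMENT_VERBOSE = re.compile(r"""
    /\*              # opening /*
    [^*]*\*+         # 0+ non-asterisk characters, then 1+ asterisks
    (?:              # 0+ repetitions of:
        [^/*]        #   one character that is neither / nor *
        [^*]*        #   0+ non-asterisk characters
        \*+          #   1+ asterisks
    )*
    /                # closing /
""", re.VERBOSE)

# Strings that are complete comments, and strings that are not.
matching = ["/**/", "/* a */", "/* a * b ** c **/", "/* multi\nline */"]
non_matching = ["/*/", "/* unterminated *", "// line comment"]

all_match = all(COMMENT_VERBOSE.fullmatch(s) for s in matching)
none_match = not any(COMMENT_VERBOSE.search(s) for s in non_matching)

# The match ends at the first */, so trailing text is left alone.
first = COMMENT_VERBOSE.search("/* a */ x */").group(0)
```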