I want to match strings enclosed in triple "
-quotes which may contain line breaks, and which don't contain any """
-substrings except at the very beginning and in the very end.
Valid example:
"""foo
bar "baz" blah"""
Invalid example:
"""foo bar """ baz"""
I tried using the following regex (as Java String
literal):
"(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\""
and it seems to work on short examples. However, on longer examples, like on a string consisting of thousand lines with hello world
, it gives me a StackOverflowError
.
Scala snippet to reproduce the error
import java.util.regex.{Pattern, Matcher}
val text = "\"" * 3 + "hello world \n" * 1000 + "\"" * 3
val p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"")
println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt())
println(p.matcher(text).lookingAt())
(note: test locally, Scastie times out; or maybe reduce 1000 to smaller number?).
Java snippet that produces the same error
import java.util.regex.Pattern;
import java.util.regex.Matcher;
class RegexOverflowMain {
public static void main(String[] args) {
StringBuilder bldr = new StringBuilder();
bldr.append("\"\"\"");
for (int i = 0; i < 1000; i++) {
bldr.append("hello world \n");
}
bldr.append("\"\"\"");
String text = bldr.toString();
Pattern p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"");
System.out.println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt());
System.out.println(p.matcher(text).lookingAt());
}
}
Question
Any idea how to make this "stack safe", i.e. can someone find a regex that accepts the same language, but does not produce a StackOverflowError
when fed to the Java regex API?
I don't care whether the solution is in Scala or Java (or whatever), as long the same underlying Java regex library is used.
Solution using a negative look-ahead to basically find a string that starts with """
and end with """
and contains content that does not include """
As Plain regex: ^"""((?!""")[\s\S])*"""$
As Java escaped regex: "^\"\"\"((?!\"\"\")[\\s\\S])*\"\"\"$
"
\s\S
includes line-break (its basically .
+ line-break or .
with single line flag)
This should be used without the multiline flag so that ^
and $
match the start and end of the string and not the start and end of the line
otherwise this:
""" ab
"""abc"""
abc """
would match
Also i used this as reference for how to exclude the """
: Regular expression to match a line that doesn't contain a word?