java regex scala stack-overflow

Regex for multiline string literals produces `StackOverflowError`


I want to match strings enclosed in triple "-quotes that may contain line breaks and that don't contain any """ substring except at the very beginning and at the very end.

Valid example:

"""foo
bar "baz" blah"""

Invalid example:

"""foo bar """ baz"""

I tried using the following regex (as Java String literal):

"(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\""

and it seems to work on short examples. However, on longer inputs, such as a string consisting of a thousand lines of hello world, it gives me a StackOverflowError.

Scala snippet to reproduce the error

import java.util.regex.{Pattern, Matcher}

val text = "\"" * 3 + "hello world \n" * 1000 + "\"" * 3
val p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"")
println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt())
println(p.matcher(text).lookingAt())

(Note: test this locally, because Scastie times out; or maybe reduce 1000 to a smaller number?)

Java snippet that produces the same error

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class RegexOverflowMain {
  public static void main(String[] args) {
    StringBuilder bldr = new StringBuilder();
    bldr.append("\"\"\"");
    for (int i = 0; i < 1000; i++) {
      bldr.append("hello world \n");
    }
    bldr.append("\"\"\"");
    String text = bldr.toString();
    Pattern p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"");
    System.out.println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt());
    System.out.println(p.matcher(text).lookingAt());
  }
}

Question

Any idea how to make this "stack safe"? That is, can someone find a regex that accepts the same language but does not produce a StackOverflowError when fed to the Java regex API?

I don't care whether the solution is in Scala or Java (or whatever), as long as the same underlying Java regex library is used.


Solution

  • A solution using a negative look-ahead: find a string that starts with """, ends with """, and whose content does not include """.

    As Plain regex: ^"""((?!""")[\s\S])*"""$

    As Java escaped regex: "^\"\"\"((?!\"\"\")[\\s\\S])*\"\"\"$"

    [\s\S] matches any character, including line breaks (it's basically . plus line breaks, or . with the single-line flag).

    This should be used without the multiline flag, so that ^ and $ match the start and end of the whole string and not the start and end of each line;

    otherwise this:

    """ ab """abc""" abc """

    would match

    I also used this as a reference for how to exclude the """: "Regular expression to match a line that doesn't contain a word?"
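
    To check the regex above against the inputs from the question, here is a minimal Java sketch. The class name StackSafeTripleQuoteDemo and the helpers isTripleQuoted and longInput are made up for illustration and are not part of any library. Because the pattern is anchored with ^ and $, the sketch uses matches() on the whole input instead of lookingAt().

```java
import java.util.regex.Pattern;

class StackSafeTripleQuoteDemo {
  // The Java-escaped regex from the answer above, compiled WITHOUT the multiline flag.
  static final Pattern TRIPLE_QUOTED =
      Pattern.compile("^\"\"\"((?!\"\"\")[\\s\\S])*\"\"\"$");

  // Hypothetical helper: true if the whole input is a single triple-quoted literal.
  static boolean isTripleQuoted(String input) {
    return TRIPLE_QUOTED.matcher(input).matches();
  }

  // Rebuilds the long input from the question: 1000 "hello world" lines in triple quotes.
  static String longInput() {
    StringBuilder bldr = new StringBuilder("\"\"\"");
    for (int i = 0; i < 1000; i++) {
      bldr.append("hello world \n");
    }
    return bldr.append("\"\"\"").toString();
  }

  public static void main(String[] args) {
    // Valid example from the question (contains a line break and single quotes).
    System.out.println(isTripleQuoted("\"\"\"foo\nbar \"baz\" blah\"\"\""));
    // Invalid example from the question (contains """ in the middle).
    System.out.println(isTripleQuoted("\"\"\"foo bar \"\"\" baz\"\"\""));
    // Long input that overflowed the stack with the original alternation-based regex.
    System.out.println(isTripleQuoted(longInput()));
  }
}
```

    With the two short examples this should print true and then false. For the long input I would expect true instead of a StackOverflowError, since the repeated group no longer contains an alternation, but it is worth running on your own JVM to confirm.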