
Java Pattern causes stack overflow


I am using a regular expression to extract key-value pairs from arbitrarily long input strings and have run into a case in which, for a long string with repetitive patterns, it causes a stack overflow.

The KV-parsing code looks something like this:

public static void parse(String input)
{
    String KV_REGEX = "((?:\"[^\"^ ]*\"|[^=,^ ])*) *= *((?:\"[^\"]*\"|[^=,^\\)^ ])*)";
    Pattern KV_PATTERN = Pattern.compile(KV_REGEX);

    Matcher matcher = KV_PATTERN.matcher(input);

    System.out.println("\nMatcher groups discovered:");

    while (matcher.find())
    {
        System.out.println(matcher.group(1) + ", " + matcher.group(2));
    }
}

Some fictitious example inputs:

    String input1 = "2012-08-09 09:10:25,521 INFO com.a.package.SomeClass - Everything working fine {name=CentOS, family=Linux, category=OS, version=2.6.x}";
    String input2 = "2012-08-09 blah blah 09:12:38,462 Log for the main thread, PID=5872, version=\"7.1.8.x\", build=1234567, other=done";

Calling parse(input1) produces:

{name, CentOS
family, Linux
category, OS
version, 2.6.x}

Calling parse(input2) produces:

PID, 5872
version, "7.1.8.x"
build, 1234567
other, done

This is fine (even with a bit of string processing required for the first case). However, when parsing a very long classpath string (over 1,000 characters), the aforementioned stack overflow occurs. The exception begins as follows:

Exception in thread "main" java.lang.StackOverflowError
    at java.util.regex.Pattern$BitClass.isSatisfiedBy(Pattern.java:2927)
    at java.util.regex.Pattern$8.isSatisfiedBy(Pattern.java:4783)
    at java.util.regex.Pattern$8.isSatisfiedBy(Pattern.java:4783)
    at java.util.regex.Pattern$8.isSatisfiedBy(Pattern.java:4783)
    at java.util.regex.Pattern$8.isSatisfiedBy(Pattern.java:4783)
    at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
    ...

The string is too long to put here, but it has the following, easily reproducible and repetitive structure:

java.class.path=/opt/files/any:/opt/files/any:/opt/files/any:/opt/files/any

Anyone who wants to reproduce the issue just needs to append :/opt/files/any a few dozen times to the above string. After creating a string with about 90 copies of ":/opt/files/any" present in the classpath string, the stack overflow occurs.
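
For convenience, the reproduction string can be generated rather than typed. The following is only a minimal sketch: the class name ClasspathRepro and the helper buildClasspath are hypothetical names introduced here, and the number of copies needed to trigger the overflow will vary with the JVM's thread stack size.

```java
public class ClasspathRepro {
    // Hypothetical helper: returns "java.class.path=/opt/files/any" followed by
    // `copies` repetitions of ":/opt/files/any".
    public static String buildClasspath(int copies) {
        StringBuilder sb = new StringBuilder("java.class.path=/opt/files/any");
        for (int i = 0; i < copies; i++) {
            sb.append(":/opt/files/any");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = buildClasspath(90);
        System.out.println(input.length() + " characters");
        // Pass `input` to parse(...) from above to try to trigger the overflow;
        // raise the copy count if the default stack size is larger on your JVM.
    }
}
```

Calling buildClasspath(3) yields exactly the four-entry string shown above.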

Is there a generic way that the above KV_REGEX string could be modified, so that the issue does not occur and the same results are produced?

I say generic deliberately, as opposed to hacks that (for instance) check for a maximum string length before parsing.

The grossest fix I could come up with, a true anti-pattern, is:

public void safeParse(String input)
{
    try
    {
        parse(input);
    }
    catch (StackOverflowError e) // Or even Throwable!
    {
        parse(input.substring(0, MAX_LENGTH));
    }
}

Funnily enough, it worked in the few runs I tried, but it is not tasteful enough to recommend. :-)


Solution

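  • Your regex looks overly complicated, and I think it misunderstands how character classes work: inside [...], ^ negates only when it is the first character, so the later ^ in [^=,^ ] is just a literal caret. The likely cause of the overflow is the repeated alternation group (?:...|...)*, which java.util.regex matches recursively, roughly one level of recursion per character consumed. Dropping that group and applying * directly to the character class works better for me; I can't make it overflow anymore: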

    public static void parse(String input) {
        String KV_REGEX = "(\"[^\" ]*\"|[^{=, ]*) *= *(\"[^\"]*\"|[^=,) }]*)";
        Pattern KV_PATTERN = Pattern.compile(KV_REGEX);
    
        Matcher matcher = KV_PATTERN.matcher(input);
    
        System.out.println("\nMatcher groups discovered:");
    
        while (matcher.find()) {
            System.out.println(matcher.group(1) + ", " + matcher.group(2));
        }
    }
    

    To break down the regex, it matches:

    (\"[^\" ]*\"|[^{=, ]*): the key, which is either a double-quoted string containing no spaces, or any number of characters other than {, =, comma, or space

     *= * (with a leading space): zero or more spaces, followed by =, followed by zero or more spaces

    (\"[^\"]*\"|[^=,) }]*): the value, which is either anything enclosed in double quotes, or any number of characters other than =, comma, ), space, or }
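
To check the revised pattern against the inputs from the question, a small harness along these lines may help. It is a sketch, not part of the original answer: the class KvParseDemo and the helpers extractPairs and longClasspath are hypothetical names, and the pattern is compiled once as a constant rather than on each call. Note that, unlike the original regex, this one strips the braces from the first example, so the keys and values come out as name, CentOS and version, 2.6.x.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KvParseDemo {
    // Revised pattern from the answer above, compiled once.
    private static final Pattern KV_PATTERN = Pattern.compile(
            "(\"[^\" ]*\"|[^{=, ]*) *= *(\"[^\"]*\"|[^=,) }]*)");

    // Hypothetical helper: returns every match formatted as "key, value".
    public static List<String> extractPairs(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher matcher = KV_PATTERN.matcher(input);
        while (matcher.find()) {
            pairs.add(matcher.group(1) + ", " + matcher.group(2));
        }
        return pairs;
    }

    // Hypothetical helper: builds the repetitive classpath string from the question.
    public static String longClasspath(int copies) {
        StringBuilder sb = new StringBuilder("java.class.path=/opt/files/any");
        for (int i = 0; i < copies; i++) {
            sb.append(":/opt/files/any");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input2 = "2012-08-09 blah blah 09:12:38,462 Log for the main thread, "
                + "PID=5872, version=\"7.1.8.x\", build=1234567, other=done";
        for (String pair : extractPairs(input2)) {
            System.out.println(pair);
        }
        // A classpath far longer than the one that overflowed the original regex.
        System.out.println(extractPairs(longClasspath(500)).size() + " pair(s) found");
    }
}
```

Because each star now applies to a single character class instead of a group containing an alternation, the matcher can consume the long value in a loop rather than by recursion, which is why the input length should no longer matter for stack depth.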