I am using a regular expression to extract key-value pairs from arbitrarily long input strings and have run into a case in which, for a long string with repetitive patterns, it causes a stack overflow.
The KV-parsing code looks something like this:
public static void parse(String input)
{
    String KV_REGEX = "((?:\"[^\"^ ]*\"|[^=,^ ])*) *= *((?:\"[^\"]*\"|[^=,^\\)^ ])*)";
    Pattern KV_PATTERN = Pattern.compile(KV_REGEX);
    Matcher matcher = KV_PATTERN.matcher(input);
    System.out.println("\nMatcher groups discovered:");
    while (matcher.find())
    {
        System.out.println(matcher.group(1) + ", " + matcher.group(2));
    }
}
Some fictitious example inputs:
String input1 = "2012-08-09 09:10:25,521 INFO com.a.package.SomeClass - Everything working fine {name=CentOS, family=Linux, category=OS, version=2.6.x}";
String input2 = "2012-08-09 blah blah 09:12:38,462 Log for the main thread, PID=5872, version=\"7.1.8.x\", build=1234567, other=done";
Calling parse(input1) produces:
{name, CentOS
family, Linux
category, OS
version, 2.6.x}
Calling parse(input2) produces:
PID, 5872
version, "7.1.8.x"
build, 1234567
other, done
This is fine (even with a bit of string processing required for the first case). However, when trying to parse a very long classpath string (over 1,000 characters), the aforementioned stack overflow occurs, with the following exception (start of the trace only):
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$BitClass.isSatisfiedBy(Pattern.java:2927)
at java.util.regex.Pattern$8.isSatisfiedBy(Pattern.java:4783)
at java.util.regex.Pattern$8.isSatisfiedBy(Pattern.java:4783)
at java.util.regex.Pattern$8.isSatisfiedBy(Pattern.java:4783)
at java.util.regex.Pattern$8.isSatisfiedBy(Pattern.java:4783)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3345)
...
The string is too long to put here, but it has the following, easily reproducible and repetitive structure:
java.class.path=/opt/files/any:/opt/files/any:/opt/files/any:/opt/files/any
Anyone who wants to reproduce the issue just needs to append :/opt/files/any a few dozen times to the above string. Once the classpath string contains about 90 copies of ":/opt/files/any", the stack overflow occurs.
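For anyone who wants to reproduce this, here is a minimal sketch of building such a string. The names ClasspathRepro and buildClasspath are made up for illustration, and the exact number of copies at which the overflow appears will likely depend on the JVM's thread stack size.

```java
class ClasspathRepro {
    // Hypothetical helper: returns "java.class.path=/opt/files/any" followed by
    // `copies` repetitions of ":/opt/files/any".
    static String buildClasspath(int copies) {
        StringBuilder sb = new StringBuilder("java.class.path=/opt/files/any");
        for (int i = 0; i < copies; i++) {
            sb.append(":/opt/files/any");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // About 90 copies was enough in my case; adjust as needed.
        String input = buildClasspath(90);
        System.out.println(input.length() + " characters");
        // Pass `input` to parse(...) above to try to trigger the overflow.
    }
}
```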
Is there a generic way that the above KV_REGEX string could be modified so that the issue does not occur and the same results are produced?
I explicitly put generic above, as opposed to hacks that (for instance) check for a maximum string length before parsing.
The most gross fix I could come up with, a true anti-pattern, is
public void safeParse(String input)
{
    try
    {
        parse(input);
    }
    catch (StackOverflowError e) // Or even Throwable!
    {
        parse(input.substring(0, MAX_LENGTH));
    }
}
Funnily enough, it worked in the few runs I tried, but it is not tasteful enough to recommend. :-)
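Your regex looks overly complicated; for example, I don't think you have quite understood how character classes work. Inside a class, ^ only negates when it is the first character, so in [^=,^ ] the second ^ is just a literal caret that gets excluded. The likely culprit for the overflow is the repeated alternation group (?:...|...)*: as far as I know, Java's regex engine matches that kind of construct recursively, so every repetition costs stack frames. Applying * directly to a character class avoids that. This works better for me; I can't make it overflow anymore: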
public static void parse(String input) {
    String KV_REGEX = "(\"[^\" ]*\"|[^{=, ]*) *= *(\"[^\"]*\"|[^=,) }]*)";
    Pattern KV_PATTERN = Pattern.compile(KV_REGEX);
    Matcher matcher = KV_PATTERN.matcher(input);
    System.out.println("\nMatcher groups discovered:");
    while (matcher.find()) {
        System.out.println(matcher.group(1) + ", " + matcher.group(2));
    }
}
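On the character-class point, a small sketch may make the difference concrete. It compares the class from the original key pattern with the same class minus the stray ^. The class and helper names (CaretInClassDemo, matchesWhole) are only for illustration.

```java
import java.util.regex.Pattern;

class CaretInClassDemo {
    // Class taken from the original key pattern: only the leading ^ negates,
    // the second ^ is a literal caret that ends up excluded.
    static final Pattern ORIGINAL_CLASS = Pattern.compile("[^=,^ ]*");
    // Same class with the stray ^ removed.
    static final Pattern CLEANED_CLASS = Pattern.compile("[^=, ]*");

    // True if the whole string is matched by the pattern.
    static boolean matchesWhole(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(ORIGINAL_CLASS, "a^b")); // false: ^ is excluded
        System.out.println(matchesWhole(CLEANED_CLASS, "a^b"));  // true
        System.out.println(matchesWhole(ORIGINAL_CLASS, "abc")); // true
    }
}
```

So the extra ^ characters in the original do not break anything for typical input, they just quietly forbid carets in keys and values.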
To break down the regex, it matches:
(\"[^\" ]*\"|[^{=, ]*) : either a run of characters enclosed in "s (with no spaces inside), or any number of characters other than {, =, comma and space
 *= * : zero or more spaces, followed by =, followed by zero or more spaces
(\"[^\"]*\"|[^=,) }]*) : either anything enclosed in "s, or any number of characters other than =, comma, ), space and }
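To check this against the inputs from the question, here is a rough sketch that uses the same regex but collects the pairs into a list instead of printing them. KvParseDemo and parsePairs are names I made up for the example, and the 2000 copies is an arbitrary number well past the roughly 90 that triggered the overflow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KvParseDemo {
    // Same regex as in parse(...) above.
    static final Pattern KV_PATTERN =
        Pattern.compile("(\"[^\" ]*\"|[^{=, ]*) *= *(\"[^\"]*\"|[^=,) }]*)");

    // Returns each match as "key, value", mirroring what parse(...) prints.
    static List<String> parsePairs(String input) {
        List<String> pairs = new ArrayList<>();
        Matcher matcher = KV_PATTERN.matcher(input);
        while (matcher.find()) {
            pairs.add(matcher.group(1) + ", " + matcher.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input2 = "2012-08-09 blah blah 09:12:38,462 Log for the main thread, "
            + "PID=5872, version=\"7.1.8.x\", build=1234567, other=done";
        System.out.println(parsePairs(input2));

        // Long classpath-style input, built as described in the question.
        StringBuilder sb = new StringBuilder("java.class.path=/opt/files/any");
        for (int i = 0; i < 2000; i++) {
            sb.append(":/opt/files/any");
        }
        List<String> pairs = parsePairs(sb.toString());
        System.out.println(pairs.size() + " pair(s), first has "
            + pairs.get(0).length() + " characters");
    }
}
```

One difference from the original output: because { and } are now excluded, the first example yields name and 2.6.x rather than {name and 2.6.x}, so the extra string processing is no longer needed.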