Search code examples
javaregexperformancenon-greedy

Java, poor regex performance with lazy expressions


The code is actually in Scala (Spark/Scala) but the library scala.util.matching.Regex, as per the documentation, delegates to java.util.regex.

The code, essentially, reads a bunch of regex from a config file and then matches them against logs fed to the Spark/Scala app. Everything worked fine until I added a regex to extract strings separated by tabs where the tab has been flattened to "#011" (by rsyslog). Since the strings can have white-spaces, my regex looks like: (.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)#011(.+?)

The moment I add this regex to the list, the app takes forever to finish processing logs. To give you an idea of the magnitude of delay, a typical batch of a million lines takes less than 5 seconds to match/extract on my Spark cluster. If I add the expression above, a batch takes an hour!

In my code, I have tried a couple of ways to match regex:

  1. if ( (regex findFirstIn log).nonEmpty ) { do something }

  2. val allGroups = regex.findAllIn(log).matchData.toList if (allGroups.nonEmpty) { do something }

  3. if (regex.pattern.matcher(log).matches()){do something}

All three suffer from poor performance when the regex mentioned above it added to the list of regex. Any suggestions to improve regex performance or change the regex itself?

The Q/A that's marked as duplicate has a link that I find hard to follow. It might be easier to follow the text if the referenced software, regexbuddy, was free or at least worked on Mac.

I tried negative lookahead but I can't figure out how to negate a string. Instead of /(.+?)#011/, something like /([^#011]+)/ but that just says negate "#" or "0" or "1". How do I negate "#011"? Even after that, I am not sure if negation will fix my performance issue.


Solution

  • The simplest way would be to split on #011. If you want a regex, you can indeed negate the string, but that's complicated. I'd go for an atomic group

    (?>(.+?)#011)
    

    Once matched, there's no more backtracking. Done and looking forward for the next group.

    Negating a string

    The complement of #011 is anything not starting with a #, or starting with a # and not followed by a 0, or starting with the two and not followed... you know. I added some blanks for readability:

     ((?: [^#] | #[^0] | #0[^1] | #01[^1] )+) #011
    

    Pretty terrible, isn't it? Unlike your original expression it matches newlines (you weren't specific about them).

    An alternative is to use negative lookahead: (?!#011) matches iff the following chars are not #011, but doesn't eat anything, so we use a . to eat a single char:

     ((?: (?!#011). )+)#011
    

    It's all pretty complicated and most probably less performant than simply using the atomic group.

    Optimizations

    Out of my above regexes, the first one is best. However, as Casimir et Hippolyte wrote, there's a room for improvements (factor 1.8)

    ( [^#]*+ (?: #(?!011) [^#]* )*+ ) #011
    

    It's not as complicated as it looks. First match any number (including zero) of non-# atomically (the trailing +). Then match a # not followed by 011 and again any number of non-#. Repeat the last sentence any number of times.

    A small problem with it is that it matches an empty sequence as well and I can't see an easy way to fix it.