Search code examples
javaregexperformancecpu-usage

Optimizing CPU Usage in Java Regex Matching


I encountered a performance issue related to regex matching in my Java project, which resulted in high CPU usage. Despite my attempts at regex optimization, I'm still experiencing performance problems, particularly when multiple threads concurrently invoke the code.

For instance, when calling a method that performs regex matching concurrently 100 times, the CPU usage spikes to 90% for a brief period.

Here's a simplified example of my code:

String BUY_PATTERN =".*\\b(purchase)\\b.*";

private static boolean isMatchPattern(String pattern, String text) {
     return text.matches(BUY_PATTERN);
}

I would like to reduce the CPU usage during regex matching. Can you provide suggestions for more efficient regex patterns that achieve the same functionality?

Additionally, I came across an article (link provided) discussing the impact of backtracking on performance, but I find it challenging to rewrite the regex to minimize backtracking.

Thank you for your assistance!


Solution

  • There are a few changes you can make.

    You can greatly reduce the CPU consumption by creating re-usable objects.

    public class Example {
        Pattern pattern = Pattern.compile("\\bpurchase\\b");
        Matcher matcher;
    
        private boolean isMatchPattern(String text) {
            matcher = pattern.matcher(text);
            return matcher.find();
        }
    }
    

    Behind the scenes, upon each call of String.matches, a new Pattern and Matcher object is created.

    To combat this, you can create Pattern and Matcher fields within your class.
    Then, access these fields from within your isMatchPattern method.

    Furthermore, for the regular expression pattern, there is no need to capture the text "purchase", so you can remove the parentheses.

    Additionally, the context of a Pattern pattern is non-conforming; it's expected to be anywhere within the text.
    As opposed to a String.matches call, which requires the entire parameter to match.
    So, you don't need the starting and ending .*, as they are redundant.

    In regard to using an String.indexOf, or String.contains.

    If you require the word-boundary check, then this is somewhat out of the question in terms of an idiomatic approach, as you'd have to make more than one call.

    If you don't require the check, then this would be the way to go.

    As a final solution, you can create a character array for-loop, which is more or less what the Matcher class does.