I need find the last index of set of characters in a string. Consider the set of characters be x,y,z and string as Vereador Luiz Pauly Home then I need index as 18.
So for finding the index I have created a pattern with DOTALL flag and greedy quantifier as (?s).*(x|y|z). When the pattern is applied to that string(multiline), I can find out index from the start group. The code:
int findIndex(String str){
int index = -1;
Pattern p = Pattern.compile("(?s).*(x|y|z)");
Matcher m = regex.matcher(str);
if(m.find()){
index = m.start(1);
}
return index;
}
As expected it is returning the values correctly, if there is match.
But if there is no match, then it takes too long time (17 minutes for 600000 characters) as it is a Greedy match.
I tried with other quantifiers, but can't get the desired output. So can anyone refer any better regex?
PS: I can also think about traversing the content from last and finding the index.But I hope there is some better way in regex which can do the job quickly.
There are few ways to solve the problem and the best way will depend on the size of the input and the complexity of the pattern:
Reverse the input string and possibly the pattern, this might work for non-complex patterns. Unfortunately java.util.regex
doesn't allow to to match the pattern from right to left.
Instead of using a greedy quantifier simply match the pattern and loop Matcher.find()
until last occurrence is found.
Use a different regex engine with better performance e.g. RE2/J: linear time regular expression matching in Java.
If option 2 is not efficient enough for your case I'd suggest to try RE2/J:
Java's standard regular expression package, java.util.regex, and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as
a|b
, the engine will try to match subpatterna
first, and if that yields no match, it will reset the input stream and try to matchb
instead.If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.
In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.