Given this string
xxv jkxxxxxxxxxxxxxxx xxyu xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxAp oSxx
xxAp oSxxxxxxxxxxxxxxxxxxxxxj xxxxxxxxxuixxxxxxxxxxx axxxxxxxxxxxxxxxxxx
and this regex
^[^\r\n]*Ap oS[^\r\n]*
I am looking to match any line that contains Ap oS anywhere, as shown here, and it does that.
Now, by looking at the debugger one can see that the first match took 16 steps and the second 80, because of backtracking, if I understand correctly.
My question is, how can this regex be written to lower the number of steps?
I thought of replacing the first [^\r\n]* with (?!Ap oS)* to match everything that is not Ap oS, until it finds Ap oS, but I am not sure if I am getting the concept or the syntax wrong, or both.
Any help is appreciated
You can apply the "unrolling the loop" technique in a simpler and more efficient way with one of these patterns:
^(?:[^A\r\n]*A)+?p oS.*
(Note that PCRE automatically makes the quantifier * possessive in [^A\r\n]*A, since A follows a repeated character class that excludes it. In other words, replacing the non-capturing group around this subpattern with an atomic group, or explicitly making the quantifier possessive, is unnecessary.)
or, if you want the literal part to appear in full:
^[^A\r\n]*+(?>A[^A\r\n]*)*?Ap oS.*
You don't need a lookahead, since that is what the reluctant quantifier does here (it tests the next subpattern after each group iteration).
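As a rough sanity check, the first unrolled pattern can be compared against the original one to confirm that both select the same lines. The sketch below uses Python's `re` module, which is not PCRE, so it says nothing about step counts or auto-possessification; it only checks that the matches agree. The third sample line and the assumption that lines are separated by "\n" are mine, not part of the question.

```python
import re

# Sample text from the question, joined with "\n" (an assumption), plus one
# extra line that contains "A" but not the literal "Ap oS" (added here).
text = (
    "xxv jkxxxxxxxxxxxxxxx xxyu xxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxAp oSxx\n"
    "xxAp oSxxxxxxxxxxxxxxxxxxxxxj xxxxxxxxxuixxxxxxxxxxx axxxxxxxxxxxxxxxxxx\n"
    "xxAxx no literal here\n"
)

# Original pattern from the question.
original = re.compile(r"^[^\r\n]*Ap oS[^\r\n]*", re.MULTILINE)

# First unrolled pattern: consume up to each "A", lazily repeating the group
# until "p oS" follows.
unrolled = re.compile(r"^(?:[^A\r\n]*A)+?p oS.*", re.MULTILINE)

original_matches = original.findall(text)
unrolled_matches = unrolled.findall(text)

print(original_matches == unrolled_matches)  # True
print(len(unrolled_matches))                 # 2
```

The second pattern uses a possessive quantifier and an atomic group, which Python's `re` only accepts from version 3.11 onward, so it is left out of this sketch.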
Note that since you are looking for lines containing a literal string, and if your data comes from a file, it may be more worthwhile in many programming languages to simply read the file line by line and filter the lines with a basic string function.
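A minimal sketch of that line-by-line approach in Python, assuming the data can be iterated as text lines. The helper name `lines_containing` and the in-memory sample are hypothetical; a real file object from `open(...)` could be passed in the same way.

```python
import io


def lines_containing(lines, needle="Ap oS"):
    """Yield each line, without its line ending, that contains needle."""
    for line in lines:
        stripped = line.rstrip("\r\n")
        if needle in stripped:  # plain substring test, no regex involved
            yield stripped


# In-memory stand-in for a file (illustrative data only).
sample = io.StringIO("xxv jkxxx Ap oSxx\nno match here\nxxAp oSxxx\n")
matching = list(lines_containing(sample))

print(matching)  # ['xxv jkxxx Ap oSxx', 'xxAp oSxxx']
```

Because the function is a generator, it reads one line at a time and never needs the whole file in memory.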
Depending on the default newline sequence in PCRE and the one used in your string, .* at the end of a pattern can match the carriage return. To avoid this behavior, you can explicitly set the newline sequence by starting your pattern with (*CRLF).
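The carriage-return effect can be illustrated in Python, whose `re` module always treats only "\n" as the line terminator for the dot and has no equivalent of the (*CRLF) verb. This is therefore only an analogy for a PCRE build whose newline convention is LF while the data uses CRLF; the sample string is made up for the demonstration.

```python
import re

# A line ending in CRLF, followed by another line (illustrative data).
line = "xxAp oSxx\r\nnext line"

# The dot stops at "\n" but happily consumes the "\r" before it.
dot_match = re.search(r"Ap oS.*", line).group()

# The explicit character class from the question excludes both characters.
class_match = re.search(r"Ap oS[^\r\n]*", line).group()

print(repr(dot_match))    # 'Ap oSxx\r'
print(repr(class_match))  # 'Ap oSxx'
```

In an engine without a newline-setting verb, keeping [^\r\n]* at the end of the pattern is one way to avoid the stray carriage return.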
Reducing the number of steps is one way to make a pattern more efficient, but take care not to build an overly long or complicated pattern solely for this purpose, since that can also be counterproductive.