regex, preg-match

Why is regex being lazy instead of greedy in this case?


This is a snippet of a complex regex:

/\x87([\xA6-\xBf]|\xA6\xF0\x9F)/x

Why does it stop and return \x87\xA6 instead of \x87\xA6\xF0\x9F when matching against a string containing \x87\xA6\xF0\x9F?

I thought regex was greedy by default and would try to consume the longest pattern?

Or is that only for the * and + operators?

Is there any way I can force it to look for the longest pattern? Using word boundaries is not an option in this case unfortunately.


ETA: Apparently it works as desired if I move the shorter pattern to the end:

/\x87(\xA6\xF0\x9F|[\xA6-\xBf])/x

Is it really that simple, and is regex sensitive to the order of the alternatives?


Solution

  • I thought regex was greedy by default and would try to consume the longest pattern?

    "Greediness" refers to the preference of the quantifiers (?, *, +, etc.) for repeating more times rather than fewer. That's not exactly the same as consuming the longest substring, though of course it usually works out that way.

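    The alternation operator | also has a preference: alternatives are tried left to right, and the first one that lets the overall match succeed wins, even if a later alternative would have consumed more. In your pattern, [\xA6-\xBf] matches \xA6 on its own, so the longer \xA6\xF0\x9F is never tried. So yes, the order of the alternatives matters. You can fix your pattern by putting the longer alternative first: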

    /\x87(\xa6\xF0\x9F|[\xa6-\xbf])/x
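To see the ordering effect in isolation, here is a minimal sketch. It uses Python's re module on a byte string rather than preg_match, on the assumption that both are backtracking engines that try alternatives left to right. The variable names are made up for the illustration, and the /x flag is dropped because the pattern contains no whitespace or comments.

```python
import re

# Byte string containing the sequence from the question.
subject = b"\x87\xa6\xf0\x9f"

# Original order: the character class is tried first and succeeds on \xa6,
# so the longer alternative is never attempted.
short_first = re.compile(rb"\x87([\xa6-\xbf]|\xa6\xf0\x9f)")

# Reordered: the longer literal sequence is tried first; the character
# class remains as a fallback for the single-byte case.
long_first = re.compile(rb"\x87(\xa6\xf0\x9f|[\xa6-\xbf])")

print(short_first.match(subject).group(0))      # b'\x87\xa6'
print(long_first.match(subject).group(0))       # b'\x87\xa6\xf0\x9f'
print(long_first.match(b"\x87\xb0").group(0))   # b'\x87\xb0' (fallback still works)
```

The last line shows that reordering does not lose the shorter case: when the longer alternative fails, the engine backtracks and tries the character class.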