
How does the regular expression ‘(?<=#)[^#]+(?=#)’ work?


I have the following regex in a C# program, and I am having difficulty understanding it:

(?<=#)[^#]+(?=#)

I'll break it down to what I think I understood:

(?<=#)    a group, matching a hash. what's `?<=`?
[^#]+     one or more non-hashes (used to achieve non-greediness)
(?=#)     another group, matching a hash. what's the `?=`?

So the problem I have is with the ?<= and ?= parts. From reading MSDN, ?<name> is used for naming groups, but in this case the angle bracket is never closed.

I couldn't find ?= in the docs, and searching for it is really difficult, because search engines will mostly ignore those special chars.


Solution

  • They are called lookarounds; they let you assert whether a pattern matches at the current position without consuming any characters, so the asserted text does not become part of the match. There are 4 basic lookarounds:

    • Positive lookarounds: see if we CAN match the pattern...
      • (?=pattern) - ... to the right of the current position (look ahead)
      • (?<=pattern) - ... to the left of the current position (look behind)
    • Negative lookarounds: see if we can NOT match the pattern...
      • (?!pattern) - ... to the right of the current position
      • (?<!pattern) - ... to the left of the current position

    As an easy reminder, for a lookaround:

    • = is positive, ! is negative
    • < is look behind, otherwise it's look ahead
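
A minimal sketch of the four lookarounds listed above. It uses Python's re module rather than the C# of the question (an assumption made for illustration; the lookaround syntax shown here is the same in both), and the sample string and variable names are made up:

```python
import re

text = "a1 b2 c3"  # hypothetical sample input

# (?=pattern): positive lookahead, a letter that IS followed by 2
pos_ahead = re.findall(r"[a-z](?=2)", text)
# (?<=pattern): positive lookbehind, a digit that IS preceded by b
pos_behind = re.findall(r"(?<=b)\d", text)
# (?!pattern): negative lookahead, a letter that is NOT followed by 2
neg_ahead = re.findall(r"[a-z](?!2)", text)
# (?<!pattern): negative lookbehind, a digit that is NOT preceded by b
neg_behind = re.findall(r"(?<!b)\d", text)

print(pos_ahead)   # ['b']
print(pos_behind)  # ['2']
print(neg_ahead)   # ['a', 'c']
print(neg_behind)  # ['1', '3']
```

In every case only the letter or digit itself is returned; the character tested by the lookaround is never part of the match.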



    But why use lookarounds?

    One might argue that lookarounds in the pattern above aren't necessary, and #([^#]+)# will do the job just fine (extracting the string captured by \1 to get the non-#).

    Not quite. The difference is that a lookaround does not consume the # it checks, so the same # can be "used" again by the next attempt to find a match. Simplistically speaking, lookarounds allow "matches" to overlap.

    Consider the following input string:

    and #one# and #two# and #three#four#
    

    Now, #([a-z]+)# will give the following matches (as seen on rubular.com):

    and #one# and #two# and #three#four#
        \___/     \___/     \_____/
    

    Compare this with (?<=#)[a-z]+(?=#), which matches:

    and #one# and #two# and #three#four#
         \_/       \_/       \___/ \__/
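
The comparison can be reproduced with a short script. This is only a sketch in Python's re module (an assumption; the question itself is about C#), using the input string from above and illustrative variable names:

```python
import re

text = "and #one# and #two# and #three#four#"

# Consuming version: both # characters are part of each match,
# so the # after "three" is used up and "four" is missed.
consuming = re.findall(r"#([a-z]+)#", text)

# Lookaround version: the # characters are only asserted,
# so the # between "three" and "four" can serve both matches.
lookaround = re.findall(r"(?<=#)[a-z]+(?=#)", text)

# The pattern from the question uses [^#]+, which also accepts spaces,
# so the text between two adjacent "fields" is matched as well.
original = re.findall(r"(?<=#)[^#]+(?=#)", text)

print(consuming)   # ['one', 'two', 'three']
print(lookaround)  # ['one', 'two', 'three', 'four']
print(original)    # ['one', ' and ', 'two', ' and ', 'three', 'four']
```

The last line is a side effect of the same sharing: because no # is consumed, every stretch of non-# text that sits between two hashes qualifies, including " and ".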
    

    Unfortunately, at the time of writing this couldn't be demonstrated on rubular.com, since it didn't support lookbehind. However, it does support lookahead, so we can do something similar with #([a-z]+)(?=#), which matches (as seen on rubular.com):

    and #one# and #two# and #three#four#
        \__/      \__/      \____/\___/
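
The lookahead-only variant can be checked the same way. Again a sketch in Python's re module (an assumption, not the tool used in the answer), with illustrative variable names:

```python
import re

text = "and #one# and #two# and #three#four#"
pattern = re.compile(r"#([a-z]+)(?=#)")

# Full matches include the leading # (it is consumed),
# but not the trailing # (it is only asserted by the lookahead).
full_matches = [m.group(0) for m in pattern.finditer(text)]

# Group 1 holds just the word between the hashes.
words = [m.group(1) for m in pattern.finditer(text)]

print(full_matches)  # ['#one', '#two', '#three', '#four']
print(words)         # ['one', 'two', 'three', 'four']
```

Because the trailing # is left unconsumed, the # after "three" is still available as the leading # of "#four".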
    
