Tags: regex, regex-lookarounds, word-boundary

What is the equivalent regex expression \b written using ^ and $?


How can I rewrite my anchor to be more general and correct in all situations? I have understood that using \b as an anchor is not optimal because it is implementation-dependent.

My goal is to match some type of word in a text file. For my question, the word to match is not of importance.

Assume \b is the word boundary anchor and a word character is [a-zA-Z0-9_]. I constructed two anchors, one for the left side of the regex and one for the right. Notice how I handle the underscore: I don't want it to count as a word character when I read my text file.

  • (?<=\b|_) positive lookbehind
  • (?=\b|_) positive lookahead

What would the equivalent anchor constructs be, using the more general caret ^ and dollar sign $ to get the same effect?


Solution

  • [The OP did not specify which regex language they are using. This answer uses Perl's regex language, but the final solution should be easy to translate into other languages. Also, I use whitespace as if the x flag were provided, but that is also easily adjusted.]


    With the help of a comment made by the OP, the following is my understanding of the question:

    I have something like \b\w+\b, but I want to exclude _ from the definition of a word.

    You can use the following:

    (?<! [^\W_] ) [^\W_]+ (?! [^\W_] )
    

    An explanation follows.


    \b is equivalent to (?: (?<!\w)(?=\w) | (?<=\w)(?!\w) ).

    \b \w+ \b is therefore equivalent to (?<!\w) \w+ (?!\w) (after simplification).
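The two equivalences above can be checked empirically. The following is a minimal sketch in Python rather than Perl (an assumption made only so the example is easy to run; Python's re module also treats \b as a \w/\W transition). The sample text and the helper name positions are hypothetical.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "foo_bar baz-42"

WORD_BOUNDARY = re.compile(r"\b")

# The lookaround expansion of \b, written with re.VERBOSE so that
# whitespace is ignored, mirroring the x flag used in the answer.
EXPANDED = re.compile(r"""
    (?: (?<!\w) (?=\w)    # no word char on the left, word char on the right
      | (?<=\w) (?!\w)    # word char on the left, no word char on the right
    )
""", re.VERBOSE)


def positions(pattern, s):
    """Return the offsets at which a zero-width pattern matches."""
    return [m.start() for m in pattern.finditer(s)]


boundary_positions = positions(WORD_BOUNDARY, text)
expanded_positions = positions(EXPANDED, text)
print(boundary_positions == expanded_positions)

# The simplified form: \b\w+\b versus (?<!\w)\w+(?!\w).
with_b = re.findall(r"\b\w+\b", text)
without_b = re.findall(r"(?<!\w)\w+(?!\w)", text)
print(with_b == without_b)
```

Both comparisons should print True for this input. This is a spot check on one string, not a proof of equivalence.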

    So now we just need a pattern that matches everything \w matches except _. There are a few approaches that can be taken.

    • Set difference: (?[ \w - [_] ])
    • Look-ahead: (?!_)\w
    • Look-behind: \w(?<!_)
    • Double negation: [^\W_]

    Even though it's the least readable, I'm going to use the last one since it's the best supported.
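As a rough illustration, the sketch below compares three of the four approaches in Python. The set-difference syntax (?[ ... ]) is specific to Perl and is left out because Python's re module does not support it. The variable names are hypothetical, and only ASCII input is tested here.

```python
import re

# Every ASCII character, used as a hypothetical test input.
ascii_chars = "".join(chr(i) for i in range(128))

# Three ways of saying "a \w character that is not an underscore".
approaches = {
    "look-ahead": r"(?!_)\w",
    "look-behind": r"\w(?<!_)",
    "double negation": r"[^\W_]",
}

# Collect the characters each approach matches, as one string per approach.
matched = {
    name: "".join(re.findall(pattern, ascii_chars))
    for name, pattern in approaches.items()
}

for name, chars in matched.items():
    print(f"{name}: {chars}")
```

Each approach should match exactly the ASCII digits and letters, with the underscore excluded.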

    We now have

    (?<! [^\W_] ) [^\W_]+ (?! [^\W_] )
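A minimal usage sketch of the final pattern, translated to Python as an assumption (the answer itself targets Perl). re.VERBOSE plays the role of the x flag, so the pattern can keep its spacing. The sample text is hypothetical.

```python
import re

# The final pattern from the answer; re.VERBOSE ignores the whitespace.
WORD = re.compile(r"(?<! [^\W_] ) [^\W_]+ (?! [^\W_] )", re.VERBOSE)

# Hypothetical sample text containing underscores in several positions.
text = "foo_bar baz_42 _x_"

# With plain \b\w+\b the underscore counts as a word character.
default_words = re.findall(r"\b\w+\b", text)

# With the final pattern the underscore acts as a separator instead.
custom_words = WORD.findall(text)

print(default_words)  # ['foo_bar', 'baz_42', '_x_']
print(custom_words)   # ['foo', 'bar', 'baz', '42', 'x']
```

Note that Python's lookbehind must be fixed-width; [^\W_] is a single character, so the translation works unchanged.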