python regex regex-lookarounds regex-group

Match in whole string with punctuation (issues using \b)


Usually, to match complete words we use \b as a word delimiter, but when we are dealing with a compound word that includes punctuation, this method does not work well. For instance, consider the following string:

basic school co-operative limited

If we apply the following regex, we get co-operative and limited as expected. This works only because of the order of the alternatives (co-operative is tried before co):

\b(co-operative|co|co.|limited)\b

What happens if I do not have any control over the order of the regex alternatives and I get the following regex?

\b(co|co.|co-operative|limited)\b

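In this scenario, only co and limited would match instead of co-operative and limited, because \b also matches between the o and the hyphen, so the shorter alternative co succeeds first. Is there a way to get the whole-word match regardless of the order of the alternatives?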
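For reference, here is a minimal Python sketch that reproduces both behaviours. The variable names are only illustrative, and the patterns are copied as given above:

```python
import re

text = "basic school co-operative limited"

# Longest alternative first: "co-operative" is tried before "co".
ordered = re.compile(r"\b(co-operative|co|co.|limited)\b")

# Shorter alternatives first: "co" wins, because \b matches
# between the "o" and the "-" in "co-operative".
unordered = re.compile(r"\b(co|co.|co-operative|limited)\b")

ordered_matches = ordered.findall(text)
unordered_matches = unordered.findall(text)

print(ordered_matches)    # ['co-operative', 'limited']
print(unordered_matches)  # ['co', 'limited']
```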

Thanks for your priceless help


Solution

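  • Since you want to match complete words, you could change the \b assertion at the end of the regex to a positive lookahead for whitespace or the end of the string. A shorter alternative such as co then fails when it is followed by a hyphen, and the engine moves on to the next alternative, e.g.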

    \b(co|co.|co-operative|limited)(?=\s|$)
    


    If you want to allow certain punctuation after a word, you can add it to the lookahead, e.g.

    \b(co|co.|co-operative|limited)(?=[\s.]|$)
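
The two patterns above can be checked with a short Python sketch. The sample sentence with a trailing period and the variable names are assumptions added for illustration, not part of the question:

```python
import re

text = "basic school co-operative limited"
sentence = "basic school co-operative limited."  # hypothetical input with a trailing period

# Lookahead for whitespace or end of string instead of a trailing \b.
whole_word = re.compile(r"\b(co|co.|co-operative|limited)(?=\s|$)")

# Same idea, but a period is also accepted after the word.
with_period = re.compile(r"\b(co|co.|co-operative|limited)(?=[\s.]|$)")

print(whole_word.findall(text))       # ['co-operative', 'limited']
print(whole_word.findall(sentence))   # ['co-operative']
print(with_period.findall(sentence))  # ['co-operative', 'limited']
```

Note that the . in co. is unescaped in the pattern as given, so it matches any character (for example, it would also match cob before a space). If a literal period was intended, co\. would be the stricter form.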