javascript, regex, non-greedy

How to make \w token act non-greedy in this RegEx?


I have a text string which contains a repeating pattern, each repetition separated from the next by the . (dot) character. The pattern may end in _123 (an underscore followed by a sequence of digits), and I want to catch those digits in a dedicated capturing group.

The RegEx (ECMAScript) I have built mostly works:
https://regex101.com/r/iEzalU/1

/(label(:|\+))?(\w+)(?:_(\d+))?/gi

However, the (\w+) part acts greedy, and overtakes the (?:_(\d+))? part.

[Screenshot: regex with greedy behavior]

Adding a ? to make \w+ non-greedy (\w+?) stops it from swallowing the digits, but now almost every character matched by \w ends up as its own separate match.

[Screenshot: regex with non-greedy behavior]
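To make the two behaviors concrete, here is a small sketch that runs both patterns from the question. The input string `sample` is a made-up example (the original test string is only available behind the regex101 link), so treat the exact matches as illustrative only.

```javascript
// Hypothetical sample input: three dot-separated items, two of them ending in _digits.
const sample = "label:foo_12.bar.label+baz_7";

// Greedy \w+ : underscore and digits are word characters too, so \w+ consumes
// them and the optional (?:_(\d+))? is left with nothing to match.
const greedy = /(label(:|\+))?(\w+)(?:_(\d+))?/gi;
const greedyResults = [...sample.matchAll(greedy)];
const greedyMatches = greedyResults.map((m) => m[0]);
const greedyDigits = greedyResults.map((m) => m[4]);

// Lazy \w+? : nothing after it is required to match, so it stops after a
// single character and the string is chopped into many small matches.
const lazy = /(label(:|\+))?(\w+?)(?:_(\d+))?/gi;
const lazyMatches = [...sample.matchAll(lazy)].map((m) => m[0]);

console.log(greedyMatches); // [ 'label:foo_12', 'bar', 'label+baz_7' ]
console.log(greedyDigits);  // [ undefined, undefined, undefined ]
console.log(lazyMatches);   // [ 'label:f', 'o', 'o_12', 'b', 'a', 'r', 'label+b', 'a', 'z_7' ]
```

In the greedy run the digits group (group 4) is never filled. In the lazy run it is filled, but the name in front of it is split into one match per character.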

How can I make this regex such that \w+ acts greedy but still does not overtake the _(\d+) part?
Otherwise, is it possible to capture all tokens matched by the non-greedy \w+?, as a single match? (some capturing/non-capturing groups magic?)


Solution

  • When creating regular expressions, it is a good idea to think about your expected match boundaries.

    You know you need to match substrings inside a longer string, so end-of-string anchors such as $ (or \z, which ECMAScript does not support anyway) can be ruled out at once. Digits, letters and underscores are all word characters matched by \w, so you want to match everything up to a character that is not a word character (or, potentially, up to the end of the string).

    I suggest using

    (label[:+])?(\w+?)(?:_(\d+))?\b
    

    See the regex demo

    Details:

    • (label[:+])? - an optional Group 1: label and then a : or +
    • (\w+?) - Group 2: one or more word chars as few as possible
    • (?:_(\d+))? - an optional sequence of: _ and then one or more digits captured into Group 3
    • \b - a word boundary: the next char must be a non-word char, or the end of string must follow.
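As a rough illustration of the breakdown above, the sketch below applies the suggested pattern with the g and i flags from the question. The input string and the property names `label`, `name` and `digits` are assumptions made for this example, not part of the original post.

```javascript
// Suggested pattern: lazy \w+? followed by an optional _digits group, closed by \b.
// The trailing \b forces the lazy \w+? to keep expanding until a word boundary.
const itemRe = /(label[:+])?(\w+?)(?:_(\d+))?\b/gi;

// Hypothetical input: dot-separated items, some ending in _digits.
const input = "label:foo_12.bar.label+baz_7";

// Collect the three capturing groups of every match into plain objects.
const parsed = [...input.matchAll(itemRe)].map((m) => ({
  label: m[1],  // Group 1: "label:" or "label+", undefined when absent
  name: m[2],   // Group 2: the word characters before the optional _digits
  digits: m[3], // Group 3: the trailing digits, undefined when absent
}));

console.log(parsed);
// [
//   { label: 'label:', name: 'foo', digits: '12' },
//   { label: undefined, name: 'bar', digits: undefined },
//   { label: 'label+', name: 'baz', digits: '7' }
// ]
```

Because \w+? is lazy, the engine tries the optional _(\d+) group after each character it adds. The match only succeeds once \b is satisfied, so the digits land in Group 3 instead of being absorbed into Group 2.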