I have a text string which contains a repeating pattern, each repetition separated from the next by the . (dot) character. The pattern may end in _123 (an underscore followed by a sequence of digits), and I want to capture those digits in a dedicated capturing group.
The RegEx (ECMAScript) I have built mostly works:
https://regex101.com/r/iEzalU/1
/(label(:|\+))?(\w+)(?:_(\d+))?/gi
However, the (\w+) part is greedy and consumes the text that the (?:_(\d+))? part is meant to match.
Adding a ? to make \w+ non-greedy, (\w+?), lets the digits be captured, but now I get a separate match for each character matched by \w.
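To make the two behaviours concrete, here is a minimal reproduction. The input string is a made-up sample (the original question does not show one), and the variable names are purely illustrative.

```javascript
// Hypothetical sample input; not taken from the original question.
const sample = "label:foo_123.bar_7.baz";

// Original pattern: \w+ is greedy and swallows the trailing _123.
const greedy = /(label(:|\+))?(\w+)(?:_(\d+))?/gi;
// Lazy variant: \w+? stops after a single character because nothing
// after it forces the match to extend.
const lazy = /(label(:|\+))?(\w+?)(?:_(\d+))?/gi;

// Group 3 is the name, group 4 the trailing digits (undefined if absent).
const toParts = (m) => ({ name: m[3], digits: m[4] });

const greedyMatches = [...sample.matchAll(greedy)].map(toParts);
const lazyMatches = [...sample.matchAll(lazy)].map(toParts);

console.log(greedyMatches); // three matches, digits never captured
console.log(lazyMatches);   // many one-character matches
```

With this sample, the greedy pattern yields one match per segment but the digits group stays empty, while the lazy pattern captures the digits yet fragments each segment into single-character matches.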
How can I write this regex so that \w+ stays greedy but still does not consume the _(\d+) part?
Otherwise, is it possible to capture everything matched by the non-greedy \w+? as a single match? (Some capturing/non-capturing group magic?)
When creating regular expressions, it is a good idea to think about your expected match boundaries.
You know you need to match substrings in a longer string, so $ and \z can be excluded at once. Digits, letters, and underscores are all word characters matched by \w, so you want to match everything up to a character other than a word character (or, potentially, up to the end of the string).
I suggest using
(label[:+])?(\w+?)(?:_(\d+))?\b
See the regex demo
Details:
(label[:+])? - an optional Group 1: label and then a : or +
(\w+?) - Group 2: one or more word chars, as few as possible
(?:_(\d+))? - an optional sequence of: _ and then one or more digits captured into Group 3
\b - a word boundary: the next char must be a non-word char, or the end of the string must follow.
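
As a quick check of the suggested pattern, here is a small sketch that runs it with the g and i flags from the question. The input string and the names input, pattern and parts are assumptions made for illustration only.

```javascript
// Hypothetical sample input; not taken from the original question.
const input = "label:foo_123.bar_7.baz";

// Suggested pattern: the lazy \w+? is forced to extend until the optional
// _digits suffix and the trailing word boundary can both be satisfied.
const pattern = /(label[:+])?(\w+?)(?:_(\d+))?\b/gi;

// Group 1: optional label prefix, Group 2: name, Group 3: optional digits.
const parts = [...input.matchAll(pattern)].map((m) => ({
  label: m[1],
  name: m[2],
  digits: m[3],
}));

console.log(parts);
// name/digits per segment: foo/123, bar/7, baz/undefined
```

The trailing \b is what keeps the lazy quantifier from stopping after one character: the match can only end where a word character is followed by a non-word character or the end of the string.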