regex, tokenize, pcre

Tokenizing a string with a regular expression


Suppose I have a string like this: abc def ghi jkl (I put a space at the end for simplicity, but it doesn't really matter to me), and I want to capture its "chunks" as follows:

abc

def

ghi

jkl

if and only if there are 1-4 "chunks" in the string. I have already tried the following regex:

^([^ ]+ ){1,4}$

at Regex101.com, but it only captures the last chunk. The site issues this warning:

A repeated capturing group will only capture the last iteration. Put a capturing group around the repeated group to capture all iterations or use a non-capturing group instead if you're not interested in the data
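
The behaviour described in the warning can be reproduced outside Regex101. This is a minimal sketch, assuming Python's built-in re module as a stand-in for PCRE (for this particular pattern the two engines should behave the same way); the variable names are hypothetical:

```python
import re

# The pattern from the question: the capturing group sits inside a quantifier,
# so each iteration overwrites what the previous one captured.
pattern = re.compile(r"^([^ ]+ ){1,4}$")

m = pattern.match("abc def ghi jkl ")

# The whole string matches, but group 1 only holds the final iteration.
whole_match = m.group(0)
last_iteration = m.group(1)
print(repr(whole_match), repr(last_iteration))
```

The match succeeds as a whole, yet only the last chunk is available through the group, which is exactly the problem.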

How can I correct the regular expression to achieve my goal?


Solution

  • Since you have no access to the code, the only solution available is a regex that combines the \G anchor, which allows only consecutive matches, with a lookahead anchored at the start of the string that requires 1 to 4 non-whitespace chunks.

    (?:^(?=\s*\S+(?:\s+\S+){0,3}\s*$)|\G(?!^))\s*\K\S+
    


    Details:

    • (?:^(?=\s*\S+(?:\s+\S+){0,3}\s*$)|\G(?!^)) - a custom boundary that checks if:

      • ^(?=\s*\S+(?:\s+\S+){0,3}\s*$) - the start of the string (^), followed by 1 to 4 non-whitespace chunks separated by one or more whitespace characters; leading and trailing whitespace is allowed, too
      • | - or
      • \G(?!^) - the end of the previous successful match (\G also matches at the start of the string, so the negative lookahead (?!^) excludes that position, which the first alternative already checks)
    • \s* - zero or more whitespace characters

    • \K - the match reset operator, which discards all the text matched so far from the reported match
    • \S+ - 1 or more characters other than whitespace
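
    For comparison, here is what the same logic looks like when you can run code. This is a different technique from the single regex above (validate first, then extract), sketched with Python's built-in re module, which supports neither \G nor \K. The names VALID, CHUNK and tokenize are hypothetical, and the sketch assumes that "no chunks" should be reported as an empty list:

```python
import re

# Same lookahead as in the answer: the string must hold 1 to 4
# whitespace-separated chunks, with optional leading/trailing whitespace.
VALID = re.compile(r"^(?=\s*\S+(?:\s+\S+){0,3}\s*$)")

# A chunk is a run of non-whitespace characters.
CHUNK = re.compile(r"\S+")


def tokenize(text):
    """Return the chunks if text holds 1 to 4 of them, else an empty list."""
    if not VALID.match(text):
        return []
    return CHUNK.findall(text)


print(tokenize("abc def ghi jkl "))
print(tokenize("a b c d e"))
```

    The first call should yield the four chunks, and the second an empty list because the string holds five chunks. In the single-regex version, the ^(?=...) alternative plays the role of the VALID check and the \G(?!^) alternative lets matching continue chunk by chunk once that check has passed.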