Tags: r, regex, regex-lookarounds, regex-group, regex-greedy

RegEx for splitting a list of words with multiple capturing groups


I have the following string:

one two three four five six seven eight nine

I am trying to construct a regular expression that splits the string into three groups:

  1. Group 1: 'one two three'
  2. Group 2: 'four five six'
  3. Group 3: 'seven eight nine'

I have tried variations of (.*\b(one|two|three)?)(.*\b(four|five|six)?)(.*\b(seven|eight|nine)?), but this pattern does not split the string: the full match ends up in a single group that contains the entire string.

Trying (.*\b(one|two|three))(.*\b(four|five|six))(.*\b(seven|eight|nine)) seems to get me closer to what I want, but the match information panel shows that the pattern identifies two matches, each containing six capture groups.

I am using the OR operator (|) because the groups can be of any length. For example, applying the pattern to the string two three four should identify two groups:

  1. Group 1: 'two'
  2. Group 2: 'three four'.
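
The behaviour of the two attempts above can be reproduced outside the online tester. The sketch below uses Python's re module, which is an assumption (the question is tagged r). It shows that the greedy `.*` takes the whole string when the alternations are optional, and that the nested parentheses are what produce six capture groups. The variable names are illustrative only.

```python
import re

text = "one two three four five six seven eight nine"

# First attempt: every inner alternation is optional, so the greedy .* in
# group 1 can consume the whole string and nothing forces it to back off.
optional = re.compile(
    r"(.*\b(one|two|three)?)(.*\b(four|five|six)?)(.*\b(seven|eight|nine)?)"
)
first = optional.search(text)
print(repr(first.group(1)))  # the entire input ends up in group 1

# Second attempt: the alternations are required, so each .* backs off to the
# last word of its set. The inner parentheses also capture, which is why
# six groups are reported instead of three.
required = re.compile(
    r"(.*\b(one|two|three))(.*\b(four|five|six))(.*\b(seven|eight|nine))"
)
second = required.search(text)
print(second.groups())

# Making the inner groups non-capturing leaves only the three outer groups.
outer_only = re.compile(
    r"(.*\b(?:one|two|three))(.*\b(?:four|five|six))(.*\b(?:seven|eight|nine))"
)
third = outer_only.search(text)
print(third.groups())
```

With non-capturing inner groups, the second and third groups still begin with the space that the leading `.*` picked up, so this form is only a partial fix.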

Solution

  • A large regex that probably does it

    (?=.*\b(?:one|two|three|four|five|six|seven|eight|nine)\b)(\b(?:one|two|three)(?:\s+(?:one|two|three))*\b)?.+?(\b(?:four|five|six)(?:\s+(?:four|five|six))*\b)?.+?(\b(?:seven|eight|nine)(?:\s+(?:seven|eight|nine))*\b)?
    

    https://regex101.com/r/rUtkyU/1

    Readable version (free-spacing mode, e.g. the x flag, so that whitespace and # comments are ignored)

     (?=
          .* \b 
          (?:
               one
            |  two
            |  three
            |  four
            |  five
            |  six
            |  seven
            |  eight
            |  nine
          )
          \b 
     )
     (                             # (1 start)
          \b   
          (?: one | two | three )
    
          (?:
               \s+ 
               (?: one | two | three )
          )*
          \b 
     )?                            # (1 end)
    
     .+? 
     (                             # (2 start)
          \b        
          (?: four | five | six )
    
          (?:
               \s+ 
               (?: four | five | six )
          )*
          \b     
     )?                            # (2 end)
    
     .+?   
     (                             # (3 start)
          \b          
          (?: seven | eight | nine )
    
          (?:
               \s+ 
               (?: seven | eight | nine )
          )*
          \b   
     )?                            # (3 end)
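
As a check on the "probably", the pattern can be run against both inputs from the question. The sketch below again assumes Python's re module rather than R. On the nine-word string the pattern returns the three expected groups. On 'two three four' the `.+?` after the second group needs at least one more character, so 'four' is given up and only the first group is filled. The variant shown next is not the pattern from this answer: it is one possible adjustment in which each lazy skip is moved inside an optional non-capturing unit together with its group. The helper name build_variant and the WORD_SETS layout are hypothetical.

```python
import re

FULL = "one two three four five six seven eight nine"
SHORT = "two three four"

# The pattern from the answer, split across lines only for readability.
ORIGINAL = re.compile(
    r"(?=.*\b(?:one|two|three|four|five|six|seven|eight|nine)\b)"
    r"(\b(?:one|two|three)(?:\s+(?:one|two|three))*\b)?.+?"
    r"(\b(?:four|five|six)(?:\s+(?:four|five|six))*\b)?.+?"
    r"(\b(?:seven|eight|nine)(?:\s+(?:seven|eight|nine))*\b)?"
)

WORD_SETS = [
    ("one", "two", "three"),
    ("four", "five", "six"),
    ("seven", "eight", "nine"),
]


def build_variant(word_sets):
    """Hypothetical helper: one optional '(?:.*?(run))?' unit per word set."""
    parts = []
    for words in word_sets:
        alt = "(?:" + "|".join(re.escape(w) for w in words) + ")"
        # A run is one or more whitespace-separated words from the same set.
        run = rf"\b{alt}(?:\s+{alt})*\b"
        # The lazy skip lives inside the optional unit, so a missing set
        # does not force the pattern to consume extra characters.
        parts.append(rf"(?:.*?({run}))?")
    return re.compile("".join(parts))


VARIANT = build_variant(WORD_SETS)

for sample in (FULL, SHORT):
    print(sample)
    print("  original:", ORIGINAL.search(sample).groups())
    print("  variant: ", VARIANT.match(sample).groups())
```

Because every unit in the variant is optional, it matches any input, so the groups should be checked for None rather than relying on the match itself. Under this word-set reading, 'three' belongs to the first set, so 'two three four' yields 'two three' and 'four', which differs from the grouping listed in the question.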