Tags: python, regex, python-re

How to capture the rest of a sentence after one or two matching groups with regex?


I have two sentences that I'm working with, and I'd like to build specific capture groups based on the characters in a word. These are the two Spanish sentences:

  1. Yo quiero irme de viaje.
  2. Yo puedo caminar en la nieve.

The first capture group has to be one of the verbs, i.e. "quiero" or "puedo", which I match with ([PpDdQq].*o).
The second capture group has to be the word directly after the verb, if that word ends in "me"; I match it with (\w*me).
The last capture group has to contain all words and whitespace that follow the second capture group when the word directly after the verb ends in "-me", or all words and whitespace that follow the first capture group when it does not. I used (\w.+) for this, but it didn't work.

Could anybody help me figure out why? Thanks. Below is the full regex I tried:

([PpDdQq].*o) |(\w*me)|(\w.+)
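
For reference, here is a minimal sketch of how the pattern above behaves in Python's `re` module, assuming it is applied to the full first sentence with `re.finditer`. The variable names are only illustrative. Because the three groups are joined with `|`, each match uses exactly one alternative, so at most one group is filled per match. With this input the third alternative already succeeds at the first character and consumes the whole sentence.

```python
import re

# The pattern from the question, unchanged (note the literal space before the first "|").
ORIGINAL = re.compile(r"([PpDdQq].*o) |(\w*me)|(\w.+)")

sentence = "Yo quiero irme de viaje."

# Collect the groups of every match; only one alternative can take part in each match.
all_groups = [match.groups() for match in ORIGINAL.finditer(sentence)]

for groups in all_groups:
    print(groups)
```

To fill all three groups in a single match, the groups have to appear in sequence in the pattern rather than as alternatives.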


Solution

  • Use

    \b([PpDdQq]\w*o)(?:\s+(\w*me))?\b(.*)
    

    See regex proof.

    EXPLANATION

    --------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    --------------------------------------------------------------------------------
      (                        group and capture to \1:
    --------------------------------------------------------------------------------
        [PpDdQq]                 any character of: 'P', 'p', 'D', 'd',
                                 'Q', 'q'
    --------------------------------------------------------------------------------
        \w*                      word characters (a-z, A-Z, 0-9, _) (0 or
                                 more times (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
        o                        'o'
    --------------------------------------------------------------------------------
      )                        end of \1
    --------------------------------------------------------------------------------
      (?:                      group, but do not capture (optional
                               (matching the most amount possible)):
    --------------------------------------------------------------------------------
        \s+                      whitespace (\n, \r, \t, \f, and " ") (1
                                 or more times (matching the most amount
                                 possible))
    --------------------------------------------------------------------------------
        (                        group and capture to \2:
    --------------------------------------------------------------------------------
          \w*                      word characters (a-z, A-Z, 0-9, _) (0
                                   or more times (matching the most
                                   amount possible))
    --------------------------------------------------------------------------------
          me                       'me'
    --------------------------------------------------------------------------------
        )                        end of \2
    --------------------------------------------------------------------------------
      )?                       end of grouping
    --------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    --------------------------------------------------------------------------------
      (                        group and capture to \3:
    --------------------------------------------------------------------------------
        .*                       any character except \n (0 or more times
                                 (matching the most amount possible))
    --------------------------------------------------------------------------------
      )                        end of \3
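
As a rough illustration, the pattern above could be used from Python as sketched below. The helper name `split_sentence` and the decision to strip whitespace from the third group are assumptions made for this example, not part of the answer. The raw third group keeps the space that follows the verb or the "-me" word.

```python
import re

# Pattern from the answer above, unchanged.
PATTERN = re.compile(r"\b([PpDdQq]\w*o)(?:\s+(\w*me))?\b(.*)")


def split_sentence(sentence):
    """Return (verb, me_word, rest), or None if no verb is found.

    me_word is None when the word right after the verb does not end in "me".
    rest is stripped of surrounding whitespace (an assumption about the
    desired output; the raw group starts with a space).
    """
    match = PATTERN.search(sentence)
    if match is None:
        return None
    verb, me_word, rest = match.groups()
    return verb, me_word, rest.strip()


sentences = ["Yo quiero irme de viaje.", "Yo puedo caminar en la nieve."]
results = [split_sentence(s) for s in sentences]

for sentence, result in zip(sentences, results):
    print(sentence, "->", result)
```

One caveat: in Python 3, `\w` in a `str` pattern also matches accented letters such as "é" unless `re.ASCII` is passed, which is broader than the a-z, A-Z, 0-9, _ set listed in the explanation.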