python, regex, regex-group

Python Regex: non capturing group is captured


I came up with these two regex patterns

1.

\([0-9]\)\s+([^.!?]+[.!?])

2.

[.!?]\s+([A-Z].*?[.!?])

To match sentences in strings like these:

(1) A first sentence, that always follows a number in parantheses. This is my second sentence. This is my third sentence, (...) .

Thanks to your answers I managed to get the intro sentence after the number in parentheses. I also get the 2nd sentence with my 2nd regex.

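However, the third sentence is not captured, because the . that ends the second sentence was already consumed by the previous match and cannot serve as the leading delimiter of the next one. My goal is to get the start point of these sentences by two methods: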

  1. Getting the "intro" sentence by capturing the start after (1)
  2. Getting any other sentence by recognizing a dot, whitespace, and a capital letter after it.

How can I keep the match from failing for the 3rd and following sentences?
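
The behaviour described above can be reproduced with a short script. This is only a minimal sketch: the sample string is a simplified stand-in for the real text (an assumption), and the names pattern_2, sample and matches are chosen just for illustration.

```python
import re

# Pattern 2 from the question: the leading [.!?] is part of the match,
# so it consumes the punctuation that ends the previous sentence.
pattern_2 = r"[.!?]\s+([A-Z].*?[.!?])"

# Simplified sample text (an assumption made for illustration).
sample = "(1) A first sentence. This is my second sentence. This is my third sentence."

matches = re.findall(pattern_2, sample)
print(matches)  # ['This is my second sentence.']
```

The match for the second sentence ends on its closing dot, so the next search starts after that dot and finds no delimiter in front of the third sentence.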

Thanks for any help!


Solution

  • You could use a capturing group with a negated character class such as [^.!?]. If you want to match 1 or more digits instead of a single digit, you could use [0-9]+

    \([0-9]\)\s+([^.!?]+[.!?])
    
    • \([0-9]\) Match a single digit between parentheses
    • \s+ Match 1+ whitespace chars
    • ( Capture group 1
      • [^.!?]+[.!?] Match 1+ times any char other than ., ! or ?, then match one of those three chars.
    • ) Close group


    For example

    import re
    
    regex = r"\([0-9]\)\s+([^.!?]+[.!?])"
    test_str = "(1) This is my first sentence, it has to be captured. This is my second sentence."
    
    print(re.findall(regex, test_str))
    

    Output

    ['This is my first sentence, it has to be captured.']
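
The [0-9]+ variant mentioned at the top can be tried in the same way. This is a small sketch with a made-up input that uses a two-digit marker; the variable names are only illustrative.

```python
import re

# Original pattern: exactly one digit between the parentheses.
single_digit = r"\([0-9]\)\s+([^.!?]+[.!?])"
# Variant: one or more digits, so markers such as (12) also match.
multi_digit = r"\([0-9]+\)\s+([^.!?]+[.!?])"

# Hypothetical input with a two-digit marker.
test_str = "(12) This is my first sentence, it has to be captured. This is my second sentence."

single_result = re.findall(single_digit, test_str)
multi_result = re.findall(multi_digit, test_str)

print(single_result)  # []
print(multi_result)   # ['This is my first sentence, it has to be captured.']
```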
    

    If you want to match the other sentences as well and be able to differentiate between the first sentence and the others, you might use an alternation with another capturing group. Group 1 then holds the sentence after the number, and group 2 holds any later sentence:

    (?:\([0-9]\)\s+([^.!?]+[.!?])|([A-Z].*?\.)(?: |$))
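
One possible way to use this pattern is to iterate over the matches and check which group took part in each one. The sketch below is an illustration only: the helper name classify_sentences and the test string are assumptions, and the start index is recorded because the question asks for the start point of each sentence.

```python
import re

# Alternation from the answer: group 1 holds the intro sentence after "(n)",
# group 2 holds any later sentence that starts with a capital letter.
regex = r"(?:\([0-9]\)\s+([^.!?]+[.!?])|([A-Z].*?\.)(?: |$))"

test_str = (
    "(1) This is my first sentence, it has to be captured. "
    "This is my second sentence. This is my third sentence."
)

def classify_sentences(text):
    """Return (intro, others) as lists of (start_index, sentence) tuples."""
    intro, others = [], []
    for m in re.finditer(regex, text):
        if m.group(1) is not None:
            # First alternative matched: sentence right after the number.
            intro.append((m.start(1), m.group(1)))
        else:
            # Second alternative matched: any later sentence.
            others.append((m.start(2), m.group(2)))
    return intro, others

intro, others = classify_sentences(test_str)
print(intro)
print(others)
```

Note that the second alternative only ends a sentence at a . followed by a space or the end of the string, so later sentences ending in ! or ? are not picked up by group 2 as written.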
    
