python, regex, regex-greedy, python-re

Whitespace followed by brackets (non-lazy) in Python using regex


I am trying to do the following: for each string in a list, extract everything before the first occurrence (there may be more than one) of a whitespace followed by a round bracket "(".

I have tried the following:

re.findall(r"(.*)\s\(", line)

but it gives the wrong results for, e.g., the following string:

Carrollton (University of West Georgia)[2]*Dahlonega (North Georgia College & State University)[2]

Thanks in advance


Solution

  • To extract anything before the first occurrence of a whitespace char followed by a round bracket "(", you may use re.search (this method returns the first match only):

    re.search(r'^(.*?)\s\(', text, re.S).group(1)
    re.search(r'^\S*(?:\s(?!\()\S*)*', text).group()
    

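    Note that the second pattern, though longer, is typically more efficient since it follows the unroll-the-loop principle: it consumes whole runs of non-whitespace chars at once instead of expanding .*? one char at a time and retrying \s\( at every position.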

    Details

    • ^ - start of string
    • (.*?) - Group 1: any 0+ chars, as few as possible (with re.S, . also matches line breaks)
    • \s\( - a whitespace and ( char.
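
To see why the greedy pattern from the question fails, here is a minimal sketch that runs both the greedy and the lazy version on the question's sample string. The variable names (line, greedy, lazy) are only illustrative.

```python
import re

# The sample string from the question; it contains two " (" sequences.
line = ("Carrollton (University of West Georgia)[2]"
        "*Dahlonega (North Georgia College & State University)[2]")

# Greedy: .* first runs to the end of the string, then backtracks
# until \s\( fits, so the match ends at the LAST " (".
greedy = re.findall(r"(.*)\s\(", line)

# Lazy: .*? grows one char at a time and stops as soon as \s\( fits,
# so the match ends at the FIRST " (".
lazy = re.search(r"^(.*?)\s\(", line, re.S).group(1)

print(greedy)  # ['Carrollton (University of West Georgia)[2]*Dahlonega']
print(lazy)    # Carrollton
```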

    Or, better:

    • ^\S* - start of string and then 0+ non-whitespace chars
    • (?:\s(?!\()\S*)* - 0 or more occurrences of
      • \s(?!\() - a whitespace char not followed by (
      • \S* - 0+ non-whitespace chars
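
The two patterns agree whenever a whitespace + ( sequence is present, but they behave differently when it is absent, which matters if you call .group() directly on the result. The sketch below wraps each pattern in a small helper; the helper names (before_bracket_lazy, before_bracket_unrolled) and the 'Athens' input are hypothetical, chosen only for illustration.

```python
import re

LAZY = re.compile(r'^(.*?)\s\(', re.S)
UNROLLED = re.compile(r'^\S*(?:\s(?!\()\S*)*')

def before_bracket_lazy(text):
    """Return the text before the first whitespace + '(' or None if absent."""
    m = LAZY.search(text)
    # Without this check, .group(1) raises AttributeError when there is no match.
    return m.group(1) if m else None

def before_bracket_unrolled(text):
    """Return the text before the first whitespace + '(' or the whole text if absent."""
    # Every part of this pattern may match zero chars, so search() never returns None.
    return UNROLLED.search(text).group()

sample = 'Isla Vista (University of California, Santa Barbara)[2]'
print(before_bracket_lazy(sample))          # Isla Vista
print(before_bracket_unrolled(sample))      # Isla Vista
print(before_bracket_lazy('Athens'))        # None
print(before_bracket_unrolled('Athens'))    # Athens
```

Which fallback is preferable depends on whether a string without a bracket should count as "no result" or be kept as-is.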

    See Python demo:

    import re

    strs = [
        'Isla Vista (University of California, Santa Barbara)[2]',
        'Carrollton (University of West Georgia)[2]',
        'Dahlonega (North Georgia College & State University)[2]',
    ]
    # The pattern contains no dot, so the re.S flag is not needed here.
    rx = re.compile(r'^\S*(?:\s(?!\()\S*)*')
    for s in strs:
        m = rx.search(s)
        if m:
            print('{} => {}'.format(s, m.group()))
        else:
            # Kept as a safeguard; this pattern can match an empty string, so it always matches.
            print("{}: No match!".format(s))