python regex string unicode delimiter

Extract groups within parentheses when there is an optional substring delimiter


Given the 2 strings:

l = ['作主 (zuòzhǔ)', '座右铭 (座右銘, zuòyòumíng)']

The desired output is:

('作主', None, 'zuòzhǔ')
('座右铭', '座右銘', 'zuòyòumíng')

I've tried to extract the groups as follows, but I'm unable to split 座右銘, zuòyòumíng into 2 groups:

import re

l = ['作主 (zuòzhǔ)', '座右铭 (座右銘, zuòyòumíng)']
word = re.search(r'(.*)\s\((.*?)\)', l[0])

sim = word.group(1)
try:
    # The pattern defines only two groups, so group(3) raises IndexError
    pinyin = word.group(3)
    trad = word.group(2)
except IndexError:
    pinyin = word.group(2)
    trad = None

print(sim, trad, pinyin)

I could do this:

try:
    pinyin = word.group(3)
    trad = word.group(2)
except IndexError:
    # Only works when the ', ' delimiter is present
    trad, pinyin = word.group(2).split(', ')

But can the comma split be done within the regex?

I've also tried this, but it still captures the whole string within the .*?:

(.*)\s\((.*?[,][\s].*?)\)

Solution

  • You could use the following regex:

    (.*?) \((?:(.*?), )?(.*?)\)
    

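    The key difference from your pattern is the optional non-capturing group containing the part before the comma: (?:(.*?), )?. When the comma is absent, that group is skipped and the second capture group is None.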

    In [4]: re.search(r'(.*?) \((?:(.*?), )?(.*?)\)', '座右铭 (座右銘, zuòyòumíng)').groups()
    Out[4]: ('座右铭', '座右銘', 'zuòyòumíng')
    
    In [5]: re.search(r'(.*?) \((?:(.*?), )?(.*?)\)', '作主 (zuòzhǔ)').groups()
    Out[5]: ('作主', None, 'zuòzhǔ')
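
To apply the same pattern to the whole list from the question, something like the following minimal sketch might work. The names `PATTERN`, `parse_entry`, and `entries` are illustrative and do not appear in the original post, and returning None for a non-matching string is an assumption about how you would want to handle bad input.

```python
import re

# Same regex as above; the optional non-capturing group (?:(.*?), )?
# wraps the traditional form together with its ", " delimiter.
PATTERN = re.compile(r'(.*?) \((?:(.*?), )?(.*?)\)')


def parse_entry(entry):
    """Return (simplified, traditional_or_None, pinyin), or None if there is no match.

    parse_entry is a hypothetical helper name, not part of the original answer.
    """
    match = PATTERN.search(entry)
    return match.groups() if match else None


entries = ['作主 (zuòzhǔ)', '座右铭 (座右銘, zuòyòumíng)']
results = [parse_entry(e) for e in entries]

for result in results:
    print(result)
```

Precompiling the pattern avoids repeating the regex string for every item, and checking the match before calling groups() avoids an AttributeError on strings that lack the parenthesised part.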