Given the 2 strings:
l = ['作主 (zuòzhǔ)', '座右铭 (座右銘, zuòyòumíng)']
The desired output is:
('作主', None, 'zuòzhǔ')
('座右铭', '座右銘', 'zuòyòumíng')
I've tried to extract the groups like this, but I'm unable to split 座右銘, zuòyòumíng into 2 groups:
import re

l = ['作主 (zuòzhǔ)', '座右铭 (座右銘, zuòyòumíng)']
word = re.search(r'(.*)\s\((.*?)\)', l[0])
sim = word.group(1)
try:
    # The pattern has only two groups, so group(3) raises IndexError
    pinyin = word.group(3)
    trad = word.group(2)
except IndexError:
    pinyin = word.group(2)
    trad = None
print(sim, trad, pinyin)
I could do this:
try:
    pinyin = word.group(3)
    trad = word.group(2)
except IndexError:
    trad, pinyin = word.group(2).split(', ')
But can the comma split be done within the regex?
I've also tried this, but it still captures the whole string within the .*?:
(.*)\s\((.*?[,][\s].*?)\)
You could use the following regex:
(.*?) \((?:(.*?), )?(.*?)\)
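The only difference is the optional non-capturing group containing the part before the comma: (?:(.*?), )?. When the parentheses contain no comma, that group is skipped and the capture group inside it is None.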
In [4]: re.search(r'(.*?) \((?:(.*?), )?(.*?)\)', '座右铭 (座右銘, zuòyòumíng)').groups()
Out[4]: ('座右铭', '座右銘', 'zuòyòumíng')
In [5]: re.search(r'(.*?) \((?:(.*?), )?(.*?)\)', '作主 (zuòzhǔ)').groups()
Out[5]: ('作主', None, 'zuòzhǔ')
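As a minimal sketch of how the same pattern might be applied to the whole list from the question, it could be wrapped in a small helper. The names `ENTRY_RE` and `parse_entry` are hypothetical and only used here for illustration; the regex itself is the one shown above.

```python
import re

# Same pattern as above, compiled once. The optional non-capturing group
# (?:(.*?), )? leaves group 2 as None when there is no "trad, " part.
ENTRY_RE = re.compile(r'(.*?) \((?:(.*?), )?(.*?)\)')

def parse_entry(entry):
    """Return (simplified, traditional_or_None, pinyin), or None if no match."""
    match = ENTRY_RE.search(entry)
    return match.groups() if match else None

l = ['作主 (zuòzhǔ)', '座右铭 (座右銘, zuòyòumíng)']
for entry in l:
    print(parse_entry(entry))
```

Returning None for a non-matching string is just one possible choice; raising an exception would work equally well if every entry is expected to follow the format.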