I am working with transcripts and having trouble matching patterns in a non-greedy fashion. My pattern still grabs far too much, as if it were matching greedily.
A transcript looks like this:
>> John doe: Hello, I am John Doe.
>> Hello, I am Jane Doe.
>> Thank you for coming, we will start in two minutes.
>> Sam Smith: [no audio] Good morning, everyone.
To find the names of the speakers inside >> (WHATEVER NAME):, I wrote
import re
pattern = re.compile(r'>>(.*?):')
transcript = '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'
re.findall(pattern, transcript)
I expected 'John doe' and 'Sam Smith', but it is giving me 'John doe' and 'Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith'.
I am confused because .*? is non-greedy, which (I think) should grab only 'Sam Smith'. How should I fix the code so that it captures only whatever is inside >> (WHATEVER NAME):? Also, I am using Python 3.6.
Thanks!
Do you really need regex? You can split the transcript on the >> markers, keep only the pieces that contain a colon, and take the text before it as the name.
>>> [i.split(':')[0].strip() for i in transcript.split('>>') if ':' in i]
['John doe', 'Sam Smith']
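If you do want to stay with a regex, here is a minimal sketch of one way to do it. .*? is lazy only about where the match ends; the match still begins at the leftmost >> it can, so starting from >> Hello it keeps expanding across the later >> markers until it reaches the first colon. Replacing . with a negated character class that excludes : and > stops the match from leaving its own turn. The name speaker_pattern is purely illustrative, and the sketch assumes speaker names never contain : or >.

```python
import re

transcript = (
    ">> John doe: Hello, I am John Doe. "
    ">> Hello, I am Jane Doe. "
    ">> Thank you for coming, we will start in two minutes. "
    ">> Sam Smith: [no audio] Good morning, everyone."
)

# Original pattern: lazy, but '.' is still allowed to run past later '>>' markers,
# so the second match starts at '>> Hello' and ends at the colon after 'Sam Smith'.
lazy_matches = re.findall(r">>(.*?):", transcript)

# Assumed fix: [^:>] cannot cross a ':' or a '>', so a turn without a
# "name:" label simply fails to match and the search moves on.
speaker_pattern = re.compile(r">>\s*([^:>]+?)\s*:")
speakers = speaker_pattern.findall(transcript)

print(lazy_matches)
print(speakers)
```

Like the split version, this will mistake any text that precedes a colon for a name, for example a turn such as >> Note: we start late.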